apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0

Implement `TableProviderFactory` for an `IcebergTableFactory` #586

Closed matthewmturner closed 3 weeks ago

matthewmturner commented 2 months ago

I would like to be able to register iceberg tables with datafusion like so:

CREATE EXTERNAL TABLE my_table STORED AS ICEBERGTABLE LOCATION '/path/to/table';

If Iceberg provided a TableProviderFactory, then we could register and use it (this is how I currently register Delta Lake tables).
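To illustrate the pattern being requested, here is a simplified, std-only sketch of factory registration keyed by the `STORED AS` format name. The real DataFusion `TableProviderFactory` trait is async and returns an `Arc<dyn TableProvider>`; the trait, `TableHandle`, and `IcebergTableFactory` below are stand-ins, not the actual APIs.

```rust
use std::collections::HashMap;

// Stand-in for DataFusion's async `TableProviderFactory` trait.
trait TableProviderFactory {
    // Called when the engine sees `CREATE EXTERNAL TABLE ... STORED AS <format>`.
    fn create(&self, location: &str) -> Result<TableHandle, String>;
}

// Stand-in for the table provider a real factory would build.
#[derive(Debug, PartialEq)]
struct TableHandle {
    location: String,
}

// Hypothetical Iceberg factory: a real one would load the table metadata
// from `location` (e.g. via `StaticTable`) and wrap it in a provider.
struct IcebergTableFactory;

impl TableProviderFactory for IcebergTableFactory {
    fn create(&self, location: &str) -> Result<TableHandle, String> {
        if location.is_empty() {
            return Err("LOCATION must not be empty".to_string());
        }
        Ok(TableHandle { location: location.to_string() })
    }
}

// Registry keyed by the `STORED AS` format string, mirroring how
// DataFusion's session state maps format names to factories.
fn register_factories() -> HashMap<String, Box<dyn TableProviderFactory>> {
    let mut factories: HashMap<String, Box<dyn TableProviderFactory>> = HashMap::new();
    factories.insert("ICEBERGTABLE".to_string(), Box::new(IcebergTableFactory));
    factories
}
```

With this shape, the `CREATE EXTERNAL TABLE ... STORED AS ICEBERGTABLE` statement above would resolve the factory by format name and hand it the `LOCATION` string.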

liurenjie1024 commented 2 months ago

I think this is feasible because we already support loading a table from the file system directly: https://github.com/apache/iceberg-rust/blob/a1ec0fa98113fc02b2479d81759fccd9dab10378/crates/iceberg/src/table.rs#L241

We can also reuse the current table provider implementation.

yukkit commented 2 months ago

Please assign it to me

yukkit commented 2 months ago

Here are a few questions that need clarification:

  1. Is there a need to support specifying a version? @matthewmturner
  2. Where should the parameters for object storage, such as access-key-id and secret-access-key for S3, be obtained from? @liurenjie1024
    • StaticTable relies on FileIO.
    • For those using DataFusion, an AmazonS3 ObjectStore can be pre-constructed and registered in the RuntimeEnv. In the TableProviderFactory implementation, the ObjectStore can then be retrieved from the Session's RuntimeEnv based on the location. However, FileIO cannot use ObjectStore.

Proposed Solution

For the first question:

  1. By default, use the latest version.
  2. [Optional] Support specifying a version, for example: CREATE EXTERNAL TABLE my_table STORED AS ICEBERGTABLE LOCATION '/path/to/table' OPTIONS ('version' '1');

For the second question:

  1. Specify the relevant parameters in the OPTIONS block of the DDL statement.
  2. Refactor FileIO so that it also can rely on ObjectStore as its storage backend.
  3. Register the FileIO used to read the location via a global variable.
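Option 1 above amounts to routing storage-related keys from the DDL OPTIONS block to FileIO while keeping table-level keys (such as 'version') separate. A minimal std-only sketch of that split, assuming a hypothetical `s3.`-prefixed naming convention for storage keys (the actual property names FileIO expects may differ):

```rust
use std::collections::HashMap;

// Split a DDL OPTIONS map into storage properties (handed to FileIO)
// and the remaining table-level options (e.g. a 'version' key).
// The `s3.` prefix convention here is an assumption for illustration.
fn split_storage_options(
    options: &HashMap<String, String>,
) -> (HashMap<String, String>, HashMap<String, String>) {
    let (storage, rest): (HashMap<_, _>, HashMap<_, _>) = options
        .iter()
        .map(|(k, v)| (k.clone(), v.clone()))
        .partition(|(k, _)| k.starts_with("s3."));
    (storage, rest)
}
```

For example, `OPTIONS ('s3.access-key-id' 'AKIAEXAMPLE', 'version' '1')` would yield one map with the credential and one with the version.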
matthewmturner commented 2 months ago

@yukkit I am actually only familiar with the ObjectStore abstraction and not FileIO - I'll need to look into that more. Before knowing that, I had naively expected to leverage ObjectStore's auth capabilities.

Regarding the version - I'm less familiar with Iceberg, so I will defer to the Iceberg team's recommendation.

liurenjie1024 commented 2 months ago

Hi, @yukkit Thanks for your interest and welcome to contribute!

For the first question:

  1. By default, use the latest version.
  2. [Optional] Support specifying a version, for example: CREATE EXTERNAL TABLE my_table STORED AS ICEBERGTABLE LOCATION '/path/to/table' OPTIONS ('version' '1');

For this problem, how about we ask the user to point the path at a table metadata file directly? The path + version approach is unclear about what the version means. For example, if the user is trying to read a table managed by SQL/Hive catalogs, metadata file names are typically suffixed with UUIDs. Also, "version" may be confusing to users familiar with Iceberg, since there it typically means snapshot version.
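Under this suggestion, the factory can simply validate the LOCATION instead of interpreting a version option. Iceberg metadata file names end in `.metadata.json` (e.g. `v3.metadata.json`, or UUID-suffixed names like `00001-<uuid>.metadata.json` for catalog-managed tables), so a sketch of that check might look like:

```rust
// Iceberg table metadata files end in `.metadata.json`.
fn is_metadata_file(location: &str) -> bool {
    location.ends_with(".metadata.json")
}

// Reject a bare table path so the user is forced to pick an exact
// metadata file, sidestepping the ambiguous path + version scheme.
fn validate_location(location: &str) -> Result<(), String> {
    if is_metadata_file(location) {
        Ok(())
    } else {
        Err(format!(
            "expected LOCATION to point at a *.metadata.json file, got '{location}'"
        ))
    }
}
```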

For the second question:

  1. Specify the relevant parameters in the OPTIONS block of the DDL statement.
  2. Refactor FileIO so that it also can rely on ObjectStore as its storage backend.
  3. Register the FileIO used to read the location via a global variable.

I'm +1 for option 1. While 2 and 3 are options worth discussing for improving FileIO, this use case doesn't seem like a solid motivation for them. Also, we don't need to ask the user to provide credentials in the config block, since opendal's S3 operators can currently load them from the environment: https://docs.rs/opendal/latest/opendal/services/struct.S3Config.html#structfield.disable_config_load . cc @Xuanwo to confirm.
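The resolution order implied here - an explicit value from the DDL OPTIONS block wins, otherwise fall back to the environment (e.g. `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY`, which opendal's S3 backend reads unless config loading is disabled) - can be sketched as:

```rust
// Prefer an explicitly configured credential from OPTIONS; otherwise
// fall back to the named environment variable, mirroring opendal's
// env-based S3 credential loading. Sketch only, not the actual API.
fn resolve_credential(explicit: Option<&str>, env_key: &str) -> Option<String> {
    match explicit {
        Some(v) => Some(v.to_string()),
        None => std::env::var(env_key).ok(),
    }
}
```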

yukkit commented 2 months ago

@liurenjie1024 Thank you so much for your feedback! I agree with your points, and I will start working on it based on this approach.

mkarbo commented 3 weeks ago

Will this enable INSERT INTO as well for Iceberg?

yukkit commented 3 weeks ago

> Will this enable INSERT INTO as well for Iceberg?

@mkarbo This feature is implemented solely to enable reading Iceberg data as an external table in DataFusion.