apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0
592 stars 134 forks source link

Implement `TableProviderFactory` for a `IcebergTableFactory` #586

Open matthewmturner opened 2 weeks ago

matthewmturner commented 2 weeks ago

I would like to be able to register iceberg tables with datafusion like so:

CREATE EXTERNAL TABLE my_table STORED AS ICEBERGTABLE LOCATION '/path/to/table';

If Iceberg provided a TableProviderFactory then we could register and use that (This is how i currently register deltalake tables).

liurenjie1024 commented 2 weeks ago

I think this is feasible because we have supported loading table from file systems directly: https://github.com/apache/iceberg-rust/blob/a1ec0fa98113fc02b2479d81759fccd9dab10378/crates/iceberg/src/table.rs#L241

And we can also utilize current table provider implementation.

yukkit commented 2 weeks ago

Please assign it to me

yukkit commented 2 weeks ago

Here are a few questions that need clarification:

  1. Is there a need to support specifying a version? @matthewmturner
  2. Where should the parameters for object storage, such as access-key-id and secret-access-key for S3, be obtained from? @liurenjie1024
    • StaticTable relies on FileIO.
    • For those using DataFusion, they can pre-construct an AmazonS3 ObjectStore and register it in RuntimeEnv. In the implementation of TableProviderFactory, the ObjectStore can be retrieved from the RuntimeEnv of the Session based on the location. However, FileIO cannot use ObjectStore.

Proposed Solution

For the first question:

  1. By default, use the latest version.
  2. [Optional] Support specifying a version, for example: CREATE EXTERNAL TABLE my_table STORED AS ICEBERGTABLE LOCATION '/path/to/table' OPTIONS ('version' '1');

For the second question:

  1. Specify the relevant parameters in the OPTIONS block of the DDL statement.
  2. Refactor FileIO so that it also can rely on ObjectStore as its storage backend.
  3. Register the FileIO used to read the location via a global variable.
matthewmturner commented 2 weeks ago

@yukkit I am actually only familiar with the ObjectStore abstraction and not FileIO - ill need to look into that more. Prior to knowing that I had naively expected to leverage ObjectStore auth capabilities.

Regarding the version - im less familiar with iceberg so will defer to the iceberg teams recommendation.

liurenjie1024 commented 1 week ago

Hi, @yukkit Thanks for your interest and welcome to contribute!

For the first question:

  1. By default, use the latest version.
  2. [Optional] Support specifying a version, for example: CREATE EXTERNAL TABLE my_table STORED AS ICEBERGTABLE LOCATION '/path/to/table' OPTIONS ('version' '1');

As with this problem, how about we ask user to point the path to a table metadata file directly? The path + version approach is not clear about what this version means. For example, if the user is trying to reading a table managed by sql/hive catalogs, metadata file names are typically suffixed by uuids. Also version maybe confusing to users familiar with iceberg, since that typicall means snapshot version.

For the second question:

  1. Specify the relevant parameters in the OPTIONS block of the DDL statement.
  2. Refactor FileIO so that it also can rely on ObjectStore as its storage backend.
  3. Register the FileIO used to read the location via a global variable.

I'm +1 for option 1. While 2 and 3 are worth dicussiong options for improving FileIO, this use case doesn't seem a solid motivation. Also we don't need to ask user to provide credentials in config block, since currently opendal s3 operators could load for env: https://docs.rs/opendal/latest/opendal/services/struct.S3Config.html#structfield.disable_config_load . cc @Xuanwo to confirm.

yukkit commented 1 week ago

@liurenjie1024 Thank you so much for your feedback! I agree with your points, and I will start working on it based on this approach.