Closed matthewmturner closed 3 weeks ago
I think this is feasible because we have supported loading table from file systems directly: https://github.com/apache/iceberg-rust/blob/a1ec0fa98113fc02b2479d81759fccd9dab10378/crates/iceberg/src/table.rs#L241
And we can also utilize current table provider implementation.
Please assign it to me
Here are a few questions that need clarification:
access-key-id
and secret-access-key
for S3, be obtained from? @liurenjie1024
StaticTable
relies on FileIO
.AmazonS3 ObjectStore
and register it in RuntimeEnv
. In the implementation of TableProviderFactory
, the ObjectStore
can be retrieved from the RuntimeEnv
of the Session
based on the location
. However, FileIO
cannot use ObjectStore
.Proposed Solution
For the first question:
CREATE EXTERNAL TABLE my_table STORED AS ICEBERGTABLE LOCATION '/path/to/table' OPTIONS ('version' '1');
For the second question:
OPTIONS
block of the DDL statement.FileIO
so that it also can rely on ObjectStore
as its storage backend.@yukkit I am actually only familiar with the ObjectStore
abstraction and not FileIO
- ill need to look into that more. Prior to knowing that I had naively expected to leverage ObjectStore
auth capabilities.
Regarding the version - im less familiar with iceberg so will defer to the iceberg teams recommendation.
Hi, @yukkit Thanks for your interest and welcome to contribute!
For the first question:
- By default, use the latest version.
- [Optional] Support specifying a version, for example: CREATE EXTERNAL TABLE my_table STORED AS ICEBERGTABLE LOCATION '/path/to/table' OPTIONS ('version' '1');
As with this problem, how about we ask user to point the path to a table metadata file directly? The path + version approach is not clear about what this version means. For example, if the user is trying to reading a table managed by sql/hive catalogs, metadata file names are typically suffixed by uuids. Also version
maybe confusing to users familiar with iceberg, since that typicall means snapshot version.
For the second question:
- Specify the relevant parameters in the OPTIONS block of the DDL statement.
- Refactor FileIO so that it also can rely on ObjectStore as its storage backend.
- Register the FileIO used to read the location via a global variable.
I'm +1 for option 1. While 2 and 3 are worth dicussiong options for improving FileIO, this use case doesn't seem a solid motivation. Also we don't need to ask user to provide credentials in config block, since currently opendal s3 operators could load for env: https://docs.rs/opendal/latest/opendal/services/struct.S3Config.html#structfield.disable_config_load . cc @Xuanwo to confirm.
@liurenjie1024 Thank you so much for your feedback! I agree with your points, and I will start working on it based on this approach.
Will this enable insert into
as well for iceberg?
Will this enable
insert into
as well for iceberg?
@mkarbo This feature is implemented solely to enable reading Iceberg data as an external table in DataFusion.
I would like to be able to register iceberg tables with datafusion like so:
If Iceberg provided a
TableProviderFactory
then we could register and use that (This is how i currently register deltalake tables).