apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Accessing S3 Express one zone bucket from pyiceberg #928

Open munip opened 2 months ago

munip commented 2 months ago

Question

I have been able to access an S3 bucket with PyIceberg using SqlCatalog successfully with:

    from pyiceberg.catalog.sql import SqlCatalog

    catalog = SqlCatalog(
        "default",
        {
            "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
            "warehouse": "s3://myicebergbkt/test",
            "s3.access-key-id": "myid",
            "s3.secret-access-key": "mykey",
            "s3.session-token": "my-token",
            "s3.region": "us-east-1",
        },
    )

But when I try to access an S3 Express One Zone bucket the same way, I am stuck on the syntax. I have tried all of the options below with no luck:

    catalog = SqlCatalog(
        "default",
        {
            "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
            # I have also tried 730335207565:bucket/pyicebkt--use1-az4--x-s3
            # and just pyicebkt--use1-az4--x-s3, with no luck
            "warehouse": "s3://us-east-1:730335207565:bucket/pyicebkt--use1-az4--x-s3/test",
            "s3.access-key-id": "myid",
            "s3.secret-access-key": "mykey",
            "s3.session-token": "my-token",
            "s3.region": "us-east-1",
        },
    )

I get the error: "Expected an S3 object path of the form 'bucket/key...', got a URI: ". Is S3 Express One Zone supported? If so, what is the syntax for the warehouse variable?

kevinjqliu commented 2 months ago

I think the error might be coming from the underlying pyarrow.fs.S3FileSystem class, which is used to interact with S3: https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html I'm not sure whether it currently supports S3 Express One Zone.

According to this thread, PyArrow does not currently support it https://github.com/lancedb/lancedb/issues/1206
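
To make the three warehouse strings from the question easier to compare, here is a minimal standard-library sketch that splits each one into scheme, bucket, and key. It is not PyIceberg's or PyArrow's actual parsing code. The helper names `split_location` and `looks_like_directory_bucket` are hypothetical, the `s3://` prefix and `/test` suffix on the shorter variants are assumed, and the `--x-s3` suffix check relies on the AWS naming format for directory buckets.

```python
from urllib.parse import urlparse


def split_location(location: str) -> tuple[str, str, str]:
    """Split a warehouse location into (scheme, bucket, key).

    Hypothetical helper for illustration only; the real libraries
    may parse locations differently.
    """
    uri = urlparse(location)
    return uri.scheme, uri.netloc, uri.path.lstrip("/")


def looks_like_directory_bucket(bucket: str) -> bool:
    """Directory bucket names end with the '--x-s3' suffix."""
    return bucket.endswith("--x-s3")


# Variants from the question (prefix/suffix on the last two are assumed).
candidates = [
    "s3://us-east-1:730335207565:bucket/pyicebkt--use1-az4--x-s3/test",
    "s3://730335207565:bucket/pyicebkt--use1-az4--x-s3/test",
    "s3://pyicebkt--use1-az4--x-s3/test",
]

for location in candidates:
    scheme, bucket, key = split_location(location)
    print(f"bucket={bucket!r} key={key!r} "
          f"directory_bucket={looks_like_directory_bucket(bucket)}")
```

In the ARN-style strings the bucket component comes out with colons in it (for example `us-east-1:730335207565:bucket`) and the directory bucket name ends up in the key. Only the plain form yields `pyicebkt--use1-az4--x-s3` as the bucket, so that is the form to retry once the underlying filesystem supports S3 Express One Zone.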

muniatl commented 2 months ago

Thanks kevinjqliu. From the other thread, it doesn't look like PyArrow supports S3 Express One Zone. Does anyone know the timeline for Express One Zone support?

Fokko commented 2 months ago

@muniatl The best place to reach out would be the Arrow mailing list: https://lists.apache.org/list.html?dev@arrow.apache.org

kevinjqliu commented 2 months ago

The Arrow mailing list would be a good place to start.

PyIceberg depends on PyArrow to support S3 Express One Zone. I've found https://github.com/apache/arrow-rs/issues/5140, which adds support for the Arrow Rust library. It would be great to open an issue with PyArrow to track support for S3 Express One Zone.