djouallah closed this issue 2 weeks ago
+1 for sensible defaults.
in some scenarios, the warehouse location is set to base_warehouse_location/database_name/table_name
https://github.com/apache/iceberg-python/blob/583a7e97db28afa4259cdf504611845222338893/pyiceberg/catalog/__init__.py#L954
I wonder if we just missed using this in some situations.
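The linked code derives a per-table location from the warehouse root. A minimal sketch of that layout (the function name is illustrative, not the actual pyiceberg API; the `<warehouse>/<database>.db/<table>` shape matches the result noted later in this thread):

```python
def default_table_location(warehouse: str, database: str, table: str) -> str:
    """Sketch of deriving a per-table location from a warehouse root,
    as the linked pyiceberg code does. Layout: <warehouse>/<database>.db/<table>.
    """
    return f"{warehouse.rstrip('/')}/{database}.db/{table}"

print(default_table_location("/tmp/warehouse", "schema", "orders"))
```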
@djouallah how is your project setup currently?
In case it's relevant, there is an issue re: adopting a pluggable LocationProvider interface akin to what exists in iceberg java sdk. If it does turn out that we're currently handling the location inconsistently in pyiceberg, this could be a good opportunity to begin the process of putting this behind a single interface (pluggability and alternate implementations could come later).
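For illustration, a pluggable interface along those lines could look like the sketch below. This is hypothetical: pyiceberg does not ship a `LocationProvider` class today; the names are loosely modeled on the Iceberg Java SDK.

```python
from abc import ABC, abstractmethod

class LocationProvider(ABC):
    """Hypothetical interface, loosely modeled on Iceberg Java's
    LocationProvider. Not part of pyiceberg today."""

    @abstractmethod
    def new_table_location(self, database: str, table: str) -> str:
        """Return the storage location for a new table."""

class WarehouseLocationProvider(LocationProvider):
    """One possible default strategy: place each table under the warehouse root."""

    def __init__(self, warehouse: str) -> None:
        self.warehouse = warehouse.rstrip("/")

    def new_table_location(self, database: str, table: str) -> str:
        return f"{self.warehouse}/{database}.db/{table}"

provider = WarehouseLocationProvider("/tmp/warehouse")
print(provider.new_table_location("schema", "orders"))
```

Putting the logic behind a single interface like this would let alternate implementations (e.g. per-tenant prefixes) be swapped in later without touching each catalog.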
@kevinjqliu https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/iceberg/Iceberg_azure.ipynb
If I remove the location, I get an error. It seems to be a bug with the SQL catalog, as it works fine with Polaris. FWIW, this is how I am using the SqlCatalog:
catalog = SqlCatalog(
    "default",
    **{
        "uri": userdata.get("postgresql_db"),
        "adlfs.account-name": userdata.get("account_name"),
        "adlfs.account-key": userdata.get("AZURE_STORAGE_ACCOUNT_KEY"),
        "adlfs.tenant-id": userdata.get("azure_storage_tenant_id"),
        "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
        "legacy-current-snapshot-id": True,
    },
)
@djouallah I don't think it is obvious, but you can add the location to the catalog itself:
catalog = SqlCatalog(
    "default",
    **{
        "uri": userdata.get("postgresql_db"),
        "adlfs.account-name": userdata.get("account_name"),
        "adlfs.account-key": userdata.get("AZURE_STORAGE_ACCOUNT_KEY"),
        "adlfs.tenant-id": userdata.get("azure_storage_tenant_id"),
        "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
        "legacy-current-snapshot-id": True,
        "warehouse": "/tmp/some_location/",  # Result: /tmp/some_location/schema.db/table/
    },
)
Feature Request / Improvement
Currently, by default, pyiceberg writes all tables inside the same path, which is arguably a very bad idea: there is then no way to apply fine-grained security later. The current workaround is to explicitly set the path when creating a new table, which I don't think is a great idea either, as the user (IMHO) should not have to deal with storage paths.
It would be nice if pyiceberg had an option to automatically generate a path based on the table name: if I write schema.table, the path would be base_location/schema/table.
To be clear, I am not asking to change the Iceberg catalog spec, just a new user-friendly feature when using pyiceberg.
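The requested behavior can be sketched as a small helper that derives the path from the dotted identifier. This is purely illustrative (the function is not part of pyiceberg, and a real implementation would live inside the catalog):

```python
def location_from_identifier(base_location: str, identifier: str) -> str:
    """Illustrative helper for the requested behavior: given "schema.table",
    derive base_location/schema/table. Not part of pyiceberg."""
    parts = identifier.split(".")
    if len(parts) != 2:
        raise ValueError("expected an identifier of the form 'schema.table'")
    schema, table = parts
    return f"{base_location.rstrip('/')}/{schema}/{table}"

print(location_from_identifier("abfss://container@account.dfs.core.windows.net/base", "schema.table"))
```

With a layout like this, storage ACLs can be scoped per schema or per table directory, which addresses the fine-grained security concern above.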