apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0
472 stars 174 forks source link

when using SQL catalog, table location is not optional #1254

Closed djouallah closed 2 weeks ago

djouallah commented 3 weeks ago

Feature Request / Improvement

currently by default, pyiceberg write all table inside the same path, which is arguable a very bad idea as later, there is no way to provide a fine level security, the current solution today is to explicitly write the path when creating a new table, which I don't think it is a great idea, as the user (IMHO) should not deal with storage path.

it would be nice if pyiceberg has an option to automatically generate a path based on a table name, if I write schema.table, the path will be base_location/schema/table.

to be clear, I am not asking to change the iceberg catalog spec, but a new friendly functionality when using pyiceberg.

kevinjqliu commented 3 weeks ago

+1 for sensible defaults.

in some scenarios, the warehouse location is set to base_warehouse_location/database_name/table_name https://github.com/apache/iceberg-python/blob/583a7e97db28afa4259cdf504611845222338893/pyiceberg/catalog/__init__.py#L954

I wonder if we just missed using this in some situations.

@djouallah how is your project setup currently?

corleyma commented 3 weeks ago

In case it's relevant, there is an issue re: adopting a pluggable LocationProvider interface akin to what exists in iceberg java sdk. If it does turn out that we're currently handling the location inconsistently in pyiceberg, this could be a good opportunity to begin the process of putting this behind a single interface (pluggability and alternate implementations could come later).

djouallah commented 3 weeks ago

@kevinjqliu https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/iceberg/Iceberg_azure.ipynb

if i remove the location, I get an error

djouallah commented 3 weeks ago

it seems to be a bug with the sql catalog, as it works fine with Polaris, fwiw, this how I am using the sqlcatalog

catalog = SqlCatalog(
      "default",
      **{
          "uri"                : userdata.get("postgresql_db"),
          "adlfs.account-name" : userdata.get("account_name") ,
          "adlfs.account-key"  : userdata.get ("AZURE_STORAGE_ACCOUNT_KEY"),
          "adlfs.tenant-id"    : userdata.get("azure_storage_tenant_id"),
          "py-io-impl"         : "pyiceberg.io.fsspec.FsspecFileIO",
          "legacy-current-snapshot-id": True
      },
                        )
Fokko commented 3 weeks ago

@djouallah I don't think it is obvious, but you can add the location to the catalog itself:

catalog = SqlCatalog(
      "default",
      **{
          "uri"                : userdata.get("postgresql_db"),
          "adlfs.account-name" : userdata.get("account_name") ,
          "adlfs.account-key"  : userdata.get ("AZURE_STORAGE_ACCOUNT_KEY"),
          "adlfs.tenant-id"    : userdata.get("azure_storage_tenant_id"),
          "py-io-impl"         : "pyiceberg.io.fsspec.FsspecFileIO",
          "legacy-current-snapshot-id": True,

          "warehouse": "/tmp/some_location/"  # Result: /tmp/some_location/schema.db/table/
      },
)