apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Is it possible to create a table without instantiating an Iceberg catalog? #1535

Closed aaron-siegel closed 3 weeks ago

aaron-siegel commented 3 weeks ago

Question

I see that it's possible to load an existing table without instantiating an Iceberg catalog, via StaticTable.from_metadata().

Is there any way to create a table without a catalog? NoopCatalog (as expected) throws an exception on any mutable operation.

Thanks, Aaron

corleyma commented 3 weeks ago

What you're referring to sounds more like a file-based catalog (called the Hadoop catalog in e.g. Spark and the Iceberg Java SDK). See this issue, which was migrated from https://github.com/apache/iceberg/issues/3220 (where most of the actual discussion took place), for prior consideration of the topic.

tl;dr: implementing a file-based catalog in pyiceberg was rejected at the time, both because of the implementation complexity and because of the risk of people using it in production against object stores that lack atomic move operations.

aaron-siegel commented 3 weeks ago

I'm referring more to using no catalog at all, rather than a file-based catalog: something akin to the InMemoryCatalog provided by Java Iceberg, which we can throw away once the table and metadata are created. The idea would be to treat the output as a one-off snapshot and load it later with StaticTable.

I'm getting the impression that this may not be an intended or supported pattern in pyiceberg, but still curious if there's a way. Thanks for your quick response!

Fokko commented 3 weeks ago

@aaron-siegel The SqlCatalog with SQLite should do the trick then. You can find examples here: https://py.iceberg.apache.org/#connecting-to-a-catalog

aaron-siegel commented 3 weeks ago

@Fokko Thanks! How would we specify configuration to pyiceberg? The docs state that pyiceberg expects to find a .yaml file in $HOME; is there any way to override this with dynamically specified configuration for an ephemeral catalog?

Fokko commented 3 weeks ago

The YAML is mostly to avoid leaking secrets/credentials into the Python code. The example is unauthenticated, so you can pass the properties directly to the catalog; here the dict is expanded into keyword arguments:

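from pyiceberg.catalog.sql import SqlCatalog

# Example local path (an assumption here); the directory must already exist.
warehouse_path = "/tmp/warehouse"

# Properties are passed as keyword arguments, so no .pyiceberg.yaml is needed.
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)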

aaron-siegel commented 3 weeks ago

Yes this worked for me, thank you!

Aaron

Fokko commented 3 weeks ago

@aaron-siegel Any time, let us know if we can improve the documentation. I'll close this issue for now, thanks for asking.

aaron-siegel commented 3 weeks ago

@Fokko Re improving the documentation, yes: I now see that it was staring me in the face in "Getting Started", but I was looking through the "API" section, which doesn't mention this pattern and seems to imply that a configuration file is mandatory. ("This information must be placed inside a file called .pyiceberg.yaml" - but if I understand things right, it doesn't have to be!)

So it might be worth mentioning somewhere in the API section that there are lighter-weight ways to do both configuration and table creation.

Thanks for all your help!