Closed aaron-siegel closed 3 weeks ago
What you're referring to sounds more like a file-based catalog (called the Hadoop catalog in, e.g., Spark and the Iceberg Java SDK). For prior consideration of the topic, see this issue, which was migrated from https://github.com/apache/iceberg/issues/3220; the original has more of the actual discussion.
tl;dr implementing a file-based catalog in pyiceberg was rejected at the time, both because of the implementation complexity and because of the risk of folks attempting to use it in production against object stores without atomic move operations.
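To illustrate why the atomic move matters, here is a minimal sketch of how a file-based commit can work on a local POSIX filesystem. It is not taken from pyiceberg or from the Java Hadoop catalog; the function name `commit_metadata` and the `v{N}.metadata.json` naming are illustrative assumptions. The point is that exactly one writer may publish a given version, and that guarantee comes from the filesystem's atomic "create only if absent" step. A store that emulates a move as copy-then-delete cannot offer the same guarantee.

```python
import os
import tempfile


def commit_metadata(table_dir: str, version: int, payload: bytes) -> bool:
    """Try to publish metadata `version`; return False if another writer got there first.

    Hypothetical helper for illustration only.
    """
    final_path = os.path.join(table_dir, f"v{version}.metadata.json")

    # Write the full payload to a temporary file first, so readers never
    # observe a half-written metadata file under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=table_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(payload)

    try:
        # os.link raises FileExistsError if final_path already exists, so
        # only one of several concurrent writers can claim this version.
        os.link(tmp_path, final_path)
        return True
    except FileExistsError:
        return False
    finally:
        # The temporary name is no longer needed in either case.
        os.unlink(tmp_path)


table_dir = tempfile.mkdtemp()
first = commit_metadata(table_dir, 1, b'{"writer": "a"}')
second = commit_metadata(table_dir, 1, b'{"writer": "b"}')
```

Here the second call loses the race for version 1 and would have to re-read the table state and retry with version 2.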
I'm referring more to using no catalog at all, rather than a file-based catalog; something akin to the InMemoryCatalog provided by Java iceberg that we can throw away once the table and metadata are created. The idea would be to treat the output as a one-off snapshot and load it later with StaticTable.
I'm getting the impression that this may not be an intended or supported pattern in pyiceberg, but still curious if there's a way. Thanks for your quick response!
@aaron-siegel The SqlCatalog with sqlite should do the trick then, you can find examples here: https://py.iceberg.apache.org/#connecting-to-a-catalog
@Fokko Thanks! How would we specify configuration to pyiceberg? The docs state that pyiceberg expects to find a .yaml file in $HOME; is there any way to override this with dynamically specified configuration for an ephemeral catalog?
The YAML is mostly to avoid leaking secrets/credentials into the Python code. The example is unauthenticated, so you can pass the properties directly as a dict, which is expanded into keyword arguments:
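from pyiceberg.catalog.sql import SqlCatalog

# Example location for the SQLite file and the table data; any writable directory works.
warehouse_path = "/tmp/warehouse"

# Properties are passed as keyword arguments instead of being read from a YAML file.
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)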
Yes this worked for me, thank you!
Aaron
@aaron-siegel Any time, let us know if we can improve the documentation. I'll close this issue for now, thanks for asking.
@Fokko Re improving the documentation, yes - I now see that it was staring me in the face in "Getting Started", but I was looking through the "API" section, which doesn't mention this pattern and seems to imply that a configuration file is mandatory. ("This information must be placed inside a file called .pyiceberg.yaml" - but if I understand things right, it doesn't have to be!)
So it might be worth mentioning somewhere in the API section that there are lighter-weight ways to do both configuration and table creation.
Thanks for all your help!
Question
I see that it's possible to load an existing table without instantiating an Iceberg catalog, via StaticTable.from_metadata(). Is there any way to create a table without a catalog? NoopCatalog (as expected) throws an exception on any mutable operation.
Thanks, Aaron