apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

PyIceberg Production Use case survey #1202

Open kevinjqliu opened 4 weeks ago

kevinjqliu commented 4 weeks ago

Feature Request / Improvement

As part of the journey toward version 1.0, we want to capture how this library is used in "production" environments.

Would love to hear from current users (and potential users) on different use cases. This will better inform the future roadmap.

Please include use cases in this issue, or if necessary I can start a Google Survey.

mariotaddeucci commented 1 week ago

Hey, I'm actually using it in production for small datasets, in combination with DuckDB, especially to avoid the small files produced by web scraping.

For ingestion, I read many raw files (JSON, CSV, and Parquet), all of them with a ULID key (a sortable ID is necessary), and write with overwrite, using this key in the overwrite filter. DuckDB generates a record_batch_reader, which allows generating the table and schema without loading everything into memory; after creating the table, it is necessary to convert the data into an Arrow table to write the final Iceberg table.

Because of the sortable ID, it's possible to use a filter predicate that overwrites only the data between the lower and upper bounds of the data set being ingested.
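
A minimal sketch of the bounded overwrite described above, not the commenter's actual code. It assumes a string ULID column named `id`, an open DuckDB connection, and a PyIceberg table already loaded from a catalog; the function names, the `sql` argument, and the column name are hypothetical.

```python
from typing import Iterable, Tuple


def ulid_bounds(ids: Iterable[str]) -> Tuple[str, str]:
    """Return the (lowest, highest) key of a batch.

    ULIDs in their canonical string form sort lexicographically by creation
    time, so plain string min/max gives the bounds of the batch.
    """
    ids = list(ids)
    if not ids:
        raise ValueError("cannot compute bounds of an empty batch")
    return min(ids), max(ids)


def overwrite_batch(con, sql, table, key="id"):
    """Replace the rows of `table` whose key falls inside the batch's bounds.

    con   -- an open DuckDB connection
    sql   -- query selecting the raw files (hypothetical, supplied by the caller)
    table -- a PyIceberg table loaded from a catalog
    key   -- name of the ULID column (assumed here to be "id")
    """
    # Imported here so ulid_bounds stays usable without PyIceberg installed.
    from pyiceberg.expressions import And, GreaterThanOrEqual, LessThanOrEqual

    # DuckDB streams the query result as an Arrow RecordBatchReader.
    reader = con.execute(sql).fetch_record_batch()
    # PyIceberg writes Arrow tables, so the reader is materialized before the write.
    batch = reader.read_all()

    lower, upper = ulid_bounds(batch.column(key).to_pylist())
    table.overwrite(
        batch,
        overwrite_filter=And(
            GreaterThanOrEqual(key, lower),
            LessThanOrEqual(key, upper),
        ),
    )
```

Re-running the same batch deletes and rewrites only the rows inside its own key range, so a retried ingestion should not duplicate data.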

Table maintenance still uses Spark for expiring snapshots.

To avoid small files, after a certain period I reload the entire dataset using DuckDB's native Iceberg read and overwrite it fully (a workaround for the rewrite files procedure).
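
A rough sketch of that full-rewrite workaround, under one stated substitution: it reads the table back with PyIceberg's own scan rather than DuckDB's Iceberg reader, to keep the example to a single library. The function name is hypothetical, and the whole dataset is held in memory, so this only suits small tables.

```python
def rewrite_fully(table):
    """Read every row of a PyIceberg table and write it back in one overwrite.

    The single overwrite replaces the many small files accumulated by
    incremental ingestion with the files produced by one write.
    """
    # Materialize the current contents as one Arrow table (all in memory).
    data = table.scan().to_arrow()
    # With no overwrite_filter, overwrite replaces the entire table contents.
    table.overwrite(data)
    return data
```

The old files stay referenced by earlier snapshots until those snapshots are expired, which is the maintenance step still handled by Spark above.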

I would love to expand it for more scenarios but some features are necessary like

These pipelines are moving off the Spark server and running in isolated containers.

andreapiso commented 1 day ago

Using PyIceberg alongside Trino. Our ETL is in Trino; PyIceberg is great for assets where we are doing things like grabbing data from APIs. Instead of storing files and crawling them into Iceberg tables with something like Glue, we can write that data directly into Iceberg so that our Trino pipelines can process it right away, super convenient!