dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

write iceberg tables on filesystem destination #1996

Open rudolfix opened 3 weeks ago

rudolfix commented 3 weeks ago

Background: We aim to support backend-less and server-less write support for Iceberg tables. We'd like to do this the same way we do it for delta tables: make the table_format iceberg recognized by the filesystem destination. From the user's PoV this means:

We want to use pyiceberg. This limits the write dispositions to append and replace (until upsert is implemented). We also won't support vacuum, compact, or z-order operations on the tables.

Tasks

    • [ ] we maintain a "technical" catalog: one SQLite file per table; those files are stored together with the data
    • [ ] to write a table, we lock the SQLite file with TransactionalFile, pull it locally, use it with pyiceberg, and then write it back
    • [ ] use pyiceberg to append to and replace tables, create partitions, do schema evolution, etc.
    • [ ] support all buckets via fsspec
    • [ ] as for delta, expose pyiceberg for a given table: read-only (catalog without a lock) and read/write with a lock on the catalog (maybe via a context manager). This will allow people to e.g. delete or rebuild partitions on a table.
    • [ ] support the filesystem sql_client to create views on ICEBERG tables via duckdb
jorritsandbrink commented 2 weeks ago

@rudolfix

  1. perhaps we can use an in-memory SQLite database instead of persisting the file to disk
    • if I understand correctly, at its core the catalog only maps table names to table metadata (which lives on the filesystem); we can populate the in-memory SQLite database with this mapping based on dlt metadata
  2. perhaps Iceberg's optimistic concurrency makes locking unnecessary