Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

bulk read from Iceberg catalog #2231

Closed: djouallah closed this issue 3 weeks ago

djouallah commented 3 weeks ago

currently using this for three tables, but there should probably be a better way when we have something like 100 tables

import daft

# `catalog` is a PyIceberg catalog configured elsewhere
scada = daft.read_iceberg(catalog.load_table('aemo.scada'))
calendar = daft.read_iceberg(catalog.load_table('aemo.calendar'))
duid = daft.read_iceberg(catalog.load_table('aemo.duid'))

something like

for x in catalog.list_tables("db"):
    daft.read_iceberg(catalog.load_table(x)).to_df(x)
jaychia commented 3 weeks ago

Hi @djouallah! Are you encountering any issues when reading 100 of these tables?

Note that even calling catalog.load_table can be slow if you're running it on a ton of tables, but that's not something Daft can fix; it's just how Iceberg/PyIceberg works, unfortunately 😕 There is always going to be some fixed overhead in reading and parsing each table's metadata.
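
If that fixed overhead ever does bite, one way to soften it is to fetch the metadata for many tables concurrently, since each load_table call is an independent network round trip. A minimal sketch, assuming a PyIceberg catalog and a hypothetical load_all helper:

import daft
from concurrent.futures import ThreadPoolExecutor

def load_all(catalog, namespace):
    # load_table is I/O-bound (it fetches and parses metadata), so a thread
    # pool overlaps the network round trips instead of paying them one by one.
    identifiers = catalog.list_tables(namespace)
    with ThreadPoolExecutor(max_workers=8) as pool:
        tables = list(pool.map(catalog.load_table, identifiers))
    return {ident: daft.read_iceberg(tbl) for ident, tbl in zip(identifiers, tables)}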

djouallah commented 3 weeks ago

sorry, what I meant is that I don't want to write 100 lines of code just to read 100 tables; my ask is about a simpler API, not about performance

jaychia commented 3 weeks ago

Ah got it :)

I think if you're reading N tables, you can just store them in a dictionary. Something like this:

dataframes = {
    table_name: daft.read_iceberg(catalog.load_table(table_name))
    for table_name in catalog.list_tables("db")
}
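
One detail worth noting, assuming a PyIceberg catalog: list_tables returns identifier tuples such as ("db", "scada") rather than strings, so the dictionary keys above will be tuples unless you normalize them, e.g.:

# Join the identifier parts so tables can be looked up by a dotted name.
dataframes = {
    ".".join(ident): daft.read_iceberg(catalog.load_table(ident))
    for ident in catalog.list_tables("db")
}
scada = dataframes["db.scada"]  # illustrative name, not from this thread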

Note that these calls to catalog.load_table and daft.read_iceberg do take some time to run! If you don't need all 100 tables, it would be a good idea not to call them on every table 😛
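
If you want to keep the dictionary-style ergonomics without paying that cost up front, a minimal sketch is to load lazily and cache, assuming catalog is a PyIceberg catalog already in scope (table_df is a hypothetical helper):

from functools import lru_cache

import daft

@lru_cache(maxsize=None)
def table_df(name):
    # Metadata is fetched only the first time a table is requested;
    # repeat calls return the cached DataFrame.
    return daft.read_iceberg(catalog.load_table(name))

scada = table_df("aemo.scada")  # first call loads, later calls hit the cache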