
Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

support pyarrow recordbatch as a valid data source for writing Iceberg table #1004

Open djouallah opened 1 month ago

djouallah commented 1 month ago

Feature Request / Improvement

Currently I am using this in an environment with limited RAM:

  df = final.arrow()
  catalog, table_location = connect_catalog(storage)
  catalog.create_table_if_not_exists(db + ".scada", schema=df.schema, location=table_location + f"/{db}/scada")
  catalog.load_table(db + ".scada").append(df)

I sometimes run out of memory because the Arrow table needs to be fully materialized in memory. I can generate RecordBatches from my source system instead, which would use less memory.

Fokko commented 1 month ago

@djouallah Thanks for raising this. To clarify, does the final.arrow() cause an OOM, or the .append operation?

djouallah commented 1 month ago

The append operation causes the OOM.