
Add support to write deephaven tables to iceberg #6125

Open malhotrashivam opened 6 days ago

malhotrashivam commented 6 days ago

The spec is being developed alongside the implementation work, but at a high level, the APIs we have settled on look like the following (a rough Java sketch follows the list):

  1. void append(Tables...) or appendDataFiles/appendTables: writes tables to data files 1:1, then performs a transaction to add the new data files
  2. void overwrite(Tables...): writes tables to data files 1:1, then performs a transaction to remove all existing data files and add the new ones
  3. List<URI> write(Tables...): writes tables to data files 1:1, without putting anything in a transaction
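For illustration, here is a minimal Java sketch of what such a writer interface could look like. The interface name IcebergTableWriter and the exact signatures are hypothetical placeholders, since the spec is still being developed:

```java
import java.net.URI;
import java.util.List;
import io.deephaven.engine.table.Table;

// Hypothetical sketch of the proposed writer API; the interface name and
// method signatures are placeholders, not the final spec.
public interface IcebergTableWriter {

    // Writes each table to its own data file (1:1), then commits a single
    // transaction that appends the new data files to the Iceberg table.
    void append(Table... tables);

    // Writes each table to its own data file (1:1), then commits a single
    // transaction that removes all existing data files and adds the new ones.
    void overwrite(Table... tables);

    // Writes each table to its own data file (1:1) and returns the file
    // locations, without committing any transaction.
    List<URI> write(Table... tables);
}
```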

An important requirement is that we persist the Iceberg schema's field IDs into the field_id field of the Parquet schema Type elements, so that Iceberg columns can be mapped to Parquet columns.
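As a rough sketch of what that looks like with the parquet-mr Types builder, the builder's id(...) call is what populates field_id on each schema element. The column names and IDs below are invented for illustration; in practice the IDs must come from the Iceberg schema:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public final class FieldIdExample {
    // Column names and field IDs here are made up; real IDs must be taken
    // from the corresponding Iceberg schema elements.
    static MessageType schemaWithFieldIds() {
        return Types.buildMessage()
                .optional(PrimitiveTypeName.INT64).id(1).named("sequence_number")
                .optional(PrimitiveTypeName.BINARY)
                        .as(LogicalTypeAnnotation.stringType()).id(2).named("symbol")
                .optional(PrimitiveTypeName.DOUBLE).id(3).named("price")
                .named("schema");
    }
}
```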

devinrsmith commented 6 days ago

We should also see if there is any specific guidance on the metadata we should be writing down. When writing a pyarrow table using pyiceberg, we've noticed that the metadata key ARROW:schema contains the Arrow schema; with pyspark, it wrote a metadata key iceberg.schema that contains the Iceberg schema.
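One way to check what a given writer actually produced is to dump the key/value metadata from the Parquet footer. A minimal sketch using parquet-mr (the file path is a placeholder):

```java
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public final class FooterMetadataDump {
    public static void main(String[] args) throws Exception {
        // "data.parquet" is a placeholder for a file written by
        // pyiceberg, pyspark, etc.
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path("data.parquet"), new Configuration()))) {
            // Keys like ARROW:schema or iceberg.schema show up here,
            // depending on which writer produced the file.
            Map<String, String> kv =
                    reader.getFooter().getFileMetaData().getKeyValueMetaData();
            kv.forEach((key, value) -> System.out.println(key + " -> " + value));
        }
    }
}
```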