More Control over Parquet Writing

deephaven / deephaven-core

Deephaven Community Core


More Control over Parquet Writing #6123

Open cpwright opened 1 month ago

cpwright commented 1 month ago

As a systems integrator, I want to be able to have increased control over writing parquet files so that I can implement a process for transforming data overnight.

This ticket needs more definition before we work on it. I would like to be able to either pass one row group of data at a time to the write function or, alternatively, pass one column of a row group at a time, so that I can ensure read locality for my input data.
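To make the two calling shapes concrete, here is a rough sketch of what such an interface might look like. Everything in it is hypothetical: `RowGroupSketchWriter`, `write_row_group`, `write_column`, and `end_row_group` are illustrative names, not existing Deephaven APIs, and the class only records row groups in memory rather than encoding Parquet.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class RowGroupSketchWriter:
    """Hypothetical writer: records row groups in memory, writes no Parquet."""

    column_names: Sequence[str]
    row_groups: List[Dict[str, list]] = field(default_factory=list, init=False)
    _pending: Dict[str, list] = field(default_factory=dict, init=False)

    def write_row_group(self, columns: Dict[str, Sequence]) -> None:
        """Option 1: hand over a whole row group in a single call."""
        if self._pending:
            raise ValueError("a row group is already in progress")
        for name in self.column_names:
            self.write_column(name, columns[name])
        self.end_row_group()

    def write_column(self, name: str, values: Sequence) -> None:
        """Option 2: hand over one column of the current row group."""
        if name not in self.column_names:
            raise KeyError(f"unknown column: {name}")
        if name in self._pending:
            raise ValueError(f"column already written for this row group: {name}")
        self._pending[name] = list(values)

    def end_row_group(self) -> None:
        """Close the current row group after checking it is complete."""
        missing = [n for n in self.column_names if n not in self._pending]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        if len({len(v) for v in self._pending.values()}) > 1:
            raise ValueError("columns in a row group must have equal length")
        self.row_groups.append(self._pending)
        self._pending = {}


writer = RowGroupSketchWriter(["sym", "price"])

# Option 1: one row group per call.
writer.write_row_group({"sym": ["A", "B"], "price": [1.0, 2.0]})

# Option 2: one column at a time, so each input column is read in one pass.
writer.write_column("sym", ["C"])
writer.write_column("price", [3.0])
writer.end_row_group()
```

The column-at-a-time form is the one that matters for read locality: the caller can scan each input column once per row group instead of interleaving reads across columns.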

rcaudy commented 1 month ago

As noted by @cpwright, we're still defining this ticket and its priority.

One detail we'll need to be sure to handle is data indexes when there are multiple row groups. One approach might be to mirror the row group structure of the "main" file in each index file, as a hint that we potentially need to shift the row sets persisted to the index table in order to compensate for row group shifts in the main table.
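As a rough illustration of the shifting idea, and only one possible reading of it: suppose each row group of the index file stores row positions relative to the matching row group of the main file. A reader would then add each row group's starting offset to recover table-wide positions. The function names and the dictionary-of-lists layout below are assumptions for the sketch, not how Deephaven persists data indexes.

```python
from itertools import accumulate
from typing import Dict, Hashable, List


def row_group_offsets(sizes: List[int]) -> List[int]:
    """Starting row position of each row group, given the row group sizes."""
    return [0] + list(accumulate(sizes))[:-1]


def shift_index(
    sizes: List[int],
    per_group_index: List[Dict[Hashable, List[int]]],
) -> Dict[Hashable, List[int]]:
    """Merge per-row-group indexes into one index of table-wide positions.

    per_group_index[i] maps an index key to row positions that are local to
    row group i of the main file (an assumed layout for this sketch).
    """
    merged: Dict[Hashable, List[int]] = {}
    for offset, group_index in zip(row_group_offsets(sizes), per_group_index):
        for key, local_rows in group_index.items():
            # Shift local positions by the row group's starting offset.
            merged.setdefault(key, []).extend(offset + r for r in local_rows)
    return merged


# Main table written as two row groups of 3 and 2 rows.
sizes = [3, 2]
per_group_index = [
    {"A": [0, 2], "B": [1]},  # positions local to row group 0
    {"A": [1], "B": [0]},     # positions local to row group 1
]
merged = shift_index(sizes, per_group_index)
```

Here the second row group starts at row 3, so its local positions 1 and 0 become 4 and 3 in the merged index.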

devinrsmith commented 1 month ago

I could also see #6125 imposing some writing requirements, potentially the need to tack on field_ids or add key-value (KV) metadata, among other things. (I don't know what support we may or may not already have for those kinds of requirements.)