deephaven / deephaven-core

Deephaven Community Core

Parquet: Support repetition level >1 and multi-column fields #871

Open rcaudy opened 3 years ago

rcaudy commented 3 years ago

Currently, we regard nested repetition and multi-column fields as uncommon and hard to map into a columnar data table like Deephaven's. This feature request is intended to capture views to the contrary.

Linked to #294, although intended for a later effort.

rcaudy commented 2 years ago

Likely the first step is some kind of "flattening", but this is contrary to the intent of the Dremel design, so maybe we can think of a better solution.
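
To make the "flattening" idea concrete, here is a minimal sketch in plain Python, not Deephaven code. The function name `flatten_nested` and the sample data are hypothetical; the column name is borrowed from the `int_array_array` example in this thread. It expands a doubly nested list column (maximum repetition level 2) into one row per leaf value.

```python
from typing import Iterator, List, Tuple


def flatten_nested(column: List[List[List[int]]]) -> Iterator[Tuple[int, int, int]]:
    """Yield (row, outer_index, value) for each leaf value of a doubly nested list column."""
    for row, outer in enumerate(column):
        for outer_index, inner in enumerate(outer):
            for value in inner:
                yield (row, outer_index, value)


# Hypothetical sample data: two rows, each holding a list of lists of ints.
int_array_array = [
    [[1, 2], [3]],
    [[], [4]],
]

flat = list(flatten_nested(int_array_array))
print(flat)
```

Note that the empty inner list in the second row leaves no trace in the output. That kind of loss is one reason flattening sits poorly with the Dremel design, whose repetition and definition levels preserve empty and missing nested values.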

rcaudy commented 2 years ago

I'll be improving our error messages in a PR shortly. The new messages are shown below. For:

t = io.deephaven.db.tables.utils.ParquetTools.readTable("/data/parquetFiles/nonnullable_nested_v1_IMPALA_NULLS_NONE.parquet")

We'll see:

java.lang.UnsupportedOperationException: Unsupported maximum repetition level 2 in column int_array_array/list/element/list/element

For:

t = io.deephaven.db.tables.utils.ParquetTools.readTable("/data/parquetFiles/repeated_nested_RUST_NONE.parquet")

We'll see:

java.lang.UnsupportedOperationException: Encountered unsupported multi-column field phoneNumbers: found columns phoneNumbers/phone/number and phoneNumbers/phone/kind

devinrsmith commented 1 year ago

It might be nice to be able to specify which columns you care about for your Table - in which case, the user can choose to not include the nested columns.
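
As a rough illustration of that idea, the sketch below groups Parquet leaf column paths by their top-level field and keeps only the fields backed by a single leaf column. It is plain Python and does not use any Deephaven API; the helper names `group_leaf_columns` and `single_column_fields` are hypothetical, and the list of leaf paths is assumed to have been obtained from the file's schema by some other means.

```python
from collections import defaultdict
from typing import Dict, List


def group_leaf_columns(leaf_paths: List[str]) -> Dict[str, List[str]]:
    """Group slash-separated leaf column paths by their top-level field name."""
    groups = defaultdict(list)
    for path in leaf_paths:
        top_level = path.split("/", 1)[0]
        groups[top_level].append(path)
    return dict(groups)


def single_column_fields(leaf_paths: List[str]) -> List[str]:
    """Return the top-level fields that map to exactly one leaf column."""
    grouped = group_leaf_columns(leaf_paths)
    return [name for name, paths in grouped.items() if len(paths) == 1]


# Leaf paths assumed for illustration: a flat "date" column plus a
# multi-column "outputs" field.
leaf_paths = [
    "date",
    "outputs/list/element/address",
    "outputs/list/element/index",
]

selected = single_column_fields(leaf_paths)
print(selected)
```

A reader could then limit the table to the `selected` fields. This sketch only looks at how many leaf columns a field has; it does not check repetition levels, so a single-column field with nested repetition would still need separate handling.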

There's a mechanism right now to provide column instructions:

from deephaven.parquet import read, ColumnInstruction

t = read(
    path="/snappy.parquet",
    col_instructions=[
        ColumnInstruction(column_name="date", parquet_column_name="date")
    ],
)

but this currently throws the error:

java.lang.UnsupportedOperationException: Encountered unsupported multi-column field outputs: found columns outputs/list/element/address and outputs/list/element/index
    at io.deephaven.parquet.table.ParquetSchemaReader.lambda$readParquetSchema$1(ParquetSchemaReader.java:174)
    at java.base/java.util.HashMap.compute(HashMap.java:1316)
    at io.deephaven.parquet.table.ParquetSchemaReader.readParquetSchema(ParquetSchemaReader.java:169)
    at io.deephaven.parquet.table.ParquetTools.convertSchema(ParquetTools.java:647)
    at io.deephaven.parquet.table.ParquetTools.readTableInternal(ParquetTools.java:384)
    at io.deephaven.parquet.table.ParquetTools.readTable(ParquetTools.java:94)

devinrsmith commented 8 months ago

A user has hit this with the Parquet viewer; see https://github.com/devinrsmith/deephaven-parquet-viewer/issues/9