apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

Use parquet crate for decoding Parquet data into Arrow arrays #1040

Open andygrove opened 3 weeks ago

andygrove commented 3 weeks ago

What is the problem the feature request solves?

Comet has its own native code for decoding Parquet structures into Arrow arrays. This issue is for discussing whether to delegate these operations to the parquet crate instead.

The benefits of this approach include:

Possible downsides of this approach:

[1] https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/
[2] https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-2/

Describe the potential solution

No response

Additional context

No response

parthchandra commented 3 weeks ago

  • Support for complex types. The parquet crate already supports reading maps and structs.

@andygrove The doc at https://crates.io/crates/parquet says the reader supports primitive column value readers, but it doesn't say whether complex types are supported. Also, something to keep in mind: Comet has Spark-specific decoding (for instance, Int96 timestamps: https://github.com/apache/datafusion-comet/blob/00cf79b253d9fa31dcc0506facbaf7b2dac25cdd/native/core/src/parquet/read/values.rs#L841). Some post-processing may be needed unless we can hook in our own decoder for special cases like this.
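
To make the Int96 case concrete, here is a minimal sketch of the kind of conversion involved. It assumes the common INT96 timestamp layout (8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes of Julian day number) and a target of microseconds since the Unix epoch. The function and constant names are hypothetical and are not taken from Comet or the parquet crate; the decoder at the link above may handle more cases than this does.

```rust
/// Julian day number of 1970-01-01, the Unix epoch.
const JULIAN_DAY_OF_UNIX_EPOCH: i64 = 2_440_588;
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Hypothetical helper: converts one 12-byte INT96 value to microseconds
/// since the Unix epoch. Assumes bytes 0..8 hold nanoseconds of the day and
/// bytes 8..12 hold the Julian day, both little-endian.
fn int96_to_micros(bytes: &[u8; 12]) -> i64 {
    let mut nanos_buf = [0u8; 8];
    nanos_buf.copy_from_slice(&bytes[0..8]);
    let nanos_of_day = i64::from_le_bytes(nanos_buf);

    let mut day_buf = [0u8; 4];
    day_buf.copy_from_slice(&bytes[8..12]);
    let julian_day = i64::from_le_bytes([
        day_buf[0], day_buf[1], day_buf[2], day_buf[3], 0, 0, 0, 0,
    ]);

    // Shift the day count to the Unix epoch, then add the time of day,
    // truncated from nanoseconds to microseconds.
    (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * MICROS_PER_DAY + nanos_of_day / 1_000
}

/// Hypothetical helper for building test inputs in the same layout.
/// Only non-negative Julian days are supported in this sketch.
fn encode_int96(julian_day: u32, nanos_of_day: i64) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[0..8].copy_from_slice(&nanos_of_day.to_le_bytes());
    out[8..12].copy_from_slice(&julian_day.to_le_bytes());
    out
}
```

For example, `int96_to_micros(&encode_int96(2_440_588, 0))` yields 0, the Unix epoch itself. If the parquet crate's own Int96 handling does not match what Spark expects, a step like this is the sort of post-processing that would have to run on its output.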

parthchandra commented 3 weeks ago

As a first step, we can certainly switch to the lower-level file APIs to replace the Comet FileReader. https://docs.rs/parquet/latest/parquet/file/serialized_reader/struct.SerializedPageReader.html would do very nicely.

viirya commented 3 weeks ago

Hmm, I guess that doc is either out of date or inaccurate. The parquet crate supports complex types.

andygrove commented 2 weeks ago

Hmm, I guess that doc is either out of date or inaccurate. The parquet crate supports complex types.

Yes, there is StructArrayReader, for example:

pub struct StructArrayReader {
    children: Vec<Box<dyn ArrayReader>>,
    data_type: ArrowType,
    struct_def_level: i16,
    struct_rep_level: i16,
    nullable: bool,
}

andygrove commented 2 weeks ago

I created a Google doc where we can collaborate on design:

https://docs.google.com/document/d/1eiDFEScPjxBMahJW6lmBI8JjVlI6CwhiJgkTSsTvPVY/edit?usp=sharing