Open andygrove opened 3 weeks ago
- Support for complex types. The parquet crate already supports reading maps and structs.
@andygrove The doc here https://crates.io/crates/parquet says the reader supports Primitive column value readers
. Doesn't say whether complex types are supported.
Also, something to keep in mind, Comet has Spark specific decoding (for instance Int96 timestamps https://github.com/apache/datafusion-comet/blob/00cf79b253d9fa31dcc0506facbaf7b2dac25cdd/native/core/src/parquet/read/values.rs#L841). Some post processing may need to be done unless we can hook in our own decoder for special cases like this.
As a first step, we can certainly switch to the lower level file
APIs to replace the Comet FileReader. https://docs.rs/parquet/latest/parquet/file/serialized_reader/struct.SerializedPageReader.html
would do very nicely.
Hmm, I guess that doc is either out-of-dated or inaccurate. The Parquet supports complex types.
Hmm, I guess that doc is either out-of-dated or inaccurate. The Parquet supports complex types.
Yes, there is StructArrayReader
, for example:
pub struct StructArrayReader {
children: Vec<Box<dyn ArrayReader>>,
data_type: ArrowType,
struct_def_level: i16,
struct_rep_level: i16,
nullable: bool,
}
I created a Google doc where we can collaborate on design:
https://docs.google.com/document/d/1eiDFEScPjxBMahJW6lmBI8JjVlI6CwhiJgkTSsTvPVY/edit?usp=sharing
What is the problem the feature request solves?
Comet has native code for decoding Parquet structures into Arrow arrays. This issue is for discussing delegating to the parquet crate instead for these operations.
The benefits of this approach include:
Possible downsides of this approach:
[1] https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/ [2] https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-2/
Describe the potential solution
No response
Additional context
No response