apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

PARQUET-2414: Extend BYTE_STREAM_SPLIT to support INT32, INT64 and FIXED_LEN_BYTE_ARRAY data #229
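
For readers unfamiliar with the encoding, here is a minimal sketch of what BYTE_STREAM_SPLIT does when applied to fixed-width values such as INT32. It is not taken from this PR or from any Parquet implementation. The function names `byte_stream_split_encode` and `byte_stream_split_decode` are hypothetical, and the sketch assumes the values are already serialized as little-endian fixed-width bytes. For values of width K, the encoder writes K streams back to back, where stream i holds byte i of every value.

```python
import struct


def byte_stream_split_encode(data: bytes, width: int) -> bytes:
    """Scatter byte i of each fixed-width value into stream i, then concatenate the streams.

    Hypothetical helper for illustration only; not a Parquet library API.
    """
    if width <= 0 or len(data) % width != 0:
        raise ValueError("data length must be a multiple of the value width")
    # data[i::width] picks byte i of every value, i.e. one whole stream.
    return b"".join(data[i::width] for i in range(width))


def byte_stream_split_decode(encoded: bytes, width: int) -> bytes:
    """Inverse of the encoder: interleave the streams back into fixed-width values."""
    if width <= 0 or len(encoded) % width != 0:
        raise ValueError("encoded length must be a multiple of the value width")
    count = len(encoded) // width  # number of values, also the length of each stream
    out = bytearray(len(encoded))
    for i in range(width):
        out[i::width] = encoded[i * count:(i + 1) * count]
    return bytes(out)


# Four INT32 values, serialized little-endian (4 bytes each).
values = [1, 2, 3, 258]
raw = struct.pack("<4i", *values)

encoded = byte_stream_split_encode(raw, 4)
decoded = struct.unpack("<4i", byte_stream_split_decode(encoded, 4))
```

With small integers like these, the high-order byte streams are all zeros, which is the kind of redundancy a general-purpose compressor applied afterwards can exploit. The encoding itself does not shrink the data.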

Closed pitrou closed 3 months ago

etseidl commented 4 months ago

+1 I think this is great. Are PoCs needed for this? I'm interested in seeing how well this works as a DELTA_BINARY_PACKED replacement for my data.

pitrou commented 4 months ago

@etseidl I've written the implementation for Parquet C++ here: https://github.com/apache/arrow/pull/40094

I was planning to implement it for Parquet Java, but you may want to do it as well.

etseidl commented 4 months ago

> I was planning to implement it for Parquet Java, but you may want to do it as well.

Sounds good. I'll put it in my queue. I'll check out your arrow implementation to see if there are any pitfalls to avoid. Thanks!

GregoryKimball commented 3 months ago

Thank you @pitrou for investigating this! Extending BYTE_STREAM_SPLIT to more data types will give us great new options in RAPIDS.