apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

[Format] Specify VARIABLE_SIZE_LIST Logical type #437

Open rok opened 1 week ago

rok commented 1 week ago

Arrow recently introduced FixedShapeTensor and VariableShapeTensor canonical extension types that use FixedSizeList and StructArray(List, FixedSizeList) as storage, respectively. These are targeted at machine learning and scientific applications that deal with large datasets and would benefit from using Parquet as on-disk storage.

If Arrow's List were stored as BYTE_ARRAY, we would likely see less overhead from reading and writing definition and repetition levels. See discussion here. It would therefore be beneficial to introduce a VARIABLE_SIZE_LIST logical type to Parquet.
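
To make the overhead concrete, here is a minimal pure-Python sketch, not Parquet's actual encoder. It compares what is stored for a list-of-floats column when each element carries a repetition level with what is stored when each row is one opaque byte blob. The helper names (`list_to_levels`, `list_to_byte_array`, `byte_array_to_list`) are hypothetical, and the sketch assumes non-null, non-empty rows of float32 values so that definition levels can be left out.

```python
import struct


def list_to_levels(rows):
    """Flatten rows into (values, repetition_levels).

    Repetition level 0 marks the first element of a new row,
    1 marks a further element of the same row's list.
    Assumes every row is a non-null, non-empty list.
    """
    values, rep_levels = [], []
    for row in rows:
        for i, v in enumerate(row):
            values.append(v)
            rep_levels.append(0 if i == 0 else 1)
    return values, rep_levels


def list_to_byte_array(row):
    """Pack one row as a single little-endian float32 blob."""
    return struct.pack(f"<{len(row)}f", *row)


def byte_array_to_list(blob):
    """Inverse of list_to_byte_array: 4 bytes per float32 element."""
    n = len(blob) // 4
    return list(struct.unpack(f"<{n}f", blob))


rows = [[1.0, 2.0, 3.0], [4.0, 5.0]]

# Level-based layout: one repetition level per element.
values, rep_levels = list_to_levels(rows)

# Blob layout: one value per row, no per-element levels.
blobs = [list_to_byte_array(r) for r in rows]
```

In the level-based layout the reader has to walk one level per element to find row boundaries, while in the blob layout the row boundary is the blob length. How much this matters in practice depends on level encoding and compression, which the sketch does not model.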