apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0
1.69k stars 422 forks source link

PARQUET-2474: Add FIXED_SIZE_LIST logical type #241

Open rok opened 1 month ago

rok commented 1 month ago

As proposed in https://github.com/apache/arrow/issues/34510 and on ML, PARQUET-2474.

Arrow recently introduced FixedShapeTensor and VariableShapeTensor canonical extension types that use FixedSizeList and StructArray(List, FixedSizeList) as storage respectfully. These are targeted at machine learning and scientific applications that deal with large datasets and would benefit from using Parquet as on disk storage.

However currently FixedSizeList is stored as List in Parquet which adds significant conversion overhead when reading and writing as discussed here. It would therefore be beneficial to introduce a FIXED_SIZE_LIST logical type to Parquet.

rok commented 1 month ago

cc @wgtmac @tustvold @alippai @mapleFU @AlenkaF

tustvold commented 1 month ago

One thing to perhaps give thought to is how this might represent nested lists, say you wanted to encode a m by n matrix, would you just encode this as a m * n list or do we want to support this as a first-class concept?

I had perhaps been anticipating that fixed size list would be a variant of "REPEATED" as opposed to a physical type, that is just able to avoid incrementing the max_def_level and max_rep_level. This would make it significantly more flexible I think, although I concede it will make it harder to implement.

wgtmac commented 1 month ago

cc @JFinis

rok commented 4 weeks ago

Apologies for taking a while to reply.

I've split this into two cases: FixedSizeListType (length is constant) and VariableSizeListType (length differs per row) for the sake of discussion. I would move VariableSizeListType into a separate PR if we even decide it is needed next to ListType.

One thing to perhaps give thought to is how this might represent nested lists, say you wanted to encode a m by n matrix, would you just encode this as a m * n list or do we want to support this as a first-class concept?

We could start with a more general multidimensional array definition and have list be a 1 dimensional array. Additional metadata required would not be that bad. I'm just a bit scared of validation and striding logic bleeding into parquet implementations. Do we have any other inputs / opinions?

I had perhaps been anticipating that fixed size list would be a variant of "REPEATED" as opposed to a physical type, that is just able to avoid incrementing the max_def_level and max_rep_level. This would make it significantly more flexible I think, although I concede it will make it harder to implement.

That's interesting. What would you expect performance wise with this approach?

rok commented 1 week ago

Thanks for the review @etseidl ! I've updated this with your suggestions.

alippai commented 1 week ago

@ritchie46 would this be useful for your new polars Array type?