datafusion-contrib / datafusion-orc

Implementation of Apache ORC file format using Apache Arrow in-memory format
Apache License 2.0

Support decoding Union column with up to and including 256 variants #77

Open Jefffrey opened 3 months ago

Jefffrey commented 3 months ago

According to the ORC specification (https://orc.apache.org/specification/ORCv1/):

"Currently ORC union types are limited to 256 variants, which matches the Hive type model."

However, in Arrow, UnionArrays are limited to 127 variants (https://arrow.apache.org/docs/format/Columnar.html#union-layout):

"A union with more than 127 possible types can be modeled as a union of unions."

To support this, we would need to follow the Arrow suggestion above and decode ORC unions with more than 127 variants into a union of unions.
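As a rough illustration of the union-of-unions idea, the sketch below splits an ORC union tag (0 to 255) into an outer and an inner type id, and computes how many variants each inner union would hold. It is only a sketch under stated assumptions: `MAX_VARIANTS_PER_UNION`, `split_tag` and `inner_union_sizes` are hypothetical names that do not exist in datafusion-orc or arrow-rs, the per-union limit of 127 is taken from the Arrow limit quoted above, and filling the inner unions in tag order is just one possible layout.

```rust
/// Assumed per-union variant limit, taken from the Arrow limit quoted above.
/// Hypothetical constant, not part of datafusion-orc or arrow-rs.
const MAX_VARIANTS_PER_UNION: usize = 127;

/// Hypothetical helper: map an ORC union tag (0..=255) to a pair of
/// (outer type id, inner type id) for a union-of-unions layout.
/// Inner unions are filled in tag order, 127 variants at a time.
fn split_tag(tag: u8) -> (i8, i8) {
    let tag = tag as usize;
    let outer = tag / MAX_VARIANTS_PER_UNION; // which inner union holds the variant
    let inner = tag % MAX_VARIANTS_PER_UNION; // position inside that inner union
    (outer as i8, inner as i8)
}

/// Hypothetical helper: number of variants held by each inner union
/// for an ORC union with `variant_count` variants.
fn inner_union_sizes(variant_count: usize) -> Vec<usize> {
    (0..variant_count)
        .step_by(MAX_VARIANTS_PER_UNION)
        .map(|start| (variant_count - start).min(MAX_VARIANTS_PER_UNION))
        .collect()
}
```

Under these assumptions, a full 256-variant ORC union would need three inner unions holding 127, 127 and 2 variants, so `split_tag(255)` would land in the third inner union. Unions with 127 variants or fewer would produce a single group and could presumably be decoded as a plain union, as in the initial support.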

See initial Union support here: https://github.com/datafusion-contrib/datafusion-orc/commit/ee69b91cb2ce4c18ee5148bd541658d362fc577f