[C++] Avoid producing run-end encoded arrays with runs that have a length longer than INT_MAX

felipecrv commented 1 year ago

Hypothesis

Limiting the length of runs (not the length of logical arrays) is going to prevent multi-language integration pains at a very low storage/memory cost — compresssing ~2B elements into a single run already yields a great compression factor. If we need to produce longer runs we can simply append multiple < INT_MAX-sized runs.

The spec on array lengths

https://arrow.apache.org/docs/dev/format/Columnar.html#array-lengths

Array lengths are represented in the Arrow metadata as a 64-bit signed integer. An implementation of Arrow is considered valid even if it only supports lengths up to the maximum 32-bit signed integer, though. If using Arrow in a multi-language environment, we recommend limiting lengths to 2 31 - 1 elements or less. Larger data sets can be represented using multiple array chunks.

The solution proposed by the spec for languages that don't support 64-bit integers is to use multiple array chunks. Chunking the physical arrays of a run-end encoded logical array is much easier when we don't have to split in the middle of a run. So limiting the runs at INT_MAX means we only have to worry about the regular split of logical length.

Component(s)

C++

westonpace commented 1 year ago

+1. Should we update columnar.rst with a similar note?

felipecrv commented 1 year ago

@westonpace sure, updating the reference docs is part of of fixing this issue if people participating in the discussion are in favor of this. :wink:

apache / arrow