Hypothesis
Limiting the length of runs (not the length of logical arrays) is going to prevent multi-language integration pains at a very low storage/memory cost: compressing ~2B elements into a single run already yields a great compression factor. If we need to represent a longer run, we can simply append multiple runs of fewer than INT_MAX elements each.
The spec on array lengths
https://arrow.apache.org/docs/dev/format/Columnar.html#array-lengths
"Array lengths are represented in the Arrow metadata as a 64-bit signed integer. An implementation of Arrow is considered valid even if it only supports lengths up to the maximum 32-bit signed integer, though. If using Arrow in a multi-language environment, we recommend limiting lengths to 2^31 - 1 elements or less. Larger data sets can be represented using multiple array chunks."
The solution proposed by the spec for languages that don't support 64-bit integers is to use multiple array chunks. Chunking the physical arrays of a run-end encoded logical array is much easier when we don't have to split in the middle of a run. So limiting runs to INT_MAX means we only have to worry about the regular split of the logical length.
Component(s)
C++
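A minimal sketch of the run-splitting idea from the hypothesis, to make the proposal concrete. The `RunEndEncoded` struct, the `AppendRun` helper, and the `kMaxRunLength` constant are hypothetical names used only for illustration; they are not Arrow's builder API, and the actual implementation may enforce the limit differently.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical container, NOT Arrow's API: run_ends[i] is the exclusive
// logical end offset of run i, and values[i] is the value repeated in that run.
struct RunEndEncoded {
  std::vector<int64_t> run_ends;
  std::vector<int32_t> values;
};

// Assumed cap: each run holds strictly fewer than INT_MAX elements.
constexpr int64_t kMaxRunLength = std::numeric_limits<int32_t>::max() - 1;

// Append `length` repetitions of `value`. A run longer than `max_run_length`
// is stored as several consecutive runs that carry the same value.
void AppendRun(RunEndEncoded& ree, int32_t value, int64_t length,
               int64_t max_run_length = kMaxRunLength) {
  int64_t end = ree.run_ends.empty() ? 0 : ree.run_ends.back();
  while (length > 0) {
    const int64_t piece = std::min(length, max_run_length);
    end += piece;
    ree.run_ends.push_back(end);
    ree.values.push_back(value);
    length -= piece;
  }
}
```

For instance, calling `AppendRun(ree, 7, 10, 4)` on an empty container stores three runs with run ends 4, 8, and 10, all carrying the value 7. The cost of the cap is one extra run-end/value pair per additional piece, which is negligible next to the roughly two billion elements each piece covers.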