An Arrow list array has a values array and an offsets array; the offsets mark where each row's slice of the values array begins and ends. The Arrow spec allows these offsets to be either i32 or i64. We originally wanted to support both because polars only supports i64 offsets.
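To make the offsets layout concrete, here is a minimal sketch using plain slices rather than the real Arrow types. The `list_row` helper is hypothetical and exists only for illustration; it assumes the usual Arrow convention that a list array with `n` rows has `n + 1` offsets.

```rust
/// Returns the slice of `values` that belongs to row `i` of a list array.
///
/// Hypothetical helper for illustration only. `offsets` has one more entry
/// than there are rows, and row `i` spans `offsets[i]..offsets[i + 1]`.
fn list_row<'a>(values: &'a [f64], offsets: &[i32], i: usize) -> &'a [f64] {
    let start = offsets[i] as usize;
    let end = offsets[i + 1] as usize;
    &values[start..end]
}
```

With `values = [1.0, 2.0, 3.0, 4.0, 5.0]` and `offsets = [0, 2, 2, 5]`, row 0 has two values, row 1 is empty, and row 2 has three. The last offset equals the length of the values array, which is why the offset type bounds how many values one array can address.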
But there is currently an incredible amount of complexity in supporting geometry arrays with i64 offsets. There's an `O: OffsetSizeTrait` everywhere. The `Geometry` scalar is currently parametrized by `O: OffsetSizeTrait`, making it hard to use generically. And every place where we downcast arrays has roughly 2x the variants because we have `Large*` data types everywhere.
The largest value an `i32` offset can hold is `2^31 - 1`, or `2,147,483,647`. So in order to overflow `i32` offsets, a `values` array would have to hold at least `2^31` elements. But for 2D coordinates, each element of `values` is two f64s, or 16 bytes. So that means you'd need `2^31 * 16` bytes, or 32 GiB, of coordinate data in a single array.
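The arithmetic can be checked with a short sketch. The constant and function names here are made up for illustration, and it assumes 2D coordinates stored as two f64 values each.

```rust
/// Bytes per 2D coordinate: two f64 values (assumed layout for this sketch).
const BYTES_PER_XY: u64 = 2 * std::mem::size_of::<f64>() as u64;

/// Smallest number of coordinates whose final offset no longer fits in an i32.
fn min_coords_to_overflow_i32() -> u64 {
    i32::MAX as u64 + 1
}

/// Coordinate bytes needed before i32 offsets overflow.
fn min_bytes_to_overflow_i32() -> u64 {
    min_coords_to_overflow_i32() * BYTES_PER_XY
}
```

Here `min_coords_to_overflow_i32()` is 2,147,483,648 and `min_bytes_to_overflow_i32()` is 34,359,738,368 bytes, which is exactly 32 GiB.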
Polars has shown no interest in supporting Arrow extension types, and so I think it's time to greatly simplify the codebase. Any future integration with polars would carry a bit of overhead to cast its i64 offsets to i32 (unless they add a column type that can hold a generic Arrow type), but at this point that's worth the development simplicity here. And DataFusion is showing sincere interest in supporting user-defined types, so I think that's our path forward.
Ref https://github.com/geoarrow/geoarrow-rs/issues/801