geoarrow / geoarrow-rs

GeoArrow in Rust, Python, and JavaScript (WebAssembly) with vectorized geometry operations
http://geoarrow.org/geoarrow-rs/
Apache License 2.0

Remove `i64` offset support #802

Closed kylebarron closed 1 month ago

kylebarron commented 1 month ago

Arrow list arrays have a values array and an offsets array. The offsets mark where each row's slice of the values array starts and ends, so the difference between consecutive offsets is the number of values in that row. The Arrow spec allows these offsets to be either i32 or i64. We originally wanted to support both because polars only supports i64 offsets.
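As a rough illustration of the layout described above, here is a plain-Rust sketch that does not use the arrow crate. The names `list_row` and `example_list` are made up for this example; the list `[[1, 2], [], [3, 4, 5]]` is stored as one flat values buffer plus an offsets buffer with one more entry than there are rows.

```rust
/// Return the slice of `values` that belongs to row `i`.
/// `offsets` follows the Arrow convention: row `i` spans
/// `offsets[i]..offsets[i + 1]`, so it has `rows + 1` entries.
fn list_row<'a, T>(values: &'a [T], offsets: &[i32], i: usize) -> &'a [T] {
    let start = offsets[i] as usize;
    let end = offsets[i + 1] as usize;
    &values[start..end]
}

/// The list [[1, 2], [], [3, 4, 5]] as (values, offsets).
fn example_list() -> (Vec<i32>, Vec<i32>) {
    let values = vec![1, 2, 3, 4, 5];
    let offsets = vec![0, 2, 2, 5];
    (values, offsets)
}
```

Note that the last offset always equals the length of the values buffer, which is why the offset type bounds how many values a single array can hold.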

But there is an incredible amount of complexity currently in supporting geometry arrays with i64 offsets. There's an O: OffsetSizeTrait everywhere. The Geometry scalar is currently parametrized by O: OffsetSizeTrait, making it hard to use generically. And every place where we downcast arrays has roughly 2x the variants because we have Large* data types everywhere.
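To give a feel for where the duplication comes from, here is a toy sketch. Everything in it is hypothetical: `OffsetSize` is a stand-in for the real `OffsetSizeTrait`, and `LineStrings` / `AnyLineStrings` are invented for illustration and are not the geoarrow-rs types. The point is only that once an array is generic over its offset type, every type-erased wrapper and every `match` over it needs a small and a large arm.

```rust
/// Hypothetical stand-in for an offset-size trait (not the real arrow trait).
trait OffsetSize: Copy {
    fn as_usize(self) -> usize;
}
impl OffsetSize for i32 {
    fn as_usize(self) -> usize { self as usize }
}
impl OffsetSize for i64 {
    fn as_usize(self) -> usize { self as usize }
}

/// Toy line string array, generic over the offset type.
struct LineStrings<O: OffsetSize> {
    coords: Vec<[f64; 2]>,
    geom_offsets: Vec<O>,
}

impl<O: OffsetSize> LineStrings<O> {
    /// Number of line strings (one fewer than the number of offsets).
    fn len(&self) -> usize {
        self.geom_offsets.len().saturating_sub(1)
    }
    /// Number of coordinates in line string `i`.
    fn num_coords(&self, i: usize) -> usize {
        let n = self.geom_offsets[i + 1].as_usize() - self.geom_offsets[i].as_usize();
        debug_assert!(n <= self.coords.len());
        n
    }
}

/// Type-erased wrapper: each geometry type needs two variants.
enum AnyLineStrings {
    Small(LineStrings<i32>),
    Large(LineStrings<i64>),
}

impl AnyLineStrings {
    /// Every operation has to be dispatched over both variants.
    fn len(&self) -> usize {
        match self {
            AnyLineStrings::Small(a) => a.len(),
            AnyLineStrings::Large(a) => a.len(),
        }
    }
}
```

With only i32 offsets, the generic parameter, the trait, and the `Large` arm all disappear.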

The largest value an i32 offset can hold is 2^31 - 1, or 2,147,483,647. So in order to overflow i32 offsets, a values array would have to hold at least 2^31 elements. But for 2D coordinates, each element of values would have two f64s, or 16 bytes. So that means you'd need to have 2^31 * 16 bytes, or 32 GiB of coordinate data, in a single array.
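The size estimate can be checked with a few lines of Rust. This is a back-of-the-envelope sketch with made-up function names. It relies on the fact that the last offset equals the length of the values array, so the limit follows directly from `i32::MAX`, and it assumes 2D coordinates stored as two f64 values each.

```rust
/// Bytes per 2D coordinate: two f64 values.
const BYTES_PER_XY: u64 = 2 * std::mem::size_of::<f64>() as u64;

/// Smallest values-array length that can no longer be addressed
/// by i32 offsets: one past `i32::MAX`.
fn min_overflowing_len() -> u64 {
    i32::MAX as u64 + 1
}

/// Bytes of 2D coordinate data needed to reach that length.
fn min_overflowing_bytes() -> u64 {
    min_overflowing_len() * BYTES_PER_XY
}
```

`min_overflowing_bytes()` evaluates to 2^35 bytes, which is 32 GiB.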

Polars has shown no interest in supporting Arrow extension types, and so I think it's time to greatly simplify the codebase. Some future integration with polars would have a bit of overhead (unless they add a column type that can be a generic Arrow type) to cast the i64 offsets to i32, but at this point it's worth the development simplicity here. And DataFusion is showing sincere interest in supporting user-defined types, so I think that's our path forward.
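For reference, the cast overhead mentioned above amounts to one checked pass over the offsets buffer. The following is a minimal plain-Rust sketch, not the arrow or geoarrow-rs API, and `downcast_offsets` is a hypothetical helper name. It returns `None` if any offset does not fit in an i32.

```rust
/// Convert i64 offsets to i32 offsets, or return `None` if any
/// offset is outside the i32 range.
fn downcast_offsets(offsets: &[i64]) -> Option<Vec<i32>> {
    offsets
        .iter()
        .map(|&o| i32::try_from(o).ok())
        .collect()
}
```

The values buffer is untouched by this conversion; only the offsets, one per row plus one, need to be rewritten.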

Ref https://github.com/geoarrow/geoarrow-rs/issues/801