geoarrow / geoarrow

Specification for storing geospatial data in Apache Arrow
https://geoarrow.org
BSD 3-Clause "New" or "Revised" License

Necessity of two coordinate layout options? #72

Open b4l opened 4 hours ago

b4l commented 4 hours ago

The format specification states:

Compared to the Struct representation of a coordinate array, this representation [interleaved] may provide better performance for some operations and/or provide better compatibility with the memory layout of existing libraries.

This is not a very convincing reason for having two options.

Are there any concrete resources like evaluations and benchmarks favoring one or the other?

paleolimbot commented 4 hours ago

You can see https://github.com/geoarrow/geoarrow/pull/26 for the backstory here!

In general with this specification, there is some tension between being very specific (such that libraries implementing compute don't have to handle things like multiple coordinate representations), and very broad (so that producers can "slap" an extension type on their existing memory without performing a copy). This is also why we have serialized representations (e.g., WKT and WKB) here, so that non-spatial components like database drivers can mark their output without implementing or depending on a parser.

The separated (struct) representation is used by GeoParquet (where we get usable column statistics from keeping xs and ys separated), and DuckDB spatial (for its "native" types). It is also how many points are stored in tables (e.g., with an x column and a y column, which can be zero-copy converted to a "struct"). Structs are better supported than fixed-size lists as well (and the fixed-size list doesn't exist in some places like DuckDB).
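To make the column-statistics point concrete, here is a minimal sketch using only Python's standard-library `array` module rather than Arrow itself. The sample points and the helper names (`column_stats`, `bbox_separated`, `bbox_interleaved`) are made up for illustration and are not part of any GeoArrow API.

```python
from array import array

# Three hypothetical points: (1, 10), (2, 20), (3, 30).

# Interleaved layout: a single buffer in which x and y alternate.
interleaved = array("d", [1.0, 10.0, 2.0, 20.0, 3.0, 30.0])

# Separated (struct) layout: one contiguous buffer per dimension.
xs = array("d", [1.0, 2.0, 3.0])
ys = array("d", [10.0, 20.0, 30.0])


def column_stats(values):
    """Return (min, max) of one coordinate column."""
    return min(values), max(values)


def bbox_separated(xs, ys):
    """Bounding box from per-column statistics: (xmin, ymin, xmax, ymax)."""
    xmin, xmax = column_stats(xs)
    ymin, ymax = column_stats(ys)
    return xmin, ymin, xmax, ymax


def bbox_interleaved(coords):
    """Same bounding box, but each dimension must first be picked out
    with a stride of 2 (these slices copy the data in this sketch)."""
    return bbox_separated(coords[0::2], coords[1::2])


print(bbox_separated(xs, ys))
print(bbox_interleaved(interleaved))
```

With separated buffers, the min/max of each buffer is already a meaningful per-dimension statistic. With the interleaved buffer, a min/max over the whole buffer would mix x and y values, so the dimensions have to be separated first.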

The interleaved coordinate representation is in theory the standard for most spatial libraries at the moment, although there aren't currently any examples where we are achieving zero-copy conversion that I am aware of (we might be getting better performance copying from WKB since those coordinates are interleaved, but I don't think any of us have checked).

If you can't tell, I'm a fan of the struct representation and wish we'd only gone with that one (but @kylebarron I'm pretty sure doesn't agree with me!)

kylebarron commented 4 hours ago

but @kylebarron I'm pretty sure doesn't agree with me!

I'd generally prefer having only one coordinate representation. But it's hard because some pieces of the ecosystem might only support one or the other.

e.g. today deck.gl only supports the interleaved layout (they'd be open to implementing support for separated coordinates, but the implementation would probably take a while since it would need some GPU-level code changes). So even if GeoArrow only supported separated coordinates, I'd still need to support interleaved coordinates in places like lonboard (at least temporarily).

b4l commented 3 hours ago

Yeah, I am a fan too! From a quick read, SoA (struct of arrays) has been shown to be superior in almost all cases; nonetheless, the interleaved layout made it into the spec. As pointed out by @jorisvandenbossche, one can easily convert one layout to the other, so I don't see why both should be supported. If people prefer an interleaved layout for convenience or whatever, they are free to use it; however, it should not cross the interface boundary, in my opinion. While zero-copy is fancy, it's rather limited to certain parts of the whole data lifecycle, in which a layout conversion is probably negligible given an adequate transformation approach.
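For reference, the conversion mentioned above can be sketched in a few lines. This is an illustration using Python's standard-library `array` module for two-dimensional coordinates, not how any GeoArrow implementation actually does it; the function names `to_interleaved` and `to_separated` are hypothetical.

```python
from array import array


def to_interleaved(xs, ys):
    """Copy separated x/y buffers into one interleaved x0, y0, x1, y1, ... buffer.

    Both inputs are assumed to be array('d') objects of equal length.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    out = array("d", [0.0]) * (2 * len(xs))  # zero-filled output buffer
    out[0::2] = xs  # even slots receive x values
    out[1::2] = ys  # odd slots receive y values
    return out


def to_separated(coords):
    """Copy an interleaved buffer into separate x and y buffers."""
    if len(coords) % 2 != 0:
        raise ValueError("interleaved buffer must have an even length")
    return coords[0::2], coords[1::2]


xs = array("d", [1.0, 2.0, 3.0])
ys = array("d", [10.0, 20.0, 30.0])

coords = to_interleaved(xs, ys)
xs_back, ys_back = to_separated(coords)
print(coords.tolist())
print(xs_back == xs and ys_back == ys)
```

Each direction is a single pass over the coordinates plus one new allocation, which is the copy that a zero-copy hand-off would avoid.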