Open H-Plus-Time opened 4 months ago
Thanks for the issue! I definitely welcome some thoughts on the API design.
> I may have played around with a wasm-bindgen'd equivalent of the python GeoTable (notes on that in a bit) - that's one of the goals of the JS target, yeah?
I'm not sure if you've noticed, but I've renamed the underlying Rust struct to `Table` and was planning to also rename the Python class to `Table` as well. This enables a few things:
A table works just the same with spatial and non-spatial data, and with multiple geometry columns. Say you're reading a non-standard Parquet file where geometry is encoded as WKT. That would fail with the current Table abstraction because it requires a known geometry column in GeoArrow representation. With a fully general Table representation, you could do
```python
table = read_parquet('path/to/file.parquet')
new_table = table.append_column(parse_wkt(table.column('wkt_geometry')), 'geometry')
```

`new_table` would then be inferred as having a geometry column because of the `geoarrow.*` tag on the new column.
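For illustration, that metadata-based inference could look something like the sketch below. The schema and field shapes are hypothetical stand-ins, not the real geoarrow API; the `ARROW:extension:name` key is the standard Arrow extension-type metadata key.

```javascript
// Sketch of the inference step above: a column counts as a geometry column
// if its field metadata carries a geoarrow.* extension tag. The schema and
// field shapes here are hypothetical stand-ins, not the real geoarrow API.
function isGeometryField(field) {
  const ext = (field.metadata || {})["ARROW:extension:name"];
  return typeof ext === "string" && ext.startsWith("geoarrow.");
}

function geometryColumnNames(schema) {
  return schema.fields.filter(isGeometryField).map((f) => f.name);
}

// A schema where the parsed WKT column was appended with a geoarrow tag:
const exampleSchema = {
  fields: [
    { name: "id", metadata: {} },
    { name: "wkt_geometry", metadata: {} },
    { name: "geometry", metadata: { "ARROW:extension:name": "geoarrow.point" } },
  ],
};
```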
Actually the API here would be a bit more annoying because there's no way for a column (a `Vec<Arc<dyn Array>>`) to maintain its own field metadata. So more likely we'd need `parse_wkt` to return a tuple of (`ChunkedArray`, `Field`).
Actually, thinking about this a bit more, this wouldn't be necessary (in Python) as long as `append_column` supports the PyCapsule interface: the result of `parse_wkt` would be something like a `ChunkedPointArray`, which would then have its own `__arrow_c_stream__` method, which includes the field metadata.
It would be nice to have an equivalent in JS to the Python PyCapsule interface, but this is harder to do in JS because of the memory sandboxing between JS and Wasm. But it would be nice to solve this so that functions can transparently take in either existing wasm data or Arrow JS instances. I started the mocking/duck typing of Arrow JS here https://github.com/kylebarron/arrow-wasm/tree/main/src/arrow_js. That allows for directly copying Arrow JS into arrow-rs without going through IPC, and is kinda a reverse of https://github.com/kylebarron/arrow-js-ffi without having to implement writing to the C data interface from scratch.
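A toy version of that "transparently take either" idea is sketched below. The discriminating properties (a wasm-bindgen pointer field vs. a schema plus record batches) are assumptions for illustration, not the actual arrow-wasm or Arrow JS internals:

```javascript
// Duck-typing dispatch: accept either an existing Wasm-side table or an
// Arrow JS table, copying across the memory sandbox only when needed.
function isWasmTable(obj) {
  // Assumption: wasm-bindgen classes expose an internal pointer field.
  return obj != null && typeof obj.__wbg_ptr === "number";
}

function isArrowJsTable(obj) {
  // Assumption: Arrow JS tables carry a schema and an array of batches.
  return obj != null && obj.schema !== undefined && Array.isArray(obj.batches);
}

function intoWasm(obj, copyFromArrowJs) {
  if (isWasmTable(obj)) return obj; // already on the Wasm side, no copy
  if (isArrowJsTable(obj)) return copyFromArrowJs(obj); // copy across the sandbox
  throw new TypeError("expected a Wasm table or an Arrow JS table");
}
```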
> - It is definitely impossible to return a geometry column of a defined type to JS solely through wasm-bindgen, originating from a `dyn ChunkedGeometryArrayTrait` - a wrapper with per-geometry, fallible `as_<geometry_type>` methods is necessary. Maybe one day we'll get enums that are allowed to contain data :-/.
It's not impossible if you do dynamic downcasting. E.g. in Python, the existing method on the table, `def geometry(self)`, returns one of the concrete classes, either `ChunkedPointArray` or `ChunkedLineStringArray` etc. In JS we'd do similar and return `PointVector` etc.
See https://github.com/geoarrow/geoarrow-rs/blob/42aa07bb7e0bba48b123f0815432772eeae9c84f/python/core/src/table/mod.rs#L24 and https://github.com/geoarrow/geoarrow-rs/blob/42aa07bb7e0bba48b123f0815432772eeae9c84f/python/core/src/ffi/to_python/array.rs#L114-L152
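A minimal sketch of that dynamic-downcast approach: inspect the geoarrow extension name and hand back the matching concrete wrapper. The classes and registry below are illustrative placeholders, not the real exports:

```javascript
// Placeholder concrete wrapper classes standing in for the real vectors.
class PointVector {}
class LineStringVector {}

// Map from geoarrow extension name to the concrete JS wrapper.
const VECTOR_BY_EXTENSION = {
  "geoarrow.point": PointVector,
  "geoarrow.linestring": LineStringVector,
};

function downcastGeometry(extensionName) {
  const Ctor = VECTOR_BY_EXTENSION[extensionName];
  if (!Ctor) throw new TypeError(`unsupported geometry type: ${extensionName}`);
  return new Ctor();
}
```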
> - Geospatial operations that always return the same geometry type are perfectly fine, both directly on the GeoTable and... I'm calling it UnknownGeometryVector for the moment. Type-hinting would be quite good for those.
If we do dynamic downcasting, this shouldn't be an issue.
> - sjoins would probably be equally as inscrutable as they usually are.
My expected sjoin API is `sjoin(left_table: Table, right_table: Table, left_geom: str | None, right_geom: str | None, how: "inner" | "outer") -> Table`. If `left_table` and `right_table` each have only one geometry column, then neither `left_geom` nor `right_geom` needs to be provided.
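That "optional when unambiguous" rule could be resolved along these lines. The schema shape is a hypothetical stand-in, and `resolveGeomColumn` is a made-up helper name:

```javascript
// Defaulting rule from the sjoin signature above: an explicit column name
// wins; otherwise there must be exactly one geoarrow-tagged column.
function resolveGeomColumn(schema, explicitName) {
  if (explicitName != null) return explicitName;
  const geomCols = schema.fields
    .filter((f) => ((f.metadata || {})["ARROW:extension:name"] || "").startsWith("geoarrow."))
    .map((f) => f.name);
  if (geomCols.length !== 1) {
    throw new Error(`table has ${geomCols.length} geometry columns; pass left_geom/right_geom explicitly`);
  }
  return geomCols[0];
}
```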
Oops, you're indeed right (I was stuck between a combination of tunnel vision on returning enums and juggling the exact sequence of castings to get there). `Arc$WasmVectorName(foo.as_ref().as_$geom_name().clone()).into()` does the trick :facepalm:. Cool, I'll override the type definition for that method too, to the union of result Wasm vectors; that should sort it.
That sjoin API, that'd also include the predicate right? (within, intersects, etc).
> That sjoin API, that'd also include the predicate right? (within, intersects, etc).
Yes, that too
Alright, so agreed re naming (`Table`). There are a few scenarios for how I'd see this working, with varying degrees of re-use/integration and usability:

The (very limited) spatial-related methods (presumably including one that details the geometry-containing columns: index, name, detected geometry type) would need good handling for the spatial-free case, of course.
Option 1: IO methods return a `JsValue`, via an intermediate match-case conversion to `geoarrow_wasm::table::Table` or `arrow_wasm::table::Table`. This would require a way to convert between them in JS and deal with manual conversions, with something like:

```js
const plainTable = parquetFile.read(); // just an arrow_wasm table at this point, no spatially aware methods available
const geometryColumn = wasm.parseWkt(plainTable.column('wkt_geometry')); // one of the geoarrow-wasm geometry vectors
const outputTable = wasm.Table.from(plainTable).append_column(geometryColumn, 'geometry'); // geoarrow-wasm's Table; geometry methods will only produce meaningful results after the append_column
// alternatively
const outputTable2 = wasm.Table.from_arrow_and_geometry(plainTable, { 'geometry': geometryColumn }); // or something to that effect
```
Option 2: A feature-flag in arrow-wasm, adding a dependency on geoarrow-wasm (! cyclic dependency, cargo might be okay with these 🤷), enabling as little as possible. This would be required for geoparquet-wasm (and sibling) IO methods to return `arrow_wasm::Table` results that are capable of inferring and auto-casting this:

```js
const geometryColumn = wasm.parseWkt(plainTable.column('wkt_geometry'));
plainTable.append_column(geometryColumn, 'geometry');
```

to a `geoarrow_wasm::Table`, not an `arrow_wasm::Table`. The equivalent in reverse (`outputTable.remove_column('geometry')`, which should downgrade a `geoarrow_wasm::Table` to an `arrow_wasm::Table`) is much less of a problem since arrow_wasm is already a dependency.
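A toy illustration of the implicit upgrade/downgrade that option 2 implies: appending a geometry column promotes a plain table, removing the last one demotes it. Both classes and the column shape are placeholders; the real crates would do this across the arrow-wasm / geoarrow-wasm boundary:

```javascript
// Plain table: upgrades to GeoTable when a geometry column is appended.
class PlainTable {
  constructor(columns = {}) { this.columns = columns; }
  appendColumn(name, column) {
    const columns = { ...this.columns, [name]: column };
    const hasGeometry = Object.values(columns).some((c) => c.isGeometry);
    return hasGeometry ? new GeoTable(columns) : new PlainTable(columns);
  }
}

// Spatially-enabled table: downgrades when its last geometry column is removed.
class GeoTable extends PlainTable {
  removeColumn(name) {
    const { [name]: _removed, ...rest } = this.columns;
    const stillGeo = Object.values(rest).some((c) => c.isGeometry);
    return stillGeo ? new GeoTable(rest) : new PlainTable(rest);
  }
}
```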
Option 1 is the least trouble, though it provides the least confidence that a geoarrow_wasm::Table's spatial methods will actually do anything.
Option 2 is very close to geopandas' and pyogrio's approach - IO methods usually produce GeoDataframes but can produce plain old pandas dataframes (e.g. read_geometry=False, no geometry column found, etc), pandas dataframes can be consumed into GeoDataframes explicitly (but are not auto-upgraded by adding a GeoSeries column to a pandas DataFrame), GeoDataframes are implicitly downgraded once they lose their geometry column (by column slicing, drop, overwriting with wkb, etc).
> It would be nice to have an equivalent in JS to the Python PyCapsule interface, but this is harder to do in JS because of the memory sandboxing between JS and Wasm
Agreed that it's harder to do for objects crossing memory sandboxes, but you are pretty likely to have parse_wkt and Table::append_column functions as part of the same wasm module - still very useful to be able to glue them together with JsCapsules, but no memory sandboxing problems. That specific form should be pretty doable.
I'm not inclined to have separate structs for a Table in multiple crates. My API surface is large enough that I don't want to have multiple objects, where we might have to re-implement non-spatial table operations on both `Table` and a spatially-enabled `Table`, especially for Python/JS bindings where we can't use traits for the binding definitions.
I think my plan is none of the above. I don't want a `geoarrow_wasm::Table` to need to exist. A GeoArrow table is "just" an Arrow table with geoarrow metadata on one or more geometry columns.
We can loop through column metadata to look for the geometry column, or the user can pass in the column name/index. This is how the current `geoarrow::table::geometry` method works. It returns an `Option<Arc<dyn ChunkedGeometryArrayTrait>>`.
To be clear, the `geoarrow::table::Table` struct only exists in `geoarrow` because the `arrow` crate doesn't define a `Table` object, but for parity with Python and JS, I feel it's important for that concept to exist.
So I think all geoarrow-wasm IO methods should return an `arrow_wasm::Table`. And in this case I think having a top-level function `geometry(table: Table, index?: null | string | number) -> Vector` makes sense, where a user can easily access the geometry column of a table.
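A sketch of what that top-level accessor might do: look the column up by numeric index or name when given, otherwise scan field metadata for a single geoarrow-tagged column. The table shape here is a hypothetical stand-in, not the real arrow-wasm API:

```javascript
// Proposed top-level geometry() accessor (toy version).
function geometry(table, index = null) {
  const { fields } = table.schema;
  if (typeof index === "number") return table.columns[index];
  if (typeof index === "string") {
    const i = fields.findIndex((f) => f.name === index);
    if (i === -1) throw new Error(`no column named ${index}`);
    return table.columns[i];
  }
  // No index given: require exactly one geoarrow-tagged column.
  const tagged = fields
    .map((f, i) => ({ f, i }))
    .filter(({ f }) => ((f.metadata || {})["ARROW:extension:name"] || "").startsWith("geoarrow."));
  if (tagged.length !== 1) {
    throw new Error(`expected exactly one geometry column, found ${tagged.length}`);
  }
  return table.columns[tagged[0].i];
}

// Toy table with one tagged geometry column:
const toyTable = {
  schema: {
    fields: [
      { name: "id", metadata: {} },
      { name: "geometry", metadata: { "ARROW:extension:name": "geoarrow.point" } },
    ],
  },
  columns: ["id-column", "geometry-column"],
};
```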
Essentially, this pushes us more towards a functional API that takes in a table as the first argument rather than a method API. This is even more useful in Python where we can interpret any other Arrow object with zero copy. So we want to work with any arrow object, whether it's one we define or one defined by pyarrow or gdal or polars or duckdb or whatever.
Thinking about the Pandas/GeoPandas split, I'd argue that the big distinction here is that Pandas didn't originally have first-class support for extension types. You weren't able to add associated metadata to columns or tables.
I'm really not a fan of geopandas and pyogrio behavior here. I feel strongly that there should be a statically inferred output type for any input function. Perhaps there should be two overloads that return different output types for a different combination of input types, but at least it's known at compile time. I'm not fond of returning two different Python object types.
p.s. I'm exhausted as I'm typing this so it's possible I didn't explain myself as well as I could've. Feel free to ask more questions
> Option 1 is the least trouble, though it provides the least confidence that a geoarrow_wasm::Table's spatial methods will actually do anything.
At some level, this question is: is it worth the development overhead of special-casing tons of geospatial-specific table/array objects when it could alternately be represented as a generic array + metadata?
I'm also pondering this around our current approach of having different exports for every geometry array type. When we add Z/M dimension support, should we have `PointArrayZ`, `PointArrayM`, `PointArrayZM` exports as well? That's a huge amount of complexity.
E.g. on the Python side, I've been thinking a rust-python Arrow building block library like arrow-wasm might make sense. Then maybe we have generic `Table`, `Array`, `ChunkedArray`, `Field` and `Schema` exports from that module and only `GeometryArray` and `ChunkedGeometryArray` exports from `geoarrow.rust.core`. That would massively simplify exports (in theory maybe we don't even need a `GeometryArray` class if we have a purely-functional API) at the cost of not having as great static typing.
But at this point, I'm coming to believe that reducing maintenance complexity will be massively needed. I believe these GeoArrow-based projects are a multi-year endeavor (thinking ~5 years to maturity) and I need to ensure that I don't burn myself out. So reducing geometry-specific class exports will hopefully both improve interoperability and reduce API surface area.
Excellent, that makes things much, much clearer.
Cool, so toy usage more or less looks like:

```js
// named exports semi-fantasised
import { geometry, centroid, convexHull, sjoin } from 'geoparquet-wasm';

// mandatory functional
const foo = geometry(table);

// mixed functional/method-based, what we have today on things like PointVector, etc
const intermediate = foo.convexHull().centroid(); // semi-pointless, demonstrates method-chaining

// fully functional approach
const intermediate2 = centroid(convexHull(foo)); // isomorphic with the above, easy to use with functional composition

// mandatory functional
const intersectionTable = sjoin(table, otherTable, { how: 'inner', predicate: 'intersects' });
```
> I need to ensure that I don't burn myself out
:point_up: hard agree on this :face_exhaling:
> At some level, this question is: is it worth the development overhead of special-casing tons of geospatial-specific table/array objects when it could alternately be represented as a generic array + metadata?
I think cutting geospatial-specific tables is the right call, but the individual geometry type arrays are useful (and have the advantage of having already been implemented), and highly macro-able. There's also just enough variation between them (that is, a very small subset of relations/predicates that are undefined, nonsensical, or only defined on a given geometry type) to justify treating them separately.
> Z/M dimension support
Yeah, independent Z, M, Z+M could be one that we throw up our hands and say "static typing on standard-ish geometry arrays is valuable enough to support. these? even with individually defined types, static typing is extremely likely to fail altogether when mixing them. Here's GeometryArrayZ, GeometryArrayM, GeometryArrayZM, best of luck".
> Cool, so toy usage more or less looks like:
Yes, absolutely.
> > I need to ensure that I don't burn myself out
>
> ☝️ hard agree on this 😮‍💨
🙏
> I think cutting geospatial-specific tables is the right call, but the individual geometry type arrays are useful (and have the advantage of having already been implemented), and highly macro-able. There's also just enough variation between them (that is, a very small subset of relations/predicates that are undefined, nonsensical, or only defined on a given geometry type) to justify treating them separately.
I think that they will always be individual structs on the Rust side. Definitely big pros and cons to whether the bindings should have individual array exports. It's also possible that there should be a single `GeometryArray` export that just stores an `Arc<dyn GeometryArrayTrait>` but then have additional types on the typescript side. I'm not sure.
> Yeah, independent Z, M, Z+M could be one that we throw up our hands and say "static typing on standard-ish geometry arrays is valuable enough to support. these? even with individually defined types, static typing is extremely likely to fail altogether when mixing them. Here's GeometryArrayZ, GeometryArrayM, GeometryArrayZM, best of luck".
yeah... just feels weird to have `PointArray` but then `GeometryArrayZ` being an enum over types on the rust side. but maybe that's a good option
> I may have played around with a wasm-bindgen'd equivalent of the python GeoTable (notes on that in a bit) - that's one of the goals of the JS target, yeah?
Alright, notes so far:
> ...a wrapper with per-geometry, fallible `as_<geometry_type>` methods is necessary. Maybe one day we'll get enums that are allowed to contain data :-/.

A sample of some of the usage I've gotten out of it:
One thing that could avoid the constant `fooTable.geometry().as_polygon()` unpleasantness is an ES6 Proxy. Spawning one of those on class/struct construction, that's one I'm pretty sure is doable.