geoarrow / geoarrow-rs

GeoArrow in Rust, Python, and JavaScript (WebAssembly) with vectorized geometry operations
http://geoarrow.org/geoarrow-rs/
Apache License 2.0
243 stars 14 forks source link

Equivalent GeoTable abstraction for JS target #620

Open H-Plus-Time opened 4 months ago

H-Plus-Time commented 4 months ago

I may have played around with a wasm-bindgen'd equivalent of the python GeoTable (notes on that in a bit) - that's one of the goals of the JS target, yeah?

Alright, notes so far:

  1. It is definitely impossible to return a geometry column of a defined type to JS solely through wasm-bindgen, originating from a dyn ChunkedGeometryArrayTrait - a wrapper with per-geometry, fallible as_<geometry_type> methods is necessary. Maybe one day we'll get enums that are allowed to contain data :-/.
  2. Geospatial operations that always return the same geometry type are perfectly fine, both directly on the GeoTable and... I'm calling it UnknownGeometryVector for the moment. Type-hinting would be quite good for those.
  3. sjoins would probably be equally as inscrutable as they usually are.

A sample of some of the usage I've gotten out of it:

const dataset = await new ParquetDataset(urls);
const table = await dataset.read(readOptions);
const basicMetrics = [table.centroid(), table.convexHull()];
const geometryColumn = table.geometry().as_polygon();
const directlyAccessedMetrics = [
  geometryColumn.centroid(), geometryColumn.convexHull()
];

One thing that could avoid the constant fooTable.geometry().as_polygon() unpleasantness is an ES6 Proxy along the lines of:

const proxiedTableInstance = new Proxy(tableInstance, {
    get(target, property) {
        if (property === 'geometry_column') {
            // Intercept calls to geometry_column() and automatically call as_<geometryType>()
            return (...args) => {
                const geometryObject = target.geometry_column(...args);
                const asMethod = `as_${target.geometryType}`;
                if (typeof geometryObject[asMethod] === 'function') {
                    return geometryObject[asMethod]();
                } else {
                    throw new Error(`Method ${asMethod} not found in geometry object.`);
                }
            };
        }
        return target[property];
    },
});

Spawning one of those on class/struct construction, that's one I'm pretty sure is doable.

kylebarron commented 4 months ago

Thanks for the issue! I definitely welcome some thoughts on the API design.

I may have played around with a wasm-bindgen'd equivalent of the python GeoTable (notes on that in a bit) - that's one of the goals of the JS target, yeah?

I'm not sure if you've noticed, but I've renamed the underlying Rust struct to Table and was planning to also rename the Python class to Table as well. This enables a few things:

  1. It is definitely impossible to return a geometry column of a defined type to JS solely through wasm-bindgen, originating from a dyn ChunkedGeometryArrayTrait - a wrapper with per-geometry, fallible as_<geometry_type> methods is necessary. Maybe one day we'll get enums that are allowed to contain data :-/.

It's not impossible if you do dynamic downcasting. E.g. in Python, the existing method on the table called def geometry(self) returns one of the concrete classes, either ChunkedPointArray or ChunkedLineStringArray etc. In JS we'd do similar to return PointVector etc.

See https://github.com/geoarrow/geoarrow-rs/blob/42aa07bb7e0bba48b123f0815432772eeae9c84f/python/core/src/table/mod.rs#L24 and https://github.com/geoarrow/geoarrow-rs/blob/42aa07bb7e0bba48b123f0815432772eeae9c84f/python/core/src/ffi/to_python/array.rs#L114-L152

  1. Geospatial operations that always return the same geometry type are perfectly fine, both directly on the GeoTable and... I'm calling it UnknownGeometryVector for the moment. Type-hinting would be quite good for those.

If we do dynamic downcasting, this shouldn't be an issue.

  1. sjoins would probably be equally as inscrutable as they usually are.

My expected sjoin API is

sjoin(left_table: Table, right_table: Table, left_geom: str | None, right_geom: str | None, how: "inner" | "outer") -> Table

If left_table and right_table only have one geometry column, then neither left_geom nor right_geom need to be provided.

H-Plus-Time commented 4 months ago

Oops, you're indeed right (I was stuck between a combination of tunnel vision on returning enums and juggling the exact sequence of castings to get Arc down to JsValue) - a match statement filled with $WasmVectorName(foo.as_ref().as_$geom_name().clone()).into() does the trick :facepalm: . Cool, I'll override the type definition for that method too to the union of result Wasm vectors, that should sort it.

That sjoin API, that'd also include the predicate right? (within, intersects, etc).

kylebarron commented 4 months ago

That sjoin API, that'd also include the predicate right? (within, intersects, etc).

Yes, that too

H-Plus-Time commented 4 months ago

Alright, so agreed re naming ('Table'), there's a few scenarios for how I'd see this working, with varying degrees of re-use/integration and usability:

Option 1 - IO methods always return geoarrow_wasm::Table, with fallible geometry methods

The (very limited) spatial-related methods (presumably including one that details the geometry containing columns (index, name, detected geometry type) would need good handling for the spatial-free case of course.

Option 2 - explicit geoarrow::Table construction from arrow_wasm::Table, mostly infallible geometry methods

IO methods return a JsValue, via an intermediate match-case conversion to geoarrow_wasm::table::Table or arrow_wasm::table::Table - this would require a way to convert between them in JS and deal with manual conversions, with something like:

const plainTable = parquetFile.read() // just an arrow_wasm table at this point, no spatially aware methods available
const geometryColumn = wasm.parseWkt(plainTable.column('wkt_geometry')) // one of the geoarrow-wasm geometry vectors
const outputTable = wasm.Table.from(plainTable).append_column(geometryColumn, 'geometry') // geoarrow-wasm's Table, geometry methods will only produce meaningful results after the append_column.
// alternatively
const outputTable = wasm.Table.from_arrow_and_geometry(plainTable, {'geometry': geometryColumn}) // or something to that effect

Option 3 - arrow_wasm knowledge of geoarrow_wasm, bidirectional auto-conversion

A feature-flag in arrow-wasm, adding a dependency on geoarrow-wasm (! cyclic dependency, cargo might be okay with these 🤷 ), enabling as little as possible. This would be required for geoparquet-wasm (and sibling) IO methods to return arrow_wasm::Table results that are capable of inferring and auto-casting this:

const geometryColumn = wasm.parseWkt(plainTable.column('wkt_geometry');
plainTable.append_column(geometryColumn, 'geometry');

to a geoarrow_wasm::Table, not an arrow_wasm::Table. The equivalent in reverse (outputTable.remove_column('geometry') which should downgrade a geoarrow_wasm::Table to arrow_wasm::Table) is much less of a problem since arrow_wasm is already a dependency.


Option 1 is the least trouble, though it provides the least confidence that a geoarrow_wasm::Table's spatial methods will actually do anything.

Option 2 is very close to geopandas' and pyogrio's approach - IO methods usually produce GeoDataframes but can produce plain old pandas dataframes (e.g. read_geometry=False, no geometry column found, etc), pandas dataframes can be consumed into GeoDataframes explicitly (but are not auto-upgraded by adding a GeoSeries column to a pandas DataFrame), GeoDataframes are implicitly downgraded once they lose their geometry column (by column slicing, drop, overwriting with wkb, etc).

H-Plus-Time commented 4 months ago

It would be nice to have an equivalent in JS to the Python PyCapsule interface, but this is harder to do in JS because of the memory sandboxing between JS and Wasm

Agreed that it's harder to do for objects crossing memory sandboxes, but you are pretty likely to have parse_wkt and Table::append_column functions as part of the same wasm module - still very useful to be able to glue them together with JsCapsules, but no memory sandboxing problems. That specific form should be pretty doable.

kylebarron commented 4 months ago

I'm not inclined to have separate structs for a Table in multiple crates. My API surface is large enough, that I don't want to have multiple objects, where we might have to re-implement non-spatial table operations on both Table and a spatially-enabled Table. Especially for Python/JS bindings where we can't use traits for the binding definitions.

I think my plan is none of the above. I don't want a geoarrow_wasm::Table to need to exist. A GeoArrow table is "just" an Arrow table with geoarrow metadata on one or more geometry columns.

We can loop through column metadata to look for the geometry column, or the user can pass in the column name/index. This is how the current geoarrow::table::geometry method works. It returns an Option<Arc<dyn ChunkedGeometryArrayTrait>>.

To be clear, the geoarrow::table::Table struct only exists in geoarrow because the arrow crate doesn't define a Table object, but for parity with Python and JS, I feel it's important for that concept to exist.

So I think all geoarrow-wasm IO methods should return an arrow-wasm::Table. And in this case I think having a top-level function geometry(table: Table, index?: null | string | number) -> Vector makes sense, where a user can easily access the geometry column of a table.

Essentially, this pushes us more towards a functional API that takes in a table as the first argument rather than a method API. This is even more useful in Python where we can interpret any other Arrow object with zero copy. So we want to work with any arrow object, whether it's one we define or one defined by pyarrow or gdal or polars or duckdb or whatever.


Thinking about the Pandas/GeoPandas split, I'd argue that the big distinction here is that Pandas didn't originally have first-class support for extension types. You weren't able to add associated metadata to columns or tables.

I'm really not a fan of geopandas and pyogrio behavior here. I feel strongly that there should be a statically inferred output type for any input function. Perhaps there should be two overloads that return different output types for a different combination of input types, but at least it's known at compile time. I'm not fond of returning two different Python object types.


p.s. I'm exhausted as I'm typing this so it's possible I didn't explain myself as well as I could've. Feel free to ask more questions

kylebarron commented 4 months ago

Option 1 is the least trouble, though it provides the least confidence that a geoarrow_wasm::Table's spatial methods will actually do anything.

At some level, this question is: is it worth the development overhead of special-casing tons of geospatial-specific table/array objects when it could alternately be represented as a generic array + metadata?

I'm also pondering this around our current approach of having different exports for every geometry array type. When we add Z/M dimension support, should we have PointArrayZ, PointArrayM, PointArrayZM exports as well? That's a huge amount of complexity.

E.g. on the Python side, I've been thinking a rust-python Arrow building block library like arrow-wasm might make sense. Then maybe we have generic Table, Array, ChunkedArray, Field and Schema exports from that module and only GeometryArray and ChunkedGeometryArray exports from geoarrow.rust.core. That would massively simplify exports (in theory maybe we don't even need a GeometryArray class if we have a purely-functional API) at the cost of not having as great static typing.

But at this point, I'm coming to believe that reducing maintenance complexity will be massively needed. I believe these GeoArrow-based projects are a multi-year endeavor (thinking ~5 years to maturity) and I need to ensure that I don't burn myself out. So reducing geometry-specific class exports will hopefully both improve interoperability and reduce API surface area.

H-Plus-Time commented 4 months ago

Excellent, that makes things much, much clearer.

Cool, so toy usage more or less look like:

// named exports semi-fantasised
import { geometry, centroid, convexHull, sjoin } from 'geoparquet-wasm';
// mandatory functional
const foo = geometry(table);

// mixed functional/method-based, what we have today on things like PointVector, etc
const intermediate = foo.convexHull().centroid() // semi-pointless, demonstrates method-chaining.

// fully functional approach
const intermediate = centroid(convexHull(foo)); // isomorphic with the above, easy to use with functional composition.

// mandatory functional
const intersectionTable = sjoin(table, otherTable, {how: 'inner', predicate: 'intersects'});

I need to ensure that I don't burn myself out

:point_up: hard agree on this :face_exhaling:

At some level, this question is: is it worth the development overhead of special-casing tons of geospatial-specific table/array objects when it could alternately be represented as a generic array + metadata?

I think cutting geospatial-specific tables is the right call, but the individual geometry type arrays are useful (and have the advantage of having already been implemented), and highly macro-able. There's also just enough variation between them (that is, a very small subset of relations/predicates that are undefined, nonsensical, or only defined on a given geometry type) to justify treating them separately.

Z/M dimension support

Yeah, independent Z, M, Z+M could be one that we throw up our hands and say "static typing on standard-ish geometry arrays is valuable enough to support. these? even with individually defined types, static typing is extremely likely to fail altogether when mixing them. Here's GeometryArrayZ, GeometryArrayM, GeometryArrayZM, best of luck".

kylebarron commented 4 months ago

Cool, so toy usage more or less look like:

Yes, absolutely.

I need to ensure that I don't burn myself out

☝️ hard agree on this 😮‍💨

🙏

I think cutting geospatial-specific tables is the right call, but the individual geometry type arrays are useful (and have the advantage of having already been implemented), and highly macro-able. There's also just enough variation between them (that is, a very small subset of relations/predicates that are undefined, nonsensical, or only defined on a given geometry type) to justify treating them separately.

I think that they will always be individual structs on the Rust side. Definitely big pros and cons to whether the bindings should have individual array exports. It's also possible that there should be a single GeometryArray export that just stores a Arc<dyn GeometryArrayTrait> but then have additional types on the typescript side. I'm not sure.

Yeah, independent Z, M, Z+M could be one that we throw up our hands and say "static typing on standard-ish geometry arrays is valuable enough to support. these? even with individually defined types, static typing is extremely likely to fail altogether when mixing them. Here's GeometryArrayZ, GeometryArrayM, GeometryArrayZM, best of luck".

yeah... just feels weird to have PointArray but then GeometryArrayZ being an enum over types on the rust side. but maybe that's a good option