geoarrow / geoarrow-python

Python implementation of the GeoArrow specification
http://geoarrow.org/geoarrow-python/
Apache License 2.0

Python GeoArrow Module Proposal #38

Open kylebarron opened 9 months ago

kylebarron commented 9 months ago

Python GeoArrow Module Proposal

The strength of Arrow is in its interoperability, and therefore I think it's worthwhile to discuss how to ensure all the pieces around geoarrow-python fit together really well.

(It's possible this should be written RFC-style as a PR to a docs folder in this repo?)

Goals:

This proposal is based around the new Arrow PyCapsule Interface, which allows libraries to safely exchange data without memory leaks and without going through pyarrow. This is implemented in pyarrow as of v14, work is underway to add it to arrow-rs, and presumably nanoarrow support would not be hard to implement.

Primarily Functional API

A functional API makes it easy to take in data without knowing its provenance. Implementations may choose to also implement methods on classes if desired to improve the API usability, but nothing should be implemented solely as a method.

Data Structures

These are the data structure concepts that I think need to be first-class. Each core implementation will implement classes that conform to one of these concepts.

GeometryArray

This is a logical array of contiguous memory that conforms to the GeoArrow spec. I envision there being PointArray, LineStringArray, etc. classes that are all subclasses of this.

This object should have an __arrow_c_array__ member that conforms to the PyCapsule interface. The exported ArrowSchema must include extension type information (an extension name of geoarrow.* and optionally extension metadata).

Whether the array uses small or large list offsets internally does not matter, but the implementation should respect the requested_schema parameter of the PyCapsule interface when exporting.

GeometryStorageArray?

In geoarrow-rs I've tried to make a distinction between "storage" types (i.e. WKB and WKT) and "analysis" types (i.e. anything zero-copy). This is partially to nudge users away from storing data as WKB and then repeatedly operating directly on the WKB. Do we want to make any spec-level distinction between storage and analysis arrays? Should every operation accept storage types? I think it should be fine for a function to declare that it accepts only non-storage types and direct the user to call, say, parse_wkb.
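To make the storage/analysis split concrete, here is a hypothetical helper (the function name and the decision to key off extension names are assumptions, not part of the proposal) that a function could use to decide whether to direct the user to parse_wkb:

```python
# Serialized "storage" encodings from the GeoArrow spec; everything else
# (geoarrow.point, geoarrow.linestring, ...) is zero-copy "analysis" data.
STORAGE_ENCODINGS = {"geoarrow.wkb", "geoarrow.wkt"}

def needs_parsing(extension_name: str) -> bool:
    """Return True if data with this extension name must be parsed
    (e.g. via parse_wkb) before zero-copy kernels can run on it."""
    return extension_name in STORAGE_ENCODINGS

assert needs_parsing("geoarrow.wkb")
assert not needs_parsing("geoarrow.point")
```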

ChunkedGeometryArray

I believe that chunked arrays need to be a first-class data concept. Chunking is core to the Arrow and Parquet ecosystems, and handling something like unary_union, which requires the entire column as input to a single kernel, means understanding some type of chunked input. I envision there being ChunkedPointArray, ChunkedLineStringArray, etc. classes that are all subclasses of this.

This should have an __arrow_c_stream__ member. The ArrowSchema must represent a valid GeoArrow geometry type and must include extension type information (at least a name of geoarrow.* and optionally extension metadata).

This stream should be compatible with Dewey's existing kernel structure that allows for pushing a sequence of arrays into the kernel.
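A rough sketch of that "push a sequence of arrays into the kernel" shape, with plain Python lists standing in for Arrow chunks (all names here are illustrative, not the actual kernel API):

```python
def run_aggregate_kernel(chunks, init, step, finish):
    # An aggregate kernel (like unary_union) must consume every chunk of
    # a ChunkedGeometryArray before it can emit its single result, so the
    # driver pushes chunks in one at a time and finalizes at the end.
    state = init()
    for chunk in chunks:
        state = step(state, chunk)
    return finish(state)

# Stand-in for a real geometry aggregate: a running sum over float chunks.
total = run_aggregate_kernel(
    [[1.0, 2.0], [3.0]],
    init=lambda: 0.0,
    step=lambda acc, chunk: acc + sum(chunk),
    finish=lambda acc: acc,
)
```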

(It looks like pyarrow doesn't implement __arrow_c_stream__ for a ChunkedArray? To me it seems natural for it to exist on a ChunkedArray... I'd be happy to open an issue.)

GeometryTable

For operations like joins, kernels need to be aware not only of geometries but also of attribute columns.

This should have an __arrow_c_stream__ member. The ArrowSchema must be a struct type that includes all fields in the table. The ArrowArray must be a struct array that includes all arrays in the table. At least one child of the ArrowSchema must have GeoArrow extension type information (an extension name of geoarrow.* and optionally extension metadata).
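That "at least one child must carry GeoArrow extension metadata" rule is easy to sketch. Here is a hypothetical consumer-side check, with the schema reduced to a plain dict of column name to extension name (a real implementation would read ARROW:extension:name from each child's ArrowSchema metadata):

```python
def geometry_columns(schema_fields):
    """Return the names of children carrying GeoArrow extension types.

    schema_fields: mapping of column name -> Arrow extension name, or
    None for plain (non-extension) columns.
    """
    return [
        name
        for name, ext in schema_fields.items()
        if ext is not None and ext.startswith("geoarrow.")
    ]

fields = {"geometry": "geoarrow.point", "population": None, "name": None}
assert geometry_columns(fields) == ["geometry"]
```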

Future proofing

Spatial indexes can be serialized within a table or geometry array by having a struct containing the geometry column and a binary-typed run end array holding the bytes of the index (pending geoarrow discussion).

Not sure what other future proofing to consider.

Module hierarchy

General things to consider:

geoarrow.pyarrow

geoarrow.pandas

geoarrow.shapely

geoarrow.gdal

Wrapper around pyogrio?

geoarrow.c

I'll let dewey give his thoughts here.

geoarrow.rust.core

geoarrow.rust.proj, geoarrow.rust.geos

Downsides

Static Typing

A full proposal for static typing is out of the scope of this proposal (and some operations just won't be possible to type accurately).

A few methods will be amenable to generics, as shown below. But ideally every function can be given a return type that matches one of the Arrow PyCapsule protocols. At least in the Rust implementation, I'd like to have type stubs that accurately type the returned classes (though sadly I'll still have to write the .pyi type stubs by hand).

from typing import Protocol, Tuple, TypeVar, reveal_type

class ArrowArrayExportable(Protocol):
    def __arrow_c_array__(
        self, requested_schema: object | None = None
    ) -> Tuple[object, object]:
        ...

class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        ...

ArrayT = TypeVar("ArrayT", bound=ArrowArrayExportable)
StreamT = TypeVar("StreamT", bound=ArrowStreamExportable)

class PointArray:
    def __arrow_c_array__(
        self, requested_schema: object | None = None
    ) -> Tuple[object, object]:
        ...

class ChunkedPointArray:
    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        ...

def translate(array: ArrayT | StreamT, x: float, y: float) -> ArrayT | StreamT:
    ...

p = PointArray()
p2 = translate(p, 1, 1)
reveal_type(p2)
# Type of "p2" is "PointArray"

cp = ChunkedPointArray()
cp2 = translate(cp, 1, 1)
reveal_type(cp2)
# Type of "cp2" is "ChunkedPointArray"
paleolimbot commented 9 months ago

This is awesome! All comments I have are just details.

It looks like pyarrow doesn't implement __arrow_c_stream__ for a ChunkedArray?

It definitely should! Opening an issue would be helpful (I can take a stab at implementing).

Primarily Functional API

Agreed! It may be nice to (optionally) expose something like the Kernel class (maybe called something else like Algorithm or Operation or Function). Notably, this will let you predict the output type (in the Arrow sense) of the operation given zero or more input types. This doesn't help you with static typing in the IDE sense but it is pretty useful and hard to replicate in a purely functional interface. Maybe this is a dunder on the function (i.e., "functions" would actually be objects implementing __call__() for humans and __geoarrow_output_type__() or something for UDF registration or whatever else needs that kind of information).

In addition to output type, something describing the input/output shape relationship would be nice (scalar function, vector function, aggregate function, window function).
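A toy sketch of the "functions as objects" idea: __call__() for humans, plus the hypothetical __geoarrow_output_type__() dunder for UDF registration, with the input/output shape relationship carried as a plain attribute (every name here besides __call__ is an assumption from this comment, not an agreed API):

```python
class Kernel:
    """A compute function as an object: callable, but also able to
    predict its output Arrow type from zero or more input types."""

    def __init__(self, name, type_rule, shape="scalar"):
        self.name = name
        self._type_rule = type_rule
        # One of: "scalar", "vector", "aggregate", "window".
        self.shape = shape

    def __geoarrow_output_type__(self, *input_types):
        # Predict the output type without executing the kernel.
        return self._type_rule(*input_types)

    def __call__(self, *arrays):
        raise NotImplementedError("actual compute not shown in this sketch")

# area: polygon input -> float64 output, element-wise.
area = Kernel("area", lambda geom_type: "double", shape="scalar")
```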

Data structures

I think in practice we might want to consider anything "geoarrowish" (i.e., it will work when passed to any Python function in geoarrow land) that implements __arrow_c_schema__ (returning a geoarrow extension type) and implements __arrow_c_stream__ or __arrow_c_array__. In Rust it will probably be helpful/easy for you to implement the classes you describe, and it will probably be very helpful for all of us! I don't think I can do that in geoarrow.c, though...I will probably just return some very minimal object that implements the requisite dunder methods. geoarrow.pyarrow already implements separate classes for separate types, but because of the chunked thing it can never guarantee anything about the return value of any function except Array | ChunkedArray.
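The duck-typed "geoarrowish" test described above can be written as a small structural check (the function name is hypothetical; a complete version would also import the schema and verify the geoarrow.* extension name rather than just checking for the dunders):

```python
def is_geoarrowish(obj) -> bool:
    """True if obj looks consumable by geoarrow-land functions: it can
    export a schema plus either a single array or a stream of arrays
    via the Arrow PyCapsule interface."""
    has_schema = hasattr(obj, "__arrow_c_schema__")
    has_data = hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__")
    return has_schema and has_data

class MinimalArray:
    # The "very minimal object" a C implementation might return.
    def __arrow_c_schema__(self): ...
    def __arrow_c_array__(self, requested_schema=None): ...

assert is_geoarrowish(MinimalArray())
assert not is_geoarrowish(object())
```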

I'd add that a Type representation is another first-class data structure (implementing __arrow_c_schema__). This is needed for communicating the output type of something.

geoarrow.pyarrow

In addition to what you listed, I think other things that use pyarrow features (e.g., the parquet reader or datasets) are in scope. If the proportion of code dedicated to those features starts to overwhelm the code base I can see that changing, but at the moment it seems like a lot of work to move (e.g.) the GeoParquet reader somewhere else when it is really a very thin wrapper around PyArrow's reader.

I'd love to remove the hard dependency on geoarrow.c, although I also don't have to do that right now (happy to review a PR!).

The compute functionality (i.e., as_wkb()) is also in scope I believe: we want users to actually create geoarrow-encoded arrays and tables...it's unreasonable (I think) to force a user to consider exactly which subpackage they have to install to get a geoarrow-encoded version of POINT (0 1) or whatever. A lazy import for the actual implementations can remove a hard dependency on geoarrow.c or geoarrow.rust.whatever.
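The lazy-import idea might look roughly like this (backend module names and the loader function are illustrative assumptions; the candidate order is not a real decision anyone has made):

```python
import importlib

def load_geoarrow_backend(candidates=("geoarrow.c", "geoarrow.rust.core")):
    """Return the first installed compute backend; none is a hard
    dependency, so geoarrow.pyarrow only pays for what is present."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(
        "no geoarrow compute backend found; "
        "install geoarrow-c or geoarrow-rust-core"
    )
```

A function like as_wkb() would then call load_geoarrow_backend() on first use instead of importing a backend at module import time.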

geoarrow.pandas

Pretty much just to register the geoarrow accessor with any/all transformations that make sense as accessors, as you noted. The extension array would just be a generic wrapper around anything with the properties noted in the "data structures" section. A more practical purpose of this is that the to_pandas() method is ungodly slow if we don't have something to return.

This is a place where you could make a PointSeries/LinestringSeries etc that conform to the array spec you described. I'd maybe rather see that in geopandas than in geoarrow.pandas and I definitely can't commit to spending time doing anything about it, but because you can have control over the Series subclass, you could define methods on the series subclasses and the autocomplete would be awesome.

geoarrow.shapely

Probably this can just live in Shapely?

geoarrow.gdal

Probably this can just live in pyogrio and/or gdal.ogr?

geoarrow.c

Pretty much the same scope as it has now except now it can leverage the dunder methods to maybe export some functions or Kernel/Operation/Algorithm.

Static Typing

You'll have to lead the way here (by adding it in yourself or letting me know in a PR review where I should have added useful type annotations)...because of the mostly functional approach and garbage autocomplete of pyarrow, this is hard to do at the core level. At the pandas or Rust or polars level this is more feasible?

...I probably forgot some things! Apologies if these comments missed something about what you are proposing!

kylebarron commented 9 months ago

It looks like pyarrow doesn't implement __arrow_c_stream__ for a ChunkedArray?

It definitely should! Opening an issue would be helpful (I can take a stab at implementing).

Opened https://github.com/apache/arrow/issues/38717!

Primarily Functional API

Agreed! It may be nice to (optionally) expose something like the Kernel class (maybe called something else like Algorithm or Operation or Function). Notably, this will let you predict the output type (in the Arrow sense) of the operation given zero or more input types.

Do you have pseudocode of this? It's a little hard to wrap my head around. By output type do you mean the geometry type (point, line, etc) or array/chunked array/table?

from geoarrow.python.c import Area

operation = Area()
array = PolygonArray(...)
float_array = operation(array)

In addition to output type, something describing the input/output shape relationship would be nice (scalar function, vector function, aggregate function, window function).

One thing I've been doing in Rust is enabling some simple broadcasting. So you should be able to do intersects(array, other_array) just as easily as intersects(array, scalar_geom). In that sense, you'd be able to describe output shapes via static typing... (overloads for each) not sure exactly what you're thinking of.
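That broadcasting behavior can be sketched with plain Python sequences standing in for geometry arrays (the helper name and scalar-detection rule are illustrative only):

```python
def broadcast_args(left, right):
    """Pair up inputs for a binary kernel like intersects():
    array vs. array requires equal lengths, array vs. scalar repeats
    the scalar to the array's length."""
    if not isinstance(right, (list, tuple)):
        right = [right] * len(left)  # scalar: broadcast to array length
    if len(left) != len(right):
        raise ValueError(
            f"cannot broadcast lengths {len(left)} and {len(right)}"
        )
    return list(zip(left, right))

# intersects(array, other_array) and intersects(array, scalar_geom)
# would both funnel through the same pairing step:
assert broadcast_args([1, 2, 3], 9) == [(1, 9), (2, 9), (3, 9)]
assert broadcast_args([1, 2], [3, 4]) == [(1, 3), (2, 4)]
```

In a typed API this would be expressed as overloads (one signature per input shape), which is one way to describe the input/output shape relationship statically.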

Data structures

I think in practice we might want to consider anything "geoarrowish" (i.e., it will work when passed to any Python function in geoarrow land) that implements __arrow_c_schema__ (returning a geoarrow extension type) and implements __arrow_c_stream__ or __arrow_c_array__.

👍

In Rust it will probably be helpful/easy for you to implement the classes you describe, and it will probably be very helpful for all of us! I don't think I can do that in geoarrow.c, though...I will probably just return some very minimal object that implements the requisite dunder methods.

Ok this is good to know. I agree not every implementation needs those concrete classes; it's helpful for me to think of those classes in a logical sense, and on the rust side I think I'll end up implementing all of them as classes.

I'd add that a Type representation is another first-class data structure (implementing __arrow_c_schema__). This is needed for communicating the output type of something.

👍

geoarrow.pyarrow

In addition to what you listed, I think other things that use pyarrow features (e.g., the parquet reader or datasets) are in scope. If the proportion of code dedicated to those features starts to overwhelm the code base I can see that changing, but at the moment it seems like a lot of work to move (e.g.) the GeoParquet reader somewhere else when it is really a very thin wrapper around PyArrow's reader.

👍 I agree, I just forgot that there were other elements in there already than the extension types and arrays.

I'd love to remove the hard dependency on geoarrow.c, although I also don't have to do that right now (happy to review a PR!).

The compute functionality (i.e., as_wkb()) is also in scope I believe: we want users to actually create geoarrow-encoded arrays and tables...it's unreasonable (I think) to force a user to consider exactly which subpackage they have to install to get a geoarrow-encoded version of POINT (0 1) or whatever. A lazy import for the actual implementations can remove a hard dependency on geoarrow.c or geoarrow.rust.whatever.

hmm, I'm not sure what the best approach is here. I was originally thinking of geoarrow.c.as_wkb, but receptive to other ideas.

I will mention that we've had somewhat different priorities in geoarrow-c and geoarrow-rs so far, where it seems like you've been more intentional about supporting the full spec, whereas I only support 2D geometries right now.

geoarrow.pandas

Pretty much just to register the geoarrow accessor with any/all transformations that make sense as accessors, as you noted. The extension array would just be a generic wrapper around anything with the properties noted in the "data structures" section. A more practical purpose of this is that the to_pandas() method is ungodly slow if we don't have something to return.

This is a place where you could make a PointSeries/LinestringSeries etc that conform to the array spec you described. I'd maybe rather see that in geopandas than in geoarrow.pandas and I definitely can't commit to spending time doing anything about it, but because you can have control over the Series subclass, you could define methods on the series subclasses and the autocomplete would be awesome.

👍 That's cool! I also don't see myself spending time on this in the short term, but an interesting API to think about.

geoarrow.shapely

Probably this can just live in Shapely?

I don't see shapely having any interest in a pyarrow dependency. Maybe shapely would consider an optional dependency on pyarrow, but I would've guessed it's easier to have a separate package to wrap shapely.

geoarrow.gdal

Probably this can just live in pyogrio and/or gdal.ogr?

Here you're right; given that pyogrio already depends on pyarrow, maybe it would be easiest to add a geoarrow-python dependency to pyogrio and ensure it's loading to geoarrow extension types?

geoarrow.c

Pretty much the same scope as it has now except now it can leverage the dunder methods to maybe export some functions or Kernel/Operation/Algorithm.

Static Typing

You'll have to lead the way here (by adding it in yourself or letting me know in a PR review where I should have added useful type annotations)...because of the mostly functional approach and garbage autocomplete of pyarrow, this is hard to do at the core level. At the pandas or Rust or polars level this is more feasible?

I agree it's hard to do with pyarrow (once upon a time I spent a few hours trying to make type stubs for pyarrow!) but from Rust it should be pretty feasible!

paleolimbot commented 9 months ago

Do you have pseudocode of this?

I have actual code!

import geoarrow.pyarrow as ga
op = ga.Kernel.as_wkb(ga.wkt())
op._type_out
#> WkbType(geoarrow.wkb)

I was originally thinking of geoarrow.c.as_wkb, but receptive to other ideas.

geoarrow.c.as_wkb() can only ever return a sort of dummy object that implements __arrow_c_stream__(). I think it's reasonable that a pyarrow user (or developer writing something on top of pyarrow) would like a version of those functions that accept Array/ChunkedArray and return pyarrow objects. It currently works with a few pyarrow versions (I should properly test the minimum supported one)...anything using the dunder interface isn't useful for anything except the latest and greatest.

I don't see shapely having any interest in a pyarrow dependency.

No need for a pyarrow dependency?

import geoarrow.c.lib as lib
import numpy as np

type = lib.CVectorType.Make(lib.GeometryType.POINT, lib.Dimensions.XY, lib.CoordType.SEPARATE)
schema = type.to_schema()
builder = lib.CBuilder(schema)
builder.set_buffer_double(1, np.array([1.0, 2.0, 3.0]))
builder.set_buffer_double(2, np.array([3.0, 4.0, 4.0]))
array = builder.finish()
array
#> <geoarrow.c._lib.ArrayHolder at 0x10c0e3690>
# Just for demo:
import pyarrow as pa
import geoarrow.pyarrow as _
pa.Array._import_from_c(array._addr(), schema._addr())
#> PointArray:PointType(geoarrow.point)[3]
#> <POINT (1 3)>
#> <POINT (2 3)>
#> <POINT (3 4)>

...or a lazy geoarrow.rs import when something similar is supported there.

jorisvandenbossche commented 9 months ago

Thanks for starting this discussion, Kyle!

Don't have time to respond thoroughly right now, so just two quick things:

geoarrow.gdal

Probably this can just live in pyogrio and/or gdal.ogr?

Here you're right; given that pyogrio already depends on pyarrow, maybe it would be easiest to add a geoarrow-python dependency to pyogrio and ensure it's loading to geoarrow extension types?

I think that's indeed a reasonable proposal for pyogrio. I think the main question is how to specify this dependency. At the moment pyarrow is only an optional dependency for pyogrio, and Python doesn't really have a way to specify that "geoarrow-pyarrow is a required additional dependency in case the optional dependency pyarrow is installed", except for asking users to explicitly install this additional package whenever they also have pyarrow installed...
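The closest Python packaging gets is an extras group that bundles the two together, so opting in to one pulls in the other; a hypothetical fragment for pyogrio (the extra's name and version pins are made up):

```toml
# pyproject.toml (hypothetical): "pip install pyogrio[arrow]" installs
# pyarrow and geoarrow-pyarrow as a pair; neither is a hard dependency.
[project.optional-dependencies]
arrow = ["pyarrow>=14", "geoarrow-pyarrow"]
```

This still requires the user to opt in explicitly; there is no declarative "install B whenever A happens to be present", which is exactly the gap described above.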

geoarrow.shapely

Probably this can just live in Shapely?

I don't see shapely having any interest in a pyarrow dependency. Maybe shapely would consider an optional dependency on pyarrow, but I would've guessed it's easier to have a separate package to wrap shapely.

I think in the short term it might also make sense to have a version of this in geoarrow.pyarrow, certainly if initially this would return/accept pyarrow arrays. In the end, shapely already supports this (mostly) with to/from_ragged_array, so what you need to implement is the boilerplate to go from your Arrow object to the numpy arrays shapely needs/returns, and this boilerplate is going to be almost entirely library specific (i.e. writing this in Python for geoarrow.pyarrow will, I think, almost entirely use pyarrow-specific APIs). That holds at least as long as we use those existing functions and do this in Python; longer term it might be interesting to think about whether shapely could directly support the C Data Interface.

kylebarron commented 9 months ago

longer term it might be interesting to think about whether shapely could directly support the C Data Interface

That would be really cool, especially if it could vendor nanoarrow (?) geoarrow-c (?) and not need to add a complex pyarrow dependency that I imagine some shapely maintainers might balk at. I would love this to exist but can't say I have the C/Cython knowledge and motivation to make it happen myself 🙂

paleolimbot commented 9 months ago

especially if it could vendor nanoarrow (?) geoarrow-c (?)

I was thinking an optional runtime dependency on geoarrow-c (which can create a C-data interface array from pybuffers). I'm pretty game to give that a shot (has overlap with R things I want to do) but it may take a bit (and need some changes to geoarrow-c to be future-stable).

I was originally thinking of geoarrow.c.as_wkb, but receptive to other ideas.

How about moving them to geoarrow.pyarrow.compute? It gives them a tiny bit of separation (just like the io and dataset modules) and will probably make it easier to separate if needed at some point. It would be fairly easy to do (just rename _compute to compute and remove the function imports).

It seems like some things I should prioritize here are implementing the stream protocol for ChunkedArrays and implementing the protocol in objects returned by geoarrow.c?

kylebarron commented 9 months ago

It seems like some things I should prioritize here are implementing the stream protocol for ChunkedArrays and implementing the protocol in objects returned by geoarrow.c?

I'd agree with that. One of my priorities is implementing the core PyCapsule protocol in geoarrow-rs (I have to re-implement it from the arrow-rs version because arrow-rs doesn't maintain extension type information).

kylebarron commented 9 months ago

With https://github.com/geoarrow/geoarrow-rs/pull/293 I now have a working prototype (of an area method) based on the pycapsule interface. I haven't tested it yet on a pyarrow extension array, but it worked on the rust-based class through the pycapsule interface.