geoarrow / geoarrow-python

Python implementation of the GeoArrow specification
http://geoarrow.org/geoarrow-python/
Apache License 2.0

Initial package scaffolding #1

Open kylebarron opened 1 year ago

kylebarron commented 1 year ago

In https://github.com/kylebarron/geoarrow-rs/pull/140 @paleolimbot and I were talking about how to lay out the Python geoarrow packages.

kylebarron commented 1 year ago

Thinking about it again, I think the biggest problem with the approach in this PR is that the returned object from the submodule is not necessarily the same class as the core class. I.e. are we going to require that the submodules depend on geoarrow.core and always return core classes?

Maybe it would be better to use structural subtyping and have the core package focus on protocols? Then each package could have its own implementation of a point array if desired, which implements the geoarrow.core.PointArray protocol.
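
To make the structural subtyping idea concrete, here is a minimal sketch using `typing.Protocol`. Everything in it is an assumption for illustration: the methods on `PointArray`, the `RustPointArray` class, and the `total_points` helper are hypothetical, and the lists of tuples stand in for real Arrow buffers. It only shows that a class from another package can satisfy a core protocol without importing or inheriting from it.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class PointArray(Protocol):
    """Hypothetical protocol that a core package might define."""

    def __len__(self) -> int: ...

    def xy(self) -> list[tuple[float, float]]: ...


class RustPointArray:
    """Stand-in for a point array owned by some other package.

    It does not inherit from PointArray; it only has matching methods.
    """

    def __init__(self, coords: list[tuple[float, float]]) -> None:
        self._coords = list(coords)

    def __len__(self) -> int:
        return len(self._coords)

    def xy(self) -> list[tuple[float, float]]:
        return list(self._coords)


def total_points(arr: PointArray) -> int:
    # Accepts any object that structurally matches the protocol.
    return len(arr)


points = RustPointArray([(0.0, 0.0), (1.0, 2.0)])
```

With this layout a static type checker accepts `total_points(points)`, and `isinstance(points, PointArray)` also works at runtime because the protocol is marked `runtime_checkable` (the runtime check only looks at method names, not signatures).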

Cool! I like the idea of having a geoarrow-verse. Bindings to the C implementation would be geoarrow.c (right?), which might or might not be used by other components.

Yeah it could be named anything geoarrow.[name].

Even with a pyarrow dependency, I still think we want an abstract "ArrayStorage" class. For pyarrow, this might be an Array or a ChunkedArray. I would personally put all pyarrow-related implementations in import geoarrow.pyarrow (which would take care of other pyarrow-specific details like registering the extension types). If you have to declare exactly one, I'd pick ChunkedArray because Array -> ChunkedArray is always zero-copy (but often not the other way around).

Is your goal to separate pyarrow because of bundle size? Because it's a large dependency that some projects won't want?

These are valid concerns, but I'm not sure what an ArrayStorage class would hold. Or are you saying that it's an ABC and you'd have pyarrow storage and nanoarrow storage on top of that?

I think it's important to have Array storage and not just ChunkedArray, because that guarantees to the user (developer) that all geometries in the array are in contiguous memory.
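
One way to read the ABC question above is sketched below. This is an assumption about what such a design could look like, not a proposal from either commenter: `ArrayStorage`, `ContiguousStorage`, and `ChunkedStorage` are hypothetical names, and plain Python lists stand in for Arrow buffers so the sketch runs without pyarrow or nanoarrow. It illustrates the two points in the discussion: wrapping a contiguous array as a single chunk needs no copy, while going the other way generally does.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class ArrayStorage(ABC):
    """Hypothetical abstract storage; backends would subclass it."""

    @abstractmethod
    def num_chunks(self) -> int: ...

    @abstractmethod
    def __len__(self) -> int: ...

    def is_contiguous(self) -> bool:
        # Zero or one chunk means all values sit in one buffer.
        return self.num_chunks() <= 1


class ContiguousStorage(ArrayStorage):
    def __init__(self, values: list) -> None:
        self.values = values

    def num_chunks(self) -> int:
        return 1

    def __len__(self) -> int:
        return len(self.values)


class ChunkedStorage(ArrayStorage):
    def __init__(self, chunks: list[list]) -> None:
        self.chunks = chunks

    @classmethod
    def from_contiguous(cls, storage: ContiguousStorage) -> ChunkedStorage:
        # Wrap the existing buffer as a single chunk: no copy is made.
        return cls([storage.values])

    def num_chunks(self) -> int:
        return len(self.chunks)

    def __len__(self) -> int:
        return sum(len(chunk) for chunk in self.chunks)

    def combine(self) -> ContiguousStorage:
        # Concatenating the chunks builds a new buffer, i.e. a copy.
        return ContiguousStorage([v for chunk in self.chunks for v in chunk])


contiguous = ContiguousStorage([1, 2, 3])
wrapped = ChunkedStorage.from_contiguous(contiguous)
two_chunks = ChunkedStorage([[1, 2], [3]])
```

A function that requires contiguous memory could then accept `ContiguousStorage` specifically, while more general code accepts any `ArrayStorage`.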

I would personally use the terminology Series for what you have here (with Array as a wrapper around a pyarrow Array/ChunkedArray/maybe something else in the future). This distinction is roughly what both Pandas and Polars do (Arrow C++ also separates this but calls it ArrayData and Array).

I think I have a different idea here of the integration point between this package and other packages in the ecosystem. I wouldn't use the terminology Series because that implies extra levels of abstraction above a contiguous array of geometries. I see this as lower level than Pandas (and I don't think Polars should be considered here at all, because Polars will be much more efficient with a Rust binding).

I like this, which is similar to what pandas does with str and similar accessors (and what cuDF does for type-specific operations). I would maybe call geo georust but 🤷 .

Maybe rust is better. I'd prefer fewer characters

paleolimbot commented 1 year ago

I wonder if it's too early to scaffold an object-oriented approach to the Array here. At its heart, there are a lot of functions that accept something array-like and return something array-like (e.g., geoarrow.geos.buffer() or geoarrow.geos.length()). It may be that we don't need our own Array subclass here...pyarrow has its own system for dealing with this (the ExtensionArray), as does pandas (the ExtensionArray/accessors) and datafusion and polars and any other dataframe APIs that may pop up.

paleolimbot commented 1 year ago

Also feel free to push forward and execute your vision here...a new array interface isn't something I'm all that passionate about but that's not to say it isn't valuable!

kylebarron commented 1 year ago

Yeah, I agree it's early and there are a lot of unknowns.

The reason I reach for an object-oriented approach is that not all operations are implemented on every geometry type. E.g. linestring simplification might not be implemented for points, or a clustering algorithm might be implemented only for points. And it's nicer to have some IDE hinting for which operations can be used with which data type, especially since in arrow the types are known and strict.

Using arrow objects directly without any wrapping classes loses all typing support.
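
As a rough illustration of the typing argument, the sketch below gives each geometry type its own class so that an IDE or type checker only offers the operations that exist for that type. The class names, fields, and methods are hypothetical; `centroid` and `length` are simple placeholders standing in for point-only operations such as clustering and linestring-only operations such as simplification, and tuples stand in for Arrow storage.

```python
from __future__ import annotations

import math
from dataclasses import dataclass

Coord = tuple[float, float]


@dataclass(frozen=True)
class PointArray:
    """Hypothetical typed wrapper exposing point-only operations."""

    coords: tuple[Coord, ...]

    def centroid(self) -> Coord:
        # Placeholder for an operation that only makes sense on points.
        if not self.coords:
            raise ValueError("cannot take the centroid of an empty array")
        n = len(self.coords)
        return (
            sum(x for x, _ in self.coords) / n,
            sum(y for _, y in self.coords) / n,
        )


@dataclass(frozen=True)
class LineStringArray:
    """Hypothetical typed wrapper exposing linestring-only operations."""

    lines: tuple[tuple[Coord, ...], ...]

    def length(self) -> list[float]:
        # Sum the segment lengths of each linestring.
        return [
            sum(math.dist(a, b) for a, b in zip(line, line[1:]))
            for line in self.lines
        ]


points = PointArray(((0.0, 0.0), (2.0, 4.0)))
lines = LineStringArray((((0.0, 0.0), (3.0, 4.0)),))
```

Here `points.length()` is flagged by a type checker before the code runs, which is the kind of feedback that is lost when every function takes and returns an untyped Arrow array.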

It might be too early to do any work on a core library. I'll push along my Python bindings to experiment with some different approaches