Open kylebarron opened 1 year ago
Thinking about it again, I think the biggest problem with the approach in this PR is that the object returned from a submodule is not necessarily the same class as the core class. I.e., are we going to require that the submodules depend on geoarrow.core and always return core classes?
Maybe it would be better to use structural subtyping and have the core package focus on protocols? Then each package could have its own implementation of a point array if desired, which implements the geoarrow.core.PointArray protocol.
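To make the structural-subtyping idea concrete, here is a minimal sketch using typing.Protocol. The method names (coords, __len__) and the ListPointArray class are hypothetical and are not taken from any geoarrow package; they only illustrate that an implementation can satisfy a core protocol without subclassing it.

```python
from typing import Protocol, Sequence, Tuple, runtime_checkable


@runtime_checkable
class PointArray(Protocol):
    """Hypothetical protocol: any object with these methods counts as a point array."""

    def __len__(self) -> int: ...

    def coords(self) -> Sequence[Tuple[float, float]]: ...


class ListPointArray:
    """One possible implementation; it never subclasses the protocol."""

    def __init__(self, xy):
        self._xy = [(float(x), float(y)) for x, y in xy]

    def __len__(self) -> int:
        return len(self._xy)

    def coords(self):
        return list(self._xy)


def bounding_box(points: PointArray):
    """Works with any implementation that satisfies the protocol."""
    xs, ys = zip(*points.coords())
    return (min(xs), min(ys), max(xs), max(ys))
```

With this layout a submodule could return its own class and still type-check against the core protocol, at the cost of the core package defining no shared behaviour.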
Cool! I like the idea of having a geoarrow-verse. Bindings to the C implementation would be geoarrow.c (right?), which might or might not be used by other components.
Yeah, it could be named anything: geoarrow.[name].
Even with a pyarrow dependency, I still think we want an abstract "ArrayStorage" class. For pyarrow, this might be an Array or a ChunkedArray. I would personally put all pyarrow-related implementations in import geoarrow.pyarrow (which would take care of other pyarrow-specific details like registering the extension types). If you have to declare exactly one, I'd pick ChunkedArray because Array -> ChunkedArray is always zero copy (but often not the other way around).
Is your goal to separate pyarrow because of bundle size? Because it's a large dependency that some projects won't want?
These are valid concerns, but I'm not sure what an ArrayStorage class would hold? Or are you saying that's an ABC and you'd have pyarrow storage and nanoarrow storage on top of that?
I think it's important to have Array storage and not just ChunkedArray, because that guarantees to the user (developer) that all geometries in this array are in contiguous memory.
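The storage question above can be sketched in plain Python. Everything here is an assumption made for illustration: the ArrayStorage, ContiguousStorage, and ChunkedStorage names are hypothetical, and Python lists stand in for real Arrow buffers. The sketch only shows why going from one contiguous array to a chunked one needs no copy, while the reverse generally does.

```python
from abc import ABC, abstractmethod


class ArrayStorage(ABC):
    """Hypothetical abstract storage; real backends might wrap pyarrow or nanoarrow."""

    @abstractmethod
    def chunks(self) -> list:
        """Return the underlying contiguous buffers."""

    def __len__(self) -> int:
        return sum(len(c) for c in self.chunks())


class ContiguousStorage(ArrayStorage):
    def __init__(self, values: list):
        self.values = values

    def chunks(self) -> list:
        return [self.values]

    def to_chunked(self) -> "ChunkedStorage":
        # Wraps the existing buffer; nothing is copied.
        return ChunkedStorage([self.values])


class ChunkedStorage(ArrayStorage):
    def __init__(self, parts: list):
        self.parts = parts

    def chunks(self) -> list:
        return self.parts

    def to_contiguous(self) -> ContiguousStorage:
        if len(self.parts) == 1:
            return ContiguousStorage(self.parts[0])  # single chunk: no copy
        # Several chunks must be concatenated into a new buffer (a copy).
        return ContiguousStorage([v for part in self.parts for v in part])
```

A function that accepts ContiguousStorage gets the contiguity guarantee from the type itself, which is the argument for keeping both kinds of storage.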
I would personally use the terminology Series for what you have here (with Array as a wrapper around a pyarrow Array/ChunkedArray/maybe something else in the future). This distinction is roughly what both Pandas and Polars do (Arrow C++ also separates this but calls it ArrayData and Array).
I think here I have a different idea of the integration point between this package and other packages in the ecosystem. I wouldn't use the terminology Series because that implies extra levels of abstraction above a contiguous array of geometries. I see this as lower level than Pandas (and I don't think Polars should be considered here at all, because Polars will be much more efficient with a Rust binding).
I like this, which is similar to what pandas does with str and similar accessors (and what cuDF does for type-specific operations). I would maybe call geo georust but 🤷.
Maybe rust is better. I'd prefer fewer characters.
I wonder if it's too early to scaffold an object-oriented approach to the Array here. At its heart, there are a lot of functions that accept something array-like and return something array-like (e.g., geoarrow.geos.buffer() or geoarrow.geos.length()). It may be that we don't need our own Array subclass here... pyarrow has its own system for dealing with this (the ExtensionArray), as does pandas (the ExtensionArray/accessors) and datafusion and polars and any other dataframe APIs that may pop up.
Also feel free to push forward and execute your vision here... a new array interface isn't something I'm all that passionate about, but that's not to say it isn't valuable!
Yeah, I agree it's early and there are a lot of unknowns.
The reason I reach for an object oriented approach is that not all operations are implemented on every geometry type. E.g. linestring simplification might not be implemented for points, or a clustering algorithm might be implemented only for points. And it's nicer to have some IDE hinting for what operations can be used for which data type, especially since in arrow we have known strict typing.
Using arrow objects directly without any wrapping classes loses all typing support.
It might be too early to do any work on a core library. I'll push along my Python bindings to experiment with some different approaches
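As a rough illustration of the typing argument, the sketch below gives each geometry type its own wrapper class that exposes only the operations valid for it. The class layout and the centroid and simplify methods are hypothetical stand-ins, and plain Python lists replace Arrow storage; the only point is that an IDE or type checker can then tell which operations exist for which type.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[float, float]


@dataclass
class PointArray:
    coords: List[Coord]

    def centroid(self) -> Coord:
        """Point-only operation (a stand-in for something like clustering)."""
        n = len(self.coords)
        return (
            sum(x for x, _ in self.coords) / n,
            sum(y for _, y in self.coords) / n,
        )


@dataclass
class LineStringArray:
    lines: List[List[Coord]]

    def simplify(self) -> "LineStringArray":
        """Line-only operation: naive stand-in that keeps each line's endpoints."""
        return LineStringArray([[line[0], line[-1]] for line in self.lines])
```

Calling simplify on a PointArray is flagged by a type checker before the code runs, which is the hinting that is lost when functions take untyped Arrow arrays directly.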
In https://github.com/kylebarron/geoarrow-rs/pull/140 @paleolimbot and I were talking about how to lay out python geoarrow.
geoarrow.core is defined as an implicit namespace package. So import geoarrow does nothing; import geoarrow.core imports this package.
A PointArray dataclass that wraps pyarrow. Maybe in the future we can remove pyarrow as a dependency, but for now it's simplest to have.
Accessors geo, geos, and proj, which should give typing autocompletions in an IDE as long as those namespace packages are installed.
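One way such an accessor could be wired up is sketched below. This is an assumption, not the layout used in the linked PR: load_accessor is a hypothetical helper, the PointArray here holds plain tuples where the proposal wraps pyarrow, and the accessor is typed as Any, so real IDE completions would additionally need the submodule's types to be visible to the type checker.

```python
import importlib
from dataclasses import dataclass
from typing import Any, List, Tuple


def load_accessor(module_name: str) -> Any:
    """Import an optional accessor module, with a clearer error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"accessor requires the optional package {module_name!r}"
        ) from exc


@dataclass
class PointArray:
    coords: List[Tuple[float, float]]  # stand-in for pyarrow-backed storage

    @property
    def geos(self) -> Any:
        # Hypothetical: resolved only on access, so the core package
        # does not need geoarrow.geos installed at import time.
        return load_accessor("geoarrow.geos")
```

Resolving the module lazily keeps the core package importable on its own, and accessing geos without the namespace package installed fails with a message that names the missing package.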