geopandas / community

GeoPandas Community Docs and Discussions
6 stars 0 forks source link

Support for spherical geometry (S2) in Python #10

Open jorisvandenbossche opened 2 years ago

jorisvandenbossche commented 2 years ago

General issue to track the idea of improving the support for working with spherical geometries (instead of planar) in GeoPandas and more generally the Python geospatial ecosystem.

Some background links:

jorisvandenbossche commented 2 years ago

And the previous community meeting, we had a discussion about this with @benbovy. Pasting the notes from that meeting here:

benbovy commented 2 years ago

Some updates on this:

I've tried using pybind11 and xtensor-python to implement some vectorized functions working on numpy object arrays where array elements are Python-wrapped s2 objects (i.e., an approach similar to the one used in pygeos). Without success so far, unfortunately (see https://github.com/benbovy/pys2index/pull/4).

There has been some effort to support numpy object arrays and/or ufuncs in pybind11 (e.g., https://github.com/pybind/pybind11/pull/1152 and also this fork). I think it would really help, although it is not yet clear to me if/how we can adapt it in this case. I left a comment there: https://github.com/pybind/pybind11/pull/1152#issuecomment-987707674.

brendan-ward commented 2 years ago

Thanks for the update @benbovy !

I haven't been able to fully get my head around the use of pybind11 / xtensor-python parts here, but a couple thoughts came to mind: 1) a common use would be in GeoPandas, where a geometry array is 1-dimensional. So supporting n-dimensional ufuncs is maybe a nice-to-have rather than must have? Thus the essential bit is really just a loop over a 1D array of objects (pointers to wrapped S2 entities). It seems like we could leverage numpy to do the broadcasting before calling into the pybind11 / C++ tier? (not sure, haven't really thought this out)

2) maybe it would be easier to leverage numpy C API (i.e., meet it where it's at), as is done in pygeos, and then build this as a Python C extension a bit more manually (like pygeos) rather than via pybind11? To do this, I think we'd have to create a C API around S2, so in addition to the boilerplate required for Python C extension there would also be some maintenance burden on wrapping S2 as a C API. One advantage of this approach is that it should enable numpy ufuncs in the C tier, much like pygeos. This would need some investigation into how S2 might be best wrapped as a C API, but we could theoretically leverage some of the recent patterns worked out in GEOS (passing coordinate buffers instead of setting individually). Plus there might be some advantage of having similar patterns at the lower (C) level between the two implementations on top of underlying geometry engines (GEOS and S2); this might be able to better leverage contributors between the pygeos / shapely side and S2 bindings side. (though might be trying too hard to make this analogous to pygeos)

martinfleis commented 2 years ago

@benbovy on a less technical note, if you still have a capacity and an interest in applying for the NumFOCUS SDG, the timeline for the next round is following:

If we should go ahead, I suggest organising a call in the first half of February to discuss the details.

benbovy commented 2 years ago

Thanks for your feedback @brendan-ward !

A common use would be in GeoPandas, where a geometry array is 1-dimensional. So supporting n-dimensional ufuncs is maybe a nice-to-have rather than must have? Thus the essential bit is really just a loop over a 1D array of objects (pointers to wrapped S2 entities)

I agree, we could certainly start with 1-dimensional arrays. That said, longer term I'd be interested in exploring the possibility of reusing n-dimensional vectorized S2 Python wrappers (as well as pygeos) with Xarray, which is undergoing some heavy refactoring to allow custom indexes. Xarray + geospatial indexes would be great for processing outputs of climate / ocean / landscape evolution / earth system models.

I think we'd have to create a C API around S2

That's my main concern. Unfortunately, there's no s2_c.h equivalent of geos_c.h as pointed out by @paleolimbot (https://youtu.be/zqRhF2XM1CE?t=1197). See also https://github.com/r-spatial/s2/issues/160. Long term it would be great to have a common S2 C API that can be maintained by both the Python and R (and maybe other) geospatial communities, but this represents quite a big task IMHO. Also, it's best if we can avoid vendoring S2 in a C/Python wrapper.

On the other hand, if we can make this possible without much effort, simply using pybind11 (and xtensor-python) to create the Python wrappers would allow us to develop new features / iterate very quickly and would avoid the burden of maintaining an intermediate C API + dealing with the low-level Numpy C API (+ vendoring S2). It not yet clear to me what is needed to make this approach work, though. It would be great to have thoughts from folks at QuantStack. @wolfv, have you ever thought about adding vectorized functions / ufuncs working with numpy arrays of wrapped CGAL objects (as np.object dtype) in scikit-geometry?

benbovy commented 2 years ago

@martinfleis that sounds good. I'm happy to discuss the details within the next two weeks.

Before going further with a NumFOCUS application, I'd like to clarify the technical challenges of the C API vs. pybind11 approaches. TBH I'm not sure to have the full capacity of going deep into either the design of a S2 C API or heavy customization of pybind11 internals, especially within the context of a NumFOCUS small development grant.

brendan-ward commented 2 years ago

@paleolimbot I'd be curious about your thoughts on a shared C API against S2 that could be leveraged for both Python, sf, and anything else that can use a C API. It looks like some of the code already in sf could be leveraged to get this started (per your thinking in r-spatial/s2#160).

Specifically, if this is something of interest to the broader community, should we try and take this up directly with S2? Long term, it seems ideal that the C API lives inside that project, as it does with GEOS, and given the approach with GEOS it seems also reasonable that it not be expected to have full coverage of the underlying C++ API, at least not initially.

However, that might lead to other challenges:

Since it looks like sf has already taken the vendor and partially wrap with C API approach, we'd be curious about the maintenance effort there as we think through our approach here.

jorisvandenbossche commented 2 years ago

From watching the FOSS4G talk of @paleolimbot mentioned above (https://www.youtube.com/watch?v=zqRhF2XM1CE), it is my understanding that the S2 Geometry library itself it quite low level, and all the functionality to work on "simple features" (as what we get from GEOS for planar features) actually lives in the R s2 package.

If that's correct, we shouldn't put effort in directly interfacing with S2, but rather with the C or C++ layer in R's s2. Which would then indeed require making this part of s2 more standalone as outlined in https://github.com/r-spatial/s2/issues/160 / mentioned by Brendan in the comment above.

paleolimbot commented 2 years ago

Thanks all for looping me in here! @jorisvandenbossche is right that anything resembling a GEOS-like interface lives in the R package. More specifically, I copied the BigQuery Geography API (which copies the PostGIS API more or less), because I could test it as I went.

I'm conflicted about a C API for S2 because it robs S2 of some of the things that make it so great (despite the fact that I opened that ticket for myself). That said, S2 C++ is complicated to use and an awesome library that nobody has access to in Python is not doing anybody any good. I have some ideas about how to organize a C API and will give it a try since it seems like there may be interest in helping if I get it started. The nice part about doing this within the S2 R package is that there's near 100% test coverage, although eventually a C API should live elsewhere.

Another approach I've been thinking about is one that operates on vectors of geometries (as ABI-stable C ArrowArrays) rather than individual geometries. Operating on chunks of geometries is nice because you don't have to sacrifice performance on points to support multipolygons (et al.)...you just send more features in a chunk and the implementation can loop without a bunch of tiny allocs or virtual method calls. This depends on the somewhat experimental stuff that Joris, I, and others have been working on with how geometry can be passed around as a struct ArrowArray.

Or a combination of both! I'll start my own C API ticket and see what happens.

I probably missed a few things from the thread...let me know if I didn't catch something important!

brendan-ward commented 2 years ago

one that operates on vectors of geometries (as ABI-stable C ArrowArrays)

This sounds very interesting! I've been wondering where some of the boundaries might be with those (mostly a separate topic), namely serialization to / from file storage vs on the compute side between a client library (e.g., something in Python or R) and the underlying geometry implementation (e.g., GEOS or S2).

Is the idea that the core object that is used in the client library becomes the ArrowArray, which is then passed to a layer in C/C++ that translates from those to GEOS or S2 objects? Is that limited to geometry construction / serialization from the underlying library (GEOS / S2), and otherwise the client holds on to wrapped pointers to GEOS or S2 objects?

you just send more features in a chunk and the implementation can loop without a bunch of tiny allocs or virtual method calls

I was thinking about this too in terms of trying to figure out which operations in a hypothetical C API for S2 would need to work on scalar geometries vs 1D arrays of geometries and have the underlying implementation do the looping rather than the caller (passing arrays seems very appealing). The part where I got hung up was that on the Python side (in pygeos/shapely), we're leveraging looping mechanics from numpy to give us the inner loop over multi-dimensional arrays and then operating on scalar geometries. I.e., it is the client that knows how to do the looping rather than the underlying implementation. Though our most common use cases so far are likely 1D arrays of geometries (though @benbovy 's comment re: n-dimensional arrays is well taken), and I might not be thinking about this correctly at all.

I have some ideas about how to organize a C API and will give it a try since it seems like there may be interest in helping if I get it started

Yes - interested! (may not have tons of capacity to throw at this, but interested in being involved). I was thinking of drafting a Google Doc or similar to outline some thoughts on the C API, but your thoughts on this are going to be a much better starting point. Is there someplace where we (community) can start collaborating around details about the API more specifically?

Is there interest in a meeting to discuss this topic, and shared needs in the S2 / ArrowArray space for both Python and R?

benbovy commented 2 years ago

From this discussion it seems that the best path forward would be to develop a common, GEOS-like C API for S2 that can be leveraged from different "client" ecosystems such as R or Python geospatial libraries. Although you guys certainly have much more experience than I have with GEOS, Arrow and/or the Numpy C API, I'd still be happy to help somehow if I'm able to do that, as I would really like to see more S2 capabilities accessible from within Python.

Some naive questions:

@paleolimbot how much effort do you think it would require to refactor bits from the S2 R package into a target-language agnostic C API? Do you think that short or mid-term we could already have something (even with a few features only) so that we can start building Python wrappers on it?

Perhaps a very naive question: would it be possible to separate the building of a GEOS-like C API from the building of the S2 library itself? @brendan-ward's comment re: ideally the C API would live inside the S2 project, although I don't know if there's interest in that from the S2 maintainers point of view and this would also probably further slower the iteration speed. Note: there are conda packages for the S2 library.

Could we easily reuse with Numpy arrays a C API that is working with C Arrow arrays? I.e., perform broadcasting, convert to Arrow array, call the C API and then convert back to Numpy array at little overhead cost? As a user, I indeed find the pygeos/shapely approach based on Numpy ufuncs quite appealing :), even though I acknowledge that it is probably overkill for 80% of use cases.

I'm conflicted about a C API for S2 because it robs S2 of some of the things that make it so great

I was also wondering about that, but I guess that S2-specific wrappers may live alongside (and more-or-less independently from) a GEOS-like "standardized" S2 C API, depending on what we want to achieve using Python, R,...?

brendan-ward commented 2 years ago

@benbovy (speaking for myself here) - I don't think we've mady any decisions about which direction to pursue or even recommend, we're still gathering info to weigh pros / cons. So please don't take the suggestions around a C API for S2 as trying to shut down your work on the pybind11 / xtensor-python approach!

It does seem that in elaborating how we'd want a C API to work would also inform how we'd want an API via pybind11/xtensor-python to work (though a good bit of that would be inside the C++ tier rather than API exposed to python).

Question: if for the short term we wanted to stick to the 1D approach to get this implemented most quickly, and manage all the looping ourselves as a simple loop over that array, would that sidestep the issues around how to handle object arrays in xtensor-python that you were bumping into? I.e., sidestep the ufunc related issues specifically. Or is the issue more at a fundamental level of trying to have an object array of S2 objects at any dimension? (sorry - not entirely following your box example linked above)

Yet another idea, no idea if it would work. I wonder if there is middle ground of using pybind11 but with a more C style interface rather than object-oriented interface (i.e., don't use pybind to expose C++ classes with methods).

So in pygeos our core geometry object is just a struct that holds a couple pointers and is defined as a type:

typedef struct {
  PyObject_HEAD void* ptr;
  void* ptr_prepared;
} GeometryObject;

(other mechanics around managing the python type wrapping around this omitted here)

Which are then always passed around as PyObject pointers for use in numpy arrays or scalars; numpy doesn't need to know that . The ptr just points to a GEOSGeometry, which (if I follow) is itself just a struct around a GEOS C++ geometry object.

So (if I follow correctly), it seems like it would be possible (in theory, not up on pybind11 yet) to do something similar: constructors return a pointer to a struct (let's call this PyS2Geometry for example) that is what is stored into numpy arrays and passed back to Python; this has has an underlying C++ class underneath (let's call this S2Geometry) that is not visible to pybind11 specifically, but is accessible in your C++ tier by following the pointer. Note that the S2Geometry could just be a shell that manages more specific types underneath or via inheritance.

Then I think you could expose functions like intersects(PyS2Geometry a, PyS2Geometry b) (scalar example) to Python, and in your C++ tier get back the underlying S2Geometry, and call methods or functions using those.

So in this purely hypothetical example, pybind11 is really just helping with the boilerplate around creating a Python C extension, but allowing you to more easily have a C++ implementation underneath rather than having to define a C API in the middle. No idea what that means for working with numpy at the C++ level though.

benbovy commented 2 years ago

No worries at all! I'm really open about which direction to pursue and I'm enjoying the discussion here! Sorry if my tone suggested otherwise.

The box example is also confusing, sorry. I was messing around with pybind11 (and dumped some messy code in a branch), trying to see if we can create vectorized functions with custom types, both exposed to Python. I defined a box class wrapping the S2 equivalent but it could have been any other custom type, not even related to S2.

Note that the S2Geometry could just be a shell that manages more specific types underneath or via inheritance.

That is what I had in mind too (instead of exposing more specific S2 derived types to Python).

pybind11 is really just helping with the boilerplate around creating a Python C extension, but allowing you to more easily have a C++ implementation underneath rather than having to define a C API in the middle.

Yes this is what I wanted to check. Taking your example, I was wondering if pybind11 could even help with the boilerplate around the PyS2Geometry struct and the vectorization, i.e.,

However, I couldn't get it to work with the box example above. It is likely that I'm thinking about it too naively and that I'm missing something important. For example, py::vectorize doesn't work with non-POD types. It is not clear either to me how we can separate Numpy stuff from Eigen stuff implemented in this PR and in this pybind11 fork.

brendan-ward commented 2 years ago

@benbovy I did some testing with pybind11 around returning numpy arrays of custom objects here.

Let me preface this by saying I am not at all up to speed on modern C++, esp. pointer semantics, still quite unfamiliar with pybind11, and basically don't know what. I'm doing here. I think this is working for a hypothetical example of wrapping a foreign object into a class that we then return as a pointer. I only tried the 1D array case and did not use xtensor-python.

The things that seemed to make this work were using PYBIND11_NUMPY_OBJECT_DTYPE from the pybind #1152 issue to register our custom type as an object dtype for numpy, and then returning a unique_ptr (should it be shared_ptr??) to our custom class as each element in the array (so that instances weren't copied when assigning into the array for return). Also had to register the custom class with pybind11, so that it knew how to __repr__ it, for example.

For the example I construct a custom Point object from x and y input arrays (looping ourselves), where a Point holds an instance of a foreign object, meant to simulate an S2 object, which allocates and deallocates memory (I haven't yet profiled for memory leaks); it seems like it is calling the destructors properly and should release it.

Point was just meant to stand in for a wrapper around S2 objects, not a specific concrete Point type. Not sure if this solves the specific issues you were bumping into above.

paleolimbot commented 2 years ago

Hi all...sorry for the radio silence, I wanted to put some fingers to keyboard before I said that something was going to be easy or hard!

This thread has helped me a lot, in particular helping clarify that S2 doesn't necessarily need a C API, it really just needs the simple features compatability layer to live outside the code that interfaces with R. The real crux of it is that there is no S2Geometry class like there is with GEOS, which is what I had to invent when I wrote the bindings the first time. Handling the C linking stuff (catching exceptions, defining opaque pointer types) is an orthogonal problem that sounds like it won't be necessary. Anything Python-related in compiled code is over my head, so correct me if I didn't get the gist of that right.

I've opened a PR to start the process of extracting out a standalone S2Geography class and functions (e.g., s2_num_points() that operate on them ( https://github.com/r-spatial/s2/pull/165 ). Feel free to comment, particularly on the files in src/s2-geography/, since that would theoretically be the folder that a Python package could copy and use. My C++ is not awesome and I would be interested in any feedback on the design! Perhaps a good place to start would be geography.hpp for those keen.

With respect to the possibility of an S2Geography that sits on top of an Arrow array, I wrote the class in such a way that this could be implemented (i.e., an S2Geography does not have to own its data, although all the current subclasses do). That is perhaps a future battle.

martinfleis commented 2 years ago

Just a reminder that if we want to apply for the SDG grant in this round, the Proposal Submission Deadline is March 4, 2022.

Cc @benbovy

benbovy commented 2 years ago

Hello everyone, getting back on this thread after a few busy weeks, sorry!

@brendan-ward, @paleolimbot, many thanks for sharing what you've done!

@brendan-ward - Your test was very helpful. I could also make work the 1D array case with xtensor-python, using std::unique_ptr<MyClass> instead of MyClass for the type of the array elements. I couldn't make a universal function, though (maybe because of this?) I still don't have a full understanding of what I'm doing TBH, but this is already promising.

@paleolimbot - I like your approach in geography.hpp. I wonder if we couldn't use a template class (i.e., S2Geography<S2Point>, S2Geography<S2Polyline>, S2Geography<S2Polygon>, etc.) instead of runtime polymorphism, but it's probably not a big deal.

@martinfleis - We're close to the dealine, indeed :/ I'll try to see if I can draft a proposal early next week, using as a reference the former proposals that you mentioned during the community meeting. I'll keep you updated! We can have a call to discuss the details if needed and if you're available next Monday or Tuesday.

martinfleis commented 2 years ago

@benbovy we can always postpone it to another round (June 3). You can also check our GSoC project idea on S2 and salvage some text from that (https://github.com/geopandas/geopandas/wiki/Google-Summer-of-Code-2022#s2---bringing-spherical-geometry-to-python).

I am available on Monday after 5pm UTC and Tuesday at almost any time feasible for Europe. But while I am happy to help organising, I am of no use technically in this.

brendan-ward commented 2 years ago

@benbovy I should be available after 16:00 UTC on Monday/Tuesday if it would be helpful for me to join, but definitely do not hold it up on my account. Happy to help review proposal / ideas.

paleolimbot commented 2 years ago

Ok, it's done! There's a long way to go, but at least the simple features <-> s2geometry mapping is now in its own repo: https://github.com/paleolimbot/s2geography . I'm not awesome at CMake but it seems to work locally and on GitHub actions!

jorisvandenbossche commented 2 years ago

That's really great @paleolimbot !

jorisvandenbossche commented 2 years ago

We had a chat with Dewey, Benoît and Brendan yesterday, and some (partial/unstructured) notes can be found at https://hackmd.io/FiNFZ6wkQCufZyyxBewFYA?both

benbovy commented 1 year ago

Hi all,

I’m happy to report some good progress on this front.

s2geometry

s2geography

s2shapely (Python bindings)

https://github.com/benbovy/s2shapely/

What do you think about the name “s2shapely”? I kind of like it, especially if we are able to replicate most of shapely’s API. Perhaps it would make sense to eventually move the repository into the shapely Github organization?

The pybind11 workarounds used here still need to prove their "robustness", but if it works well in the general case (I'm optimistic 🙂) then I don’t see any major blocker for the next steps. I’m going to work on implementing more things in s2shapely by copying shapely’s API.

() EDIT: not real* Numpy ufuncs but pybind11 vectorized functions. (**) EDIT2: unfortunately, pybind11::vectorize doesn't seem to support releasing the GIL? https://github.com/benbovy/s2shapely/issues/2

martinfleis commented 1 year ago

@benbovy this is great!

What do you think about the name “s2shapely”?

I understand the choice but can already imagine a situation when debugging some geopandas bug:

I would probably try to pick another name that does not cause a confusion. The similar issue may happen when you are talking about it. It will always require clarification that this is the one with s2, not a shapely you already know.

benbovy commented 1 year ago

@martinfleis I agree this may be confusing in such situation. On the other hand, if the goal is to closely replicate shapely's API using a different geometry backend, choosing a name that is too unrelated may be confusing too for newcomers.

To help clarifying between the two packages maybe we could provide different class names, e.g., S2Point vs. Point, S2Polygon vs. Polygon, etc (https://github.com/benbovy/s2shapely/issues/1)? The base classes already have different names: Geography vs. Geometry. Also, calling s2shapely API with shapely.Geometry object or vice-versa is not possible and clear error messages would already help avoiding too much confusion.

Some alternative names:

paleolimbot commented 1 year ago

I think s2shapely is probably accurate for what you're trying to do here and I doubt it will be all that confusing; however, you could also use the name 's2geography'.

jorisvandenbossche commented 1 year ago

@benbovy thanks for the overview of todo items and progress!

To help clarifying between the two packages maybe we could provide different class names,

I would personally lean to rather use a more distinct package name, but then the same class names to make the API as similar as possible (except things like base class geography vs geometry, etc).

I was thinking the same as what @paleolimbot already mentioned as well: we could just use "s2geography", to keep it close to the C++ library it is wrapping? And since it is python vs C++, there is no naming conflict. Or do you think it is confusing because it might not be clear which of the two you are talking about? (most users will just use the python bindings, and so won't run into that confusion though)

jorisvandenbossche commented 1 year ago

we could just use "s2geography", to keep it close to the C++ library it is wrapping? And since it is python vs C++, there is no naming conflict

Although there is of course some conflict. For a package manager like conda, we would then need to use python-s2geography as the install name (for import s2geography), while on pip it could be just s2geography .. Yeah, that's certainly a downside of this option.

paleolimbot commented 1 year ago

Also, 's2geography' for the C++ library is a name I pulled out of my bum one day when my daughter was napping and could totally be changed (although maybe it's already a conda package?)

benbovy commented 1 year ago

I like the name "s2geography" for the C++ library!

Some other options:

arcly curvely spherely (EDIT) roundly

Those names are short, clear, at the same time close and distinct enough from shapely. They are also available on pypi / conda.

benbovy commented 1 year ago

Those names are short, clear, at the same time close and distinct enough from shapely. They are also available on pypi / conda.

Maybe a bit hard to pronounce, though? Any other (curved-shape-word)ly idea?

brendan-ward commented 1 year ago

Another option: spherically (if we're trying to stick to a ly suffix)

jorisvandenbossche commented 1 year ago

From the -ly options, I think I might like spherely the most (I can see the tagline "like shapely, but on the sphere")

I would personally still consider s2geography as an option as well, though (also for the python import name)

martinfleis commented 1 year ago

I'm on the same side as @jorisvandenbossche.

paleolimbot commented 1 year ago

+0.001 for "s2geography" (just because it opens the Python package up to doing more than just providing a compatible shapely interface).

benbovy commented 1 year ago

I like spherely (also the tagline!). spherically is nice too, although put in a sentence without any highlight it might be slightly confusing.

I'd prefer spherely over s2geography for the Python library because from the user point of view it is closer to shapely (similar API) and also to avoid being it too tied to the C++ package which is more an implementation detail IMHO (possible that the Python library provides additional features or includes other dependencies in the future?). There's also the conda package issue mentioned by @jorisvandenbossche, although that's a minor concern.

benbovy commented 1 year ago

+0.001 for "s2geography" (just because it opens the Python package up to doing more than just providing a compatible shapely interface).

Yes that makes sense too! Although I think it is OK if "spherely" vs. "shapely" do not share the same exact API and features.

That is actually my concern with using the same name "s2geography" for both the C++ and Python libraries: it restricts more the Python package to just providing Python bindings to the C++ library.

paleolimbot commented 1 year ago

Fair enough! 'spherely' is nicer to say too 🙂