Exposing various Geography subclasses?

benbovy / spherely

Manipulation and analysis of geometric objects on the sphere.

https://spherely.readthedocs.io/

BSD 3-Clause "New" or "Revised" License

119 stars 8 forks source link

Exposing various Geography subclasses? #27

Closed jorisvandenbossche closed 1 month ago

jorisvandenbossche commented 1 year ago

Currently we have a Geography base class and the Point, Linestring and Polygon subclasses. s2geography itself has in addition also a GeographyCollection.

https://github.com/benbovy/spherely/pull/26 is adding more subclasses, and I questioned the need for that: do we want / need all those subclasses? (or at least when adding them, we should be able to answer for ourselves why we do it / what value it brings)

jorisvandenbossche commented 1 year ago

Copying @benbovy comments from https://github.com/benbovy/spherely/pull/26#issuecomment-1470316475:

Maybe something to also discuss more generally: do we want / need all those subclasses?

Do you mean that alternatively we could expose only one Geography class (with a geometry type property) along with top-level creation functions for each kind of geometry?

I see two advantages of having subclasses:

it is convenient for quick creation of a few simple features (perhaps more readable too)
it is consistent with Shapely (is there any plan to eventually drop those subclasses in Shapely?)

One argument against subclasses is that it is a bit more code to maintain (maybe some duplicate logic here and there), but that's not much in my opinion.

Another alternative would be to expose a unique class for Point vs. MultiPoint, LineString vs. MultiLineString, etc. like in s2geography.

this also departs from Shapely's API
I'm wondering what is the actual cost in performance of reusing the same class for single features (i.e., the unique S2 object put into a 1-element std::vector), especially for trivial functions applied to large arrays of single features (a rather common use case?). I have no idea, though, maybe splitting the classes in s2geography would be premature optimization.

jorisvandenbossche commented 1 year ago

Do you mean that alternatively we could expose only one Geography class (with a geometry type property) along with top-level creation functions for each kind of geometry?

Potentially that, or stick with the current situation of having just having Point/Line/Polygon subclasses.

I see two advantages of having subclasses:

it is convenient for quick creation of a few simple features (perhaps more readable too)

Yes, but we could also have easy constructor functions as an alternative

it is consistent with Shapely (is there any plan to eventually drop those subclasses in Shapely?)

There is no plan to drop the subclasses in Shapely. And I am certainly convinced we should try to provide an API that is familiar for shapely users, but that doesn't necessarily mean it has to be exactly the same. Not all things from Shapely/GEOS will make sense anyway, and at the same time there will also be additional concepts from S2 we will want to expose here that doesn't have an equivalent in Shapely/S2.

Another alternative would be to expose a unique class for Point vs. MultiPoint, LineString vs. MultiLineString, etc. like in s2geography.

That's basically the current situation, right?

My understanding is that the s2geography PointGeography, PolylineGeography and PolygonGeography already can hold "zero or more" points/lines/polygons. The distinction between single and multi object is very much something from the simple features standard that isn't always that useful.

There is also a cost when we have to wrap an array of s2 geographies into python objects that we have to infer which python subclass to use for every object. For example for an PolygonGeography / S2Polygon, we would have to go over the loops (rings) to infer if this represents logically a single Polygon or a MultiPolygon (i.e. what wkt-writer.cc does right now in s2geography for printing)

martinfleis commented 1 year ago

As long as we can do a roundtrip from shapely to S2, I don't think we need to replicate shapely's classes 1:1. From what @jorisvandenbossche describes, it seems that we should be able to do the roundtrip without a need for Multi* classes, in most cases.

I am just not sure what happens for single geometry Multi* geometry. In shapely, that is perfectly valid (e.g. 'MULTIPOINT (0 0)') and common object. If we don't have MultiPoint on the spherely side, I guess that it would become a Point when converting from shapely and back. That may be a reason for having all those classes 1:1.

benbovy commented 1 year ago

do we want / need all those subclasses? or at least when adding them, we should be able to answer for ourselves why we do it / what value it brings

I agree, although reciprocally we might also ask ourselves why we shouldn’t do it?

I’m not sure about the current state of the simple features standard, but I see in many places the “multi” geometry entities explicitly included in the data model. Why not having these as subclasses in Spherely too? Especially that the “multi” subclasses added in #26 represent only a few dozen of binding C++ code lines at most, they don’t add much maintenance burden.

I guess one important question is also how often are used "multi" geometries vs. "single" geometries in real-world scenarios?

Not all things from Shapely/GEOS will make sense anyway, and at the same time there will also be additional concepts from S2 we will want to expose here that doesn't have an equivalent in Shapely/S2.

Agreed as well, as long as this would keep reasonably smooth the integration of Spherely as an alternative geometry engine to Shapely.

My understanding is that the s2geography PointGeography, PolylineGeography and PolygonGeography already can hold "zero or more" points/lines/polygons. The distinction between single and multi object is very much something from the simple features standard that isn't always that useful.

Yes that’s right, but to me it looks like an implementation detail rather than an end-user API design choice (assuming that the initial motivation of refactoring R’s s2 code into s2geography wasn’t to provide 1st-class C++ API).

I don’t have enough experience to have a proper opinion on whether or not this is useful to have both single and multi objects, but from my naive point a view it may avoids some confusion (one-to-one correspondance / exact match between the class name and the WKT representation) while it doesn’t add much complexity to the whole data model.

If we make the “multi” subclasses inherit from their single subclass counterpart, perhaps this would be more consistent with the simple features data model and this might simplify the implementation of some functionality?

There is also a cost when we have to wrap an array of s2 geographies into python objects that we have to infer which python subclass to use for every object. For example for an PolygonGeography / S2Polygon, we would have to go over the loops (rings) to infer if this represents logically a single Polygon or a MultiPolygon (i.e. what wkt-writer.cc does right now in s2geography for printing)

Hmm wouldn’t it be a good reason to have distinct classes in s2geography and/or a virtual function in the s2geography::Geography base class that returns the kind of geometry (incl. single vs. multi)?

jorisvandenbossche commented 1 year ago

My understanding is that the s2geography PointGeography, PolylineGeography and PolygonGeography already can hold "zero or more" points/lines/polygons.

Yes that’s right, but to me it looks like an implementation detail rather than an end-user API design choice (assuming that the initial motivation of refactoring R’s s2 code into s2geography wasn’t to provide 1st-class C++ API).

I am not sure about that (but pinging @paleolimbot since this starts to get into s2geography design questions as well). The PolygonGeography in s2geography wraps a single S2Polygon, but AFAIU this S2Polygon is inherently both a Polygon or MultiPolygon, depending on the number of loops. Quoting a comment from geography.h: Note that a single S2Polygon (from the S2 perspective) can represent zero or more polygons (from the simple features perspective).

(the R s2 package also doesn't expose any subclasses to the R user, so from their perspective, I think having those subclasses in s2geogrpahy is less important)

There is also a cost when we have to wrap an array of s2 geographies into python objects that we have to infer which python subclass to use for every object. ...

Hmm wouldn’t it be a good reason to have distinct classes in s2geography and/or a virtual function in the s2geography::Geography base class that returns the kind of geometry (incl. single vs. multi)?

Maybe, but the inference still has to happen somewhere (since S2 itself doesn't make this distinction). So if not inferring in spherely, then it's still s2geography that has to do that in such a geog_type property (btw, I think such a property would be useful to inspect as a user, but I think whether we use it to determine subclasses is a separate question). For example, if you do an intersection overlay operation on PolygonGeography / S2Polygon objects, you are sure to get back a new PolygonGeography / S2Polygon (in contrast to GEOS where you can get mixed dimension results, S2 guarantees preservation of dimension AFAIU). But this resulting S2Polygon can still logically represent both a Polygon or MultiPolygon, even if the inputs where logical Polygons.

benbovy commented 1 year ago

the R s2 package also doesn't expose any subclasses to the R user, so from their perspective, I think having those subclasses in s2geogrpahy is less important

This is also my understanding even though I'm not familiar with R idioms. By contrast, having a detailed hierarchy of classes exposed in Python feels quite idiomatic to me (as well as in C++).

Note that a single S2Polygon (from the S2 perspective) can represent zero or more polygons (from the simple features perspective).

This raises an important question: (if meant to be used directly in C++ applications) should s2geography's API be closer to the s2geometry perspective or the simple features perspective? I'd rather expect the latter given that s2geography is described as a simple features compatibility layer built on top of s2geometry, but again this is my naive view as non-experienced end-user and there are likely other considerations into play.

The same question may be asked for Spherely (if I remember well this was also discussed when picking a name for the Python package).

If we use in Spherely a unique class for representing both single and multiple geometries of a given type, one concern I have is which name to choose for the class? Taking points as an example, I feel stuck between:

the name PointGeography (like in s2geography), which sounds good as this can basically mean "some geographic feature made of one or more points", but it departs more from shapely
the name Point (like in shapely), which sounds a bit weird for multiple points whereas I'd expect that it would just mean "it's a point" (actually, s2geometry's S2Polygon for multiple polygons sounds weird to me too)

but the inference still has to happen somewhere

AFAIU the inference cannot be fully avoided regardless of whether we use distinct classes for single vs. multi features (at least for polygons), e.g., the WKT writer example. The geog_type property might help optimizing this inference when it is possible. Besides the WKT writer example, are there other cases (how many) when there is no other way around than doing full inference from the underlying S2 object(s)?

paleolimbot commented 1 year ago

My initial reaction is that the geographies themselves don't have many methods outside of the core interface, so exposing them is probably not all that useful. The C++ hierarchy is designed to support a flexible data storage...there are other ways to represent points, lines, and polygons than std::vector<std::unique_pointer<S2Polyline>> and friends (e.g., maybe a single S2Polyline if it becomes clear that's more efficient in some cases)?

You could, of course, emulate a Point, Linestring, Polygon or whatever class hierarchy you want and implement methods using the s2_XXX() functions (but tying it to s2geography is, in my opinion, a bad idea, because you won't be able to support other ways to represent those geometries). I do have every intention of adding those other ways to represent geometries (but it may be a while until I get to it). In short, the C++ subclasses are about storage (and should stay that way).

if you do an intersection overlay operation on PolygonGeography / S2Polygon objects, you are sure to get back a new PolygonGeography / S2Polygon

I am pretty sure you can get a point or line back here (although there is an option that I put in somewhere that lets you choose the dimensionality of the result)

Besides the WKT writer example, are there other cases (how many) when there is no other way around than doing full inference from the underlying S2 object(s)?

Search for dynamic_cast...there are more than I would like. Basically anywhere that uses a method of an S2Polyline or S2Polygon (as opposed to using a MutableS2ShapeIndex or iterating over the shapes manually).

benbovy commented 1 year ago

Thanks for your input @paleolimbot. Yes it makes sense that s2geography provides subclasses that are designed around flexible storage. I think that emulating Point, LineString, etc. is important for an end-user API like in Spherely, where the design should be focused more on the user experience than on the storage of the underlying S2 objects.

I understand that the Geography sub-types do not provide much more functionality beyond the core interface, but in my opinion still providing them as classes in Spherely (so that we have matching SF data model vs. Python class hierarchy) would make it super easy for users to get started, assuming that many of them are already familiar with Shapely or other tools using the SF model. Or alternatively we could only expose one Geography class for all sub-types and identify them using a property and/or their WKT representation (like in R’s sf?). Anything in-between would make things confusing IMHO.

With more subclasses there would be indeed a bit more work to do when casting an output object down to its right type before returning the result back to Python, but I don’t think it is a big deal. A few helpers with no big impact on performance may address this.

No problem for not having all the final (emulated) Geography subclasses tied to s2geography, as long as it is possible to extend these in 3rd party libraries like Spherely. There are a few dirty workarounds used in #26 that I think could be solved by adding a few more virtual methods to s2geography::Geography for, e.g., cloning objects, WKT formatting, etc. (it could also replace at least some of the dynamic_cast). I’ll open a new issue in the s2geography repository with more details about this.

paleolimbot commented 1 year ago

I can't speak to what Python users do or don't want - I think if they expect those subclasses to exist they should - I was mostly making the point that those classes are doing something (exposing functionality unique to a particular geometry type) different than s2geography's C++ classes (which expose the same functionality but allow for flexible storage options).

I see a lot of C++ in #26 and I wonder if you actually want is a pure Python hierarchy (backed by an array of Geography that have possibly all been validated to be of the type expected by the class)? (Far be it from me to make Python design choices though 🙂 )

benbovy commented 1 year ago

I see a lot of C++ in https://github.com/benbovy/spherely/pull/26 and I wonder if you actually want is a pure Python hierarchy (backed by an array of Geography that have possibly all been validated to be of the type expected by the class)?

Hmm I haven't really thought about encapsulating an array of Geography into a custom class exposed in Python... There could be some advantages of doing it (uniform and/or invariant properties), but in general using raw Numpy arrays is easier to integrate with the rest of the Python ecosystem.

A good part of the additions in #26 relate to the creation, validation or copy of single objects, which we might want to drop at some point and instead reuse the functions in s2geography (this will be easier without having wrappers around s2geography classes).