JosiahParry / sdf

A back-end agnostic spatial data frame inspired by rust trait implementations

Philosophy #1

Open JosiahParry opened 1 year ago

JosiahParry commented 1 year ago

A spatial data frame (sdf) that is agnostic to the geometry type / geometry "back-end." This is inspired by the sf package, Rust traits, and vctrs.

The goal isn't to make a single standard for geometry representation but rather a single data frame front end for these geometry libraries. Much like how dplyr acts as a front end for SQL databases, data.table, spark, etc., an sdf should behave the same regardless of which geometry type you have.

Many functions are implemented in common across different geometry libraries. By adding methods for a specific geometry backend, that backend inherits the functionality of any sdf.

An sdf is a tibble with a single geometry column. This geometry column must be a vector with a c() method implemented such that c(geo, geo) returns a geo.
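A minimal sketch of that c() requirement, using a made-up toy class (`toy_geo` and `new_toy_geo()` are hypothetical names for illustration, not part of sdf or any geometry package):

```r
# Hypothetical toy backend: a geometry vector stored as a list of
# coordinate matrices, tagged with the S3 class "toy_geo".
new_toy_geo <- function(x = list()) {
  stopifnot(is.list(x))
  structure(x, class = "toy_geo")
}

# c() method: strip the class from every input, concatenate the
# underlying lists, and re-wrap so that c(geo, geo) returns a toy_geo.
c.toy_geo <- function(...) {
  pieces <- lapply(list(...), unclass)
  new_toy_geo(do.call(c, pieces))
}

geo <- new_toy_geo(list(cbind(x = 0, y = 0), cbind(x = 1, y = 1)))
combined <- c(geo, geo)
```

Without the method, c() on two such vectors would fall back to the default and return a plain list, losing the class.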

mdsumner commented 1 year ago

oh goodness yes, I just did this for wkb/wkt - nice one!

this looks very promising indeed, just fyi my key motivations are to get this as a tbl_lazy so we can pull OGR or GEOS pointers directly via GDAL and only materialize what's required from a given pipe-flow, I've been messing around with these in various ways but struggled with the more hard core aspects of the "handlers" framework ;)

JosiahParry commented 1 year ago

@mdsumner I think I understand conceptually what the handlers are but have no idea what is required from a low-level to implement one.

I think a generic spatial dataframe (sdf) would benefit from a default implementation that uses {wk} to convert to a "standard" geometry type (presumably sf), runs the method for that type, and then uses the handler to convert the result back to the original geometry type.
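A rough base R sketch of that fallback pattern, under heavy assumptions: the generics `as_standard()`, `from_standard()`, and `sdf_reverse()` and the class `toy_flat` are all hypothetical stand-ins, and a list of coordinate matrices plays the role of the "standard" type rather than real sf or {wk} objects.

```r
# Hypothetical generics for converting to and from a shared representation.
as_standard <- function(x, ...) UseMethod("as_standard")
from_standard <- function(x, like, ...) UseMethod("from_standard", like)

# Hypothetical operation. The default method round-trips through the
# standard representation, so a backend gets it without writing a method.
sdf_reverse <- function(x, ...) UseMethod("sdf_reverse")
sdf_reverse.default <- function(x, ...) {
  std <- as_standard(x)
  out <- lapply(std, function(m) m[rev(seq_len(nrow(m))), , drop = FALSE])
  from_standard(out, like = x)
}

# Toy backend: each feature is a flat numeric vector c(x1, y1, x2, y2, ...).
as_standard.toy_flat <- function(x, ...) {
  lapply(unclass(x), function(v) matrix(v, ncol = 2, byrow = TRUE))
}
from_standard.toy_flat <- function(x, like, ...) {
  structure(lapply(x, function(m) as.numeric(t(m))), class = "toy_flat")
}

flat <- structure(list(c(0, 0, 1, 1, 2, 0)), class = "toy_flat")
reversed <- sdf_reverse(flat)
```

The backend only supplies the two conversions; a backend with a faster native implementation could still define its own `sdf_reverse` method and skip the round trip.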

I have an existing issue on {wk} looking for more help https://github.com/paleolimbot/wk/issues/172

mdsumner commented 1 year ago

the spatial dataframe is the singular best part of sf and the saddest of all for not being generally available, it was one of the hardest and earliest things (that I observed being crucibled ...) and I literally don't understand why there's no boundary drawn around it

so import wk here? was one of my questions, there's obviously some secret magic with or in rsgeo ? I'll figure it out but definitely keen to chat about serious steps forward and getting both of us deeper into handles 😃

paleolimbot commented 1 year ago

I tend to use "is a geometry column" == "implements the wk_handle() S3 generic", implements wk_crs(), and vctrs::vec_is() == TRUE.

I get that wk_handle() is poorly documented...people actually using it is sufficient motivation to document it properly 🙂 ...you two are probably the first to care!

Also note that you can implement wk_handle() without doing anything in C/C++. It's slightly inefficient, but you can do something like:

wk_handle.my_custom_class <- function(x, handler, ...) wk_handle(sf::st_as_sfc(x), handler, ...)

...and that should work.

JosiahParry commented 1 year ago

I like the wk_handle() and vctrs::vec_is() check. But I'm not so sure about wk_crs(), which is strictly because the geo-rust geometries do not track nor consider CRS in the geometries (perhaps a selfish POV, and a wk method for rust geometries could just return NA).

I think realistically, the best solution in a semi-near future would be to use geoarrow. A geometry column would have to be able to represent itself as geoarrow? This would permit easy / standard conversion. From a rust perspective this means 0 copy transfer to geopolars (if using this + polars), 0 copy to geos, geo-rust, gdal, wkt, csv etc via geozero.

paleolimbot commented 1 year ago

because the geo-rust geometries do not track nor consider CRS in the geometries

I think that makes sense for rust to not track the CRS (GEOS does not either, nor does wk at the C level); however, at the "r vector" level, the ability to propagate a "crs" attribute is rather helpful. As you noted, you can always have your method return NULL to not participate.

I think realistically, the best solution in a semi-near future would be to use geoarrow

I agree...although from an immediate point of view, geoarrow doesn't quite exist yet. Perhaps a concrete way of expressing that would be that it must implement nanoarrow::as_nanoarrow_array_stream() where the data type is a geoarrow extension type? When I talk about the wk handler system getting superseded by something, that something may well be "iterate over chunks of features" (which that method will let you do).

mdsumner commented 1 year ago

I'm reading ruminations in geodesy, and the explanation of the context sounds a lot like handlers ... do you think that's a reasonable comparison @paleolimbot ?

https://github.com/busstoptaktik/geodesy/blob/main/ruminations/000-rumination.md#a-deep-dive

I hadn't realized that deeper level about the context in GEOS/PROJ either, specifically this part in the geodesy doc:

So forget about discussions on whether transformation definitions should be read from a local SQLite file database, a connection to an external database, or from a local text file: These unfruitful discussions can be laid to rest simply by providing a Context accessing resources in whichever form is most convenient for the case at hand.

That's sounding like what I wanted when I started silicate, which btw will be trivial in a spatial dataframe context: simply keep the vertices somewhere and index them from the geometry vector. A vctrs geom could maintain a global vertex pool as a property (perhaps in file or database or arrow). I'm excited that we might get broader discussion of the value of these various ways of storing spatial data, and intermediate forms between the common formats.
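To make the vertex pool idea concrete, here is a small base R sketch. It is not how silicate stores things; `vertex_pool`, `geoms`, and `materialize()` are hypothetical names, and the pool is just an in-memory data frame.

```r
# Shared pool of unique vertices; each row is stored exactly once.
vertex_pool <- data.frame(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1))

# The geometry vector holds only integer indices into the pool.
# These two triangles share vertices 1 and 3.
geoms <- list(c(1L, 2L, 3L), c(1L, 3L, 4L))

# Materialize coordinates for one geometry only when they are needed.
materialize <- function(index, pool) {
  as.matrix(pool[index, , drop = FALSE])
}

# A transformation touches the pool once and every geometry sees it.
shifted_pool <- transform(vertex_pool, x = x + 10)

second <- materialize(geoms[[2]], vertex_pool)
second_shifted <- materialize(geoms[[2]], shifted_pool)
```

Because the indices never change, shifting or reprojecting the pool is a single pass over the vertices no matter how many geometries reference them.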

JosiahParry commented 1 year ago

I need to read those ruminations....fwiw the rust geoarrow WIP made it onto GitHub yesterday with compatibility with geodesy https://github.com/geoarrow/

paleolimbot commented 1 year ago

(I think https://github.com/kylebarron/geoarrow-rs ...I saw a quick demo of the WASM bit but have yet to give it a try!)

mdsumner commented 1 year ago

what I really want from a handler (and apologies, I think I've been down this road but I do want @JosiahParry to also see this) is

xy <- wk::wk_coords(x)
<do somethng_to_xy[_possiblyincluding_part/level/nesting/format-record-details]>
wk::coords(x) <- xy

And, do that for any handled format :)

In various ways I've been pursuing that since pre-2016 and I've had some good glimpses - but the formats have actually changed more often than the tools we have for dealing with them. I was only thinking about materialized forms back then too, and now I think this still has huge relevance, it's about when the work gets done, so the scheduler can look back through the code (should I upfront reproject all the coords or do it per-feature ...).

I appreciate this is probably not going to cut through but it's so clear to me rn with what I'm trying to do, but getting the capability as a general thing is a total distraction from what I need to do and I only hope I can get back to putting some effort into that again.

Nowosad commented 1 year ago

Hi @JosiahParry I read the readme with great interest. I assume you will mention this during the event in Munster?

The only question I have atm is about CRSs. Why are they not mentioned in the object header? Are they kept somewhere, or does the Spatial Data Frame not store/process them?

JosiahParry commented 1 year ago

@Nowosad, great question! It's not mentioned because as I made this prototype I was (and still am) working primarily with geo rust geometries, which do not have a CRS associated with them. @paleolimbot did point out that this is just an attribute that can be added to the vector containing the geometries themselves and carried along for the ride.

In my view (I'm keen to hear more of them), the CRS would be an attribute that is associated with the vector of geometries and would not be required. Imagine if, for some weird reason, we were able to work with geometries from a game engine like bevy's line "gizmo". A CRS wouldn't make sense in that context but they are still geometries though!

I think this could be improved by adding a CRS field to the print method which would report based on a generic crs(). The CRS would not be a hard requirement to be a geometry though.
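A minimal sketch of what an optional CRS could look like, assuming a hypothetical `crs()` generic defined locally (not an existing sdf function) and a made-up `toy_crs_geo` class. Backends that do not track a CRS simply fall through to the default.

```r
# Hypothetical generic: report the CRS of a geometry vector, if any.
crs <- function(x, ...) UseMethod("crs")

# Default: the backend does not participate, so there is no CRS.
crs.default <- function(x, ...) NULL

# A toy backend that carries the CRS along as an attribute.
crs.toy_crs_geo <- function(x, ...) attr(x, "crs")

# Header line that a print method could include.
crs_header <- function(geometry) {
  value <- crs(geometry)
  if (is.null(value)) "CRS: <none>" else paste("CRS:", value)
}

no_crs <- structure(list(), class = "toy_plain_geo")
with_crs <- structure(list(), class = "toy_crs_geo", crs = "EPSG:4326")
```

Here `crs_header(no_crs)` reports no CRS while `crs_header(with_crs)` reports the attribute, so the print method works either way and the CRS stays optional.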

Nowosad commented 1 year ago

@JosiahParry all of that makes sense.

Another question I have is the behavior of the _geometry() functions and others (sidenote: naming can be improved, I've seen #2): sf works differently if the data has geographic or projected CRS (s2 vs geos, in short). How would that work here?

JosiahParry commented 1 year ago

The general idea I have is that we always defer to the implementing library for geometric functions. I think it is both convenient and somewhat odd that sf will change the geometry library being used based on the CRS.

If the geometry column is a vector of {geos} geometries, the *_geometry() functions will use the provided .geos_geometry method. If we have a geos geometry vector, I would assume it would use functions from that library only.

Consider another example where we might have a generic for calculating distance, something like sdf_distance(). We could check the CRS to see if it is spherical or planar and choose an appropriate distance method in the implementing library. The geo-rust libraries could be a good example here because they provide a standard Euclidean distance measure and 3 geodesic ones (Vincenty, haversine, and geodesic). The dispatched method could make a choice based on the CRS attribute, for example.
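A hedged sketch of that dispatch in base R. Everything here is illustrative: `sdf_distance()` is only proposed in this thread, `toy_point` is a made-up class, the "geographic" test is a naive string comparison against "EPSG:4326", and the haversine uses an assumed mean Earth radius of 6371008.8 m.

```r
# Planar distance between two coordinate pairs.
euclidean <- function(a, b) sqrt(sum((a - b)^2))

# Great-circle distance in metres for lon/lat pairs given in degrees.
haversine <- function(a, b, radius = 6371008.8) {
  to_rad <- pi / 180
  lon1 <- a[1] * to_rad; lat1 <- a[2] * to_rad
  lon2 <- b[1] * to_rad; lat2 <- b[2] * to_rad
  h <- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)^2
  2 * radius * asin(sqrt(h))
}

# Hypothetical generic owned by the data frame library.
sdf_distance <- function(x, y, ...) UseMethod("sdf_distance")

# Method owned by the (toy) geometry backend: it inspects the CRS
# attribute and picks the measure itself.
sdf_distance.toy_point <- function(x, y, ...) {
  a <- as.numeric(x)
  b <- as.numeric(y)
  if (identical(attr(x, "crs"), "EPSG:4326")) haversine(a, b) else euclidean(a, b)
}

toy_point <- function(x, y, crs = NULL) {
  point <- structure(c(x, y), class = "toy_point")
  attr(point, "crs") <- crs
  point
}

planar <- sdf_distance(toy_point(0, 0), toy_point(3, 4))
geodesic <- sdf_distance(toy_point(0, 0, "EPSG:4326"), toy_point(0, 90, "EPSG:4326"))
```

The data frame layer only calls `sdf_distance()`; whether the CRS matters at all is decided inside the backend's method.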

To me, {sdf} would be a standard set of behaviors for a spatial data frame. Then geometry libraries like geos, s2, rsgeo, or whatever else may come (maybe geomesa, for example) would write methods to adhere to the standard. They can choose to calculate Euclidean or geodesic distance, or handle it based on CRS. The overall goal is to take implementation details out of the data frame library and put them into the hands of the geometry library.