geoarrow / geoarrow-c

Experimental C and C++ implementation of the GeoArrow specification
http://geoarrow.org/geoarrow-c/
Apache License 2.0
23 stars 3 forks source link

[R]: geoarrow_vctr class #70

Open anthonynorth opened 1 year ago

anthonynorth commented 1 year ago

Add minimal geoarrow_vctr class for geoarrow vectors. Enables integration with other packages wk, sf, geos etc.

See https://github.com/geoarrow/geoarrow-c/issues/69#issuecomment-1778320000

anthonynorth commented 1 year ago

I can make a start on this piece if you're taking PRs.

Are we implementing a wk_vctr or are we taking a dependency on {vctrs} and implementing a vctrs_vctr?

Should sf::st_as_sf.geoarrow_vctr etc be provided by this package, similar to what was done in {wk}?

paleolimbot commented 1 year ago

I would love a PR! The basic minimum is something that can be put into a data.frame() or tibble() and implements wk_handle() + st_as_sfc() + as_nanoarrow_array_stream() (i.e., don't worry about subsetting if that causes grief).

I would probably use wk_vctr for now unless you run into problems. There are some low-level spatial developers that are in the 'no dependencies ever' camp and it doesn't take much overhead to implement the basics.

I think sf::st_as_sf.geoarrow_vctr() should live in this package for now.

paleolimbot commented 1 year ago

Are you still interested in having a stab at this? I have some time to take a look at this this week and am happy to give it a go (but also happy to defer to any work you've done already if you're interested 🙂 )

anthonynorth commented 1 year ago

I had intended to action this (almost) 2 weeks ago.

I am still interested in having a stab at this, but I've been focused on other projects.

I haven't implemented anything yet.

paleolimbot commented 1 year ago

No worries! I haven't gotten to it yet either. Just post a note here if/when you do (I'll do the same).

anthonynorth commented 1 year ago

Should geoarrow() replicate the behaviour in paleolimbot/geoarrow, where geoarrow::geoarrow() can accept any handleable? This makes geoarrow() pretty much an alias of as_geoarrow().

Or, should geoarrow() instead require a geoarrow array? This would be similar behaviour to the vectors in {wk}.

# untested example for wk-like
# need a suitable default for x
geoarrow <- function(x = as_geoarrow_array(wk::wkb()), crs = wk::wk_crs_auto(), geodesic = FALSE) {
  validate_geoarrow_array(x) # doesn't exist
  new_geoarrow(x)
}

new_geoarrow <- function(x = as_geoarrow_array(wk::wkb()), crs = wk::wk_crs_auto(), geodesic = FALSE) {
  structure(x, class = c("geoarrow_vctr", "wk_vctr"), crs = crs, geodesic = geodesic)
}

as_geoarrow <- function(x) {
  new_geoarrow(as_geoarrow_array(x), wk::wk_crs(x), wk::wk_is_geodesic(x))
}
paleolimbot commented 1 year ago

Or, should geoarrow() instead require a geoarrow array?

I think the stricter option is better, here, or you could omit the function since I don't know exactly how it would get used. It might be idiomatic to do something like vec_cast(something, geoarrow()), but even then it's not quite enough because a cast would require a stronger type?

anthonynorth commented 1 year ago

I think the stricter option is better

I agree.

A couple of usecases for a geoarrow() constructor:

A little off topic: a minimal implementation should be printable. I think we need to subset the geoarrow_vctr to achieve this without wasting a lot of memory?

paleolimbot commented 1 year ago

Those uses, do make sense...maybe it should be restricted to creating just a zero-size version for the "make me a ptype" use case?

I think we need to subset the geoarrow_vctr to achieve this without wasting a lot of memory?

It would require a slice, anyway. You can use nanoarrow_array_modify() to make a shallow copy that reduces the length of whatever the last chunk is. Feel free to punt that to another PR...I have a few functions I started on while on a plane a few weeks ago that may help. I'll try to put them up as a draft PR shortly.

paleolimbot commented 1 year ago

Ok, https://github.com/geoarrow/geoarrow-c/pull/75 . It's almost all about how to resolve a chunk using a binary search and various ways to access that. Maybe it would help if I polished up that and removed the vctr bits?

anthonynorth commented 1 year ago

Those uses, do make sense...maybe it should be restricted to creating just a zero-size version for the "make me a ptype" use case?

Maybe. I don't have a strong opinion on it. This is the same pattern used in s2 and geos.

Ok, https://github.com/geoarrow/geoarrow-c/pull/75 . It's almost all about how to resolve a chunk using a binary search and various ways to access that. Maybe it would help if I polished up that and removed the vctr bits?

I skimmed and saw only 1 trivial bug that can't be reached with the public api currently. I don't know that we need to remove the vctr bits. We can build from there

anthonynorth commented 12 months ago

How should we deal with non-contiguous slices? We'll need this for data.frame manipulation.

Should we shallow copy, allowing indices to be any integer vector within range? I think this could potentially create many nanoarrow::nanoarrow_array_modify() chunks.

Or should we clone the slice?

vec <- geoarrow::as_geoarrow_vctr(wk::xy(1:5, 1:5))
vec[c(1, 3)]
#> Error in `[.geoarrow_vctr`(vec, c(1, 3)): Can't subset geoarrow_vctr with non-slice (e.g., only i:j indexing is supported)

Created on 2023-12-05 with reprex v2.0.2

paleolimbot commented 12 months ago

How should we deal with non-contiguous slices?

The easiest way would be to convert to an arrow::ChunkedArray, subset that, and then recreate the vctr. I think that's what the previous version did. Someday I might get to implementing a filter or take specific to geoarrow-encoded arrays to remove the optional arrow dependency (or maybe we can use something wrapping arrow-rs to do the subset).

We'll need this for data.frame manipulation.

Maybe...it could also just be that it's a placeholder for now (although maybe the error message should reflect that: "must cast geoarrow_vctr column to native vector type before subsetting" or something). Notably, it should "just work" when somebody calls st_as_sf() on a data.frame containing one of those columns (but this has to exist on CRAN before I can PR that into sf).

paleolimbot commented 12 months ago

Or, as you noted, which would totally work, you could walk the indices searching for the minimum number of slices that can represent the new indices, which is potentially a very large number of slices. That will probably be slow and I would maybe rather error until it can be fast!

anthonynorth commented 12 months ago

I agree, walking the indices and making slices is very complicated and potentially more costly than a copy.

An optional dependency on arrow is probably safe. You could create a geoarrow_vctr without having arrow installed, but you'd have to start from something in R first to do that, I think? Seems unlikely to me.

Somewhat related and a bit of a left-field idea, is there value in a wk::wk_handle_indices() which could support this usecase in geoarrow, plus other formats?

wk_handle_indices(handleable, geoarrow_writer(), indices) |>
  as_geoarrow_vctr()
paleolimbot commented 12 months ago

Is there value in a wk::wk_handle_indices() which could support this usecase in geoarrow, plus other formats?

Possibly! The previous incarnation of geoarrow did something like that: https://github.com/paleolimbot/geoarrow/blob/master/src/geoarrow-handle-wk.cpp#L290-L315 . I am not sure in retrospect that it was a good idea for the specific use case of the vctr...it did allow for randomly sorting a vctr, but it also kept possibly large quantities of data from being freed until the "take" operation was resolved somehow. I think R users are more likely to be able to reason about an eager "take".

anthonynorth commented 12 months ago

I had imagined wk_handle_indices() would be performing a copy. The difficulty here, I think, is that wk readers are reading in order and we don't want that here. Re-writing readers to allow for arbitrary indices is a big change, I think?

My initial thinking was to implement this via a buffered filter, which holds a copy of each geometry in indices until we've read enough to pass onto the next handler. We only need to buffer while there's geometries we haven't yet seen which must be passed before what we have read (e.g. c(5, 1)). Seems to be a similar idea to this, just with wk flavour.

Worst case (reverse sort) this is 2 copies of the vector, so quite inefficient.