duckdb / duckdb-r

The duckdb R package
https://r.duckdb.org/

Spatial Extension support #55

Closed eitsupi closed 6 months ago

eitsupi commented 9 months ago

I saw this post of ibis. https://ibis-project.org/posts/ibis-duckdb-geospatial/

It would be great if the R client also had an integration with the Spatial Extension. (For now, it seems to be difficult to handle because it is converted to the raw type of R)

Perhaps integrating with the geoarrow package would make sense? @paleolimbot Sorry for tagging you, but do you have any perspectives on such integrations?

paleolimbot commented 9 months ago

I would definitely be excited!

I think the general issue is that by default the data comes back in an internal BLOB format (you need to call st_asbinary() manually to get WKB that can be parsed by wk or sf).
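A rough sketch of that manual step, not something the duckdb package does for you: request WKB in the query itself, then wrap the returned raw vectors with wk. The helper names `wkb_query()` and `read_with_wkb()` and the default column name `geom` are made up for illustration. `ST_AsWKB()` is the spatial-extension function assumed here, and the connection is assumed to already have the extension loaded.

```r
# Hypothetical helper: build a query that swaps the internal geometry
# representation for WKB. Table and column names are illustrative only.
wkb_query <- function(table, geom = "geom") {
  sprintf(
    "SELECT * EXCLUDE (%s), ST_AsWKB(%s) AS %s FROM %s",
    geom, geom, geom, table
  )
}

# Hypothetical helper: run the query and mark the geometry column as WKB.
# Assumes `con` is a DBI connection to DuckDB with the spatial extension
# loaded, and that BLOB columns arrive in R as a list of raw vectors.
read_with_wkb <- function(con, table, geom = "geom") {
  df <- DBI::dbGetQuery(con, wkb_query(table, geom))
  df[[geom]] <- wk::wkb(df[[geom]])  # no CRS is attached in this sketch
  df
}
```

The resulting column should then be convertible with `sf::st_as_sfc()` if an sf geometry column is what you need.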

I'm hoping to get the Arrow output directly as geoarrow extension arrays ( https://github.com/duckdb/duckdb_spatial/issues/153 ) so that the geoarrow package ( https://github.com/geoarrow/geoarrow-c/tree/main/r/geoarrow ) can handle the whole thing automagically!

cboettig commented 9 months ago

I find duckdb works really nicely from R on spatial data already. I have a small wrapper since the syntax is a bit verbose otherwise, that will read in from duckdb as an sf object. We can of course use all the spatial extension functions before reading into R, which is nice for datasets that are too big for RAM.

Quick example with a lazy read that avoids downloading the data, reads a few different spatial vector formats, and performs a spatial join:

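remotes::install_github("cboettig/duckdbfs")
library(dplyr)
library(duckdbfs)  # provides open_dataset(), spatial_join() and to_sf()

url <- "https://github.com/cboettig/duckdbfs/raw/main/inst/extdata/world.gpkg"

# Lazy reference to a remote GeoPackage; nothing is downloaded up front
countries <-
  paste0("/vsicurl/", url) |>
  duckdbfs::open_dataset()

# Lazy reference to a remote FlatGeobuf file
cities <-
  paste0("/vsicurl/https://github.com/cboettig/duckdbfs/raw/",
         "main/inst/extdata/metro.fgb") |>
  duckdbfs::open_dataset()

# Filter and join inside duckdb, then collect the result as an sf object
countries |>
  filter(continent == "Oceania") |>
  spatial_join(cities, by = "st_intersects", join = "inner") |>
  to_sf()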

One problem I have had, though, is that the spatial extension does not seem to be available for Windows users. It appears that Windows extensions have to be built separately for R to be compatible with the rtools chain, and cannot use the Windows extension that all other duckdb platforms use(?). Core extensions are now built for Windows, but as I understand it, the spatial one is not. @krlmlr any insights here? https://github.com/duckdb/duckdb_spatial/issues/158 It would be great for Windows users to be able to use duckdb for large spatial operations too...

eitsupi commented 9 months ago

R to be compatible with the rtools chain, and cannot use the Windows extension that all other duckdb platforms use(?)

That's correct: every DuckDB build for Windows other than the R one uses the MSVC ABI, and only R uses the GNU ABI.

cboettig commented 9 months ago

Thanks @eitsupi , that matches my understanding. It's great that all of the "core" extensions are built separately with the GNU ABI for R on windows. It's really sad that windows users can't access the spatial extension though at this time. @krlmlr -- any idea if we might get binaries for windows R users for the spatial extension?

eitsupi commented 9 months ago

I'm not familiar with how the DuckDB extensions are distributed, but they appear to be entirely defined by GitHub Actions, so why not simply port the following workflow to https://github.com/duckdb/duckdb_spatial? https://github.com/duckdb/duckdb/blob/a55f89cd9e956b3e575532e058c230461799ac64/.github/workflows/R.yml#L29-L69

In any case, that's another issue.

cboettig commented 9 months ago

Thanks @eitsupi , that's great! it does look like that recipe could be adjusted to build the spatial extension. It's not clear to me what repository ought to be implementing it -- the workflow you link to appears to depend on a custom action defined in the main repository (https://github.com/duckdb/duckdb/blob/a55f89cd9e956b3e575532e058c230461799ac64/.github/actions/build_extensions/action.yml) which in turn depends on scripts specific to that repository -- I guess that would all need to be duplicated in the spatial extension repo?

I'm very sorry if this was the wrong thread to address this issue, though it is specific to R. In any event, other than Windows support I don't see what more needs to be done to support the spatial extension for duckdb-r? (Though I was glad to see the Ibis support, it doesn't look like the approach there supports directly passing through geospatial functions the way dbplyr does, and so they are implementing these bit by bit, while on the R side we seem to be able to use any function available in the geospatial extension immediately, e.g. the new st_quadkey.)
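To illustrate the pass-through, here is a minimal sketch that only renders SQL and never touches a database. The table name `pts`, the column `geom`, and the zoom level are placeholders, and the simulated connection stands in for a real duckdb one, so the quoting in the output may differ from what duckdb would receive.

```r
library(dplyr)
library(dbplyr)

# A lazy table on a simulated connection: enough to inspect the SQL that
# dbplyr would generate, without a database.
pts <- lazy_frame(geom = "placeholder", .name = "pts", con = simulate_dbi())

# dbplyr has no translation for ST_QuadKey, so the call is written into
# the SQL as-is and left for the database to resolve.
query <- pts |>
  mutate(quadkey = ST_QuadKey(geom, 10L)) |>
  sql_render()

print(query)
```

Because nothing on the R side needs to know about the function, a newly added spatial function should be usable as soon as the extension itself provides it.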

krlmlr commented 9 months ago

The extensions help page has:

Only core extensions are distributed for the following platforms: windows_amd64_rtools, ...

On the other hand, the overview page doesn't mention core extensions. According to the list of official extensions, "spatial" is an official extension.

@hannes @Mytherin: Who can help shed some light here?

eitsupi commented 9 months ago

I don't see what more needs to be done to support the spatial extension for duckdb-r?

I wanted to point out that DuckDB lacks the ability to convert spatial types to the appropriate R types. (In other words, the duckdb R package needs to map the DuckDB spatial types properly.)

I don't think there is anything this repository can do about getting the spatial extension to work in R on Windows. That only requires configuring CI to build and upload the extension where appropriate.

cboettig commented 9 months ago

Thanks @krlmlr and co! It would be wonderful if spatial could be added to the GitHub action that builds the other official extensions for R. (I guess it's not obvious whether building the duckdb R extension for Windows belongs in the repo that handles all the other duckdb extensions for Windows, the repo that handles duckdb for R, or the repo that handles the spatial extension :sweat_smile: )

@eitsupi thanks for your help. Re mapping to native R types, this is just a matter of using the correct read methods when using those types; e.g. if the desired R type is an sf object, we merely need to convert the geometry to WKB and then specify the geometry column correctly in st_read(), https://github.com/cboettig/duckdbfs/blob/main/R/to_sf.R#L46-L51
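Roughly, the idea in the linked lines looks like the sketch below. This is a paraphrase under assumptions rather than the package's exact code: `lazy_to_sf()` is a made-up name, the geometry column is assumed to be called `geom`, and `x` is assumed to be a lazy dbplyr table on a duckdb connection with the spatial extension loaded.

```r
# Hypothetical helper: convert the geometry to WKB inside DuckDB, then let
# sf's DBI reader parse the column named in `geometry_column`.
lazy_to_sf <- function(x, conn = dbplyr::remote_con(x)) {
  sql <- x |>
    dplyr::mutate(geom = ST_AsWKB(geom)) |>  # passed through to DuckDB
    dbplyr::sql_render()
  sf::st_read(conn, query = as.character(sql), geometry_column = "geom")
}
```

Any filtering or joining done on `x` beforehand still runs in duckdb, and only the final result is materialised as an sf object.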

Users of terra, for whom the appropriate type would be a vect object, would do something similar. However, as you know, the spatial functions in these packages for vector data are designed for in-memory objects, so if a user wants to compute something like st_intersect() on a very large vector dataset, I think they are much better off doing it in duckdb (e.g. as in the example above) rather than reading it into a native format. Of course it would be great if packages like sf or terra could handle this automagically with lazy eval (kinda like the way dbplyr does), but in any event this all seems out of scope for duckdb-r, no?

hannes commented 8 months ago

We're going to pick this up and build all extensions for windows_amd64_rtools

krlmlr commented 6 months ago

Per https://github.com/duckdb/duckdb-r/issues/100#issuecomment-1980517821, this should work now? Can you confirm? Closing in favor of #100.
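For anyone wanting to confirm on their own machine, a minimal check could look like the following. The function name `spatial_available()` is hypothetical, and the first run needs network access to download the extension.

```r
# Hypothetical check: TRUE if the spatial extension installs, loads and
# answers a trivial query on this platform, FALSE if any of that errors.
spatial_available <- function() {
  con <- DBI::dbConnect(duckdb::duckdb())
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  tryCatch({
    DBI::dbExecute(con, "INSTALL spatial")
    DBI::dbExecute(con, "LOAD spatial")
    # A trivial spatial call to confirm the functions are registered
    res <- DBI::dbGetQuery(con, "SELECT ST_AsText(ST_Point(1, 2)) AS wkt")
    is.character(res$wkt) && length(res$wkt) == 1
  }, error = function(e) FALSE)
}
```

Calling `spatial_available()` in a fresh R session on Windows is the case of interest here.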

eitsupi commented 6 months ago

My original intent was that this issue is not about Windows but about the proper conversion between Geospatial and R types. Could you please reopen this?

krlmlr commented 6 months ago

Reopening, but the discussion got mixed up. I'd appreciate it if we could start a fresh discussion with the most important findings, up-to-date, summarized and linked here.

paleolimbot commented 6 months ago

Perhaps not a complete summary, but:

hannes commented 6 months ago

As far as I know spatial now builds for rtools

krlmlr commented 6 months ago

Yes, spatial is good now, confirmed by @carlopi. Opened a new issue to investigate the OP.

cboettig commented 5 months ago

Just wanted to confirm that spatial extension appears to be working nicely for Windows R users now too, at least as per our windows CI.