geoarrow / geoarrow-r

Extension types for geospatial data for use with 'Arrow'
http://geoarrow.org/geoarrow-r/
Apache License 2.0

Support RecordBatch with geoarrow #34

Closed JosiahParry closed 9 months ago

JosiahParry commented 9 months ago

Using some arrow-rs, geoarrow-rust, and extendr magic, I am able to return a RecordBatch containing a geoarrow array to R as a nanoarrow_array_stream. However, using geoarrow-r, I haven't been able to get it as a geoarrow array. I can use as.data.frame() to get it into a data.frame, but without a nice geometry column.

library(httr2)
devtools::load_all()
#> ℹ Loading serdesri
furl <- "https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest/services/USA_Counties_Generalized_Boundaries/FeatureServer/0"
url <- paste0(furl, "/query?where=1=1&outFields=*&f=json&resultRecordCount=100")
req <- httr2::request(url)
resp <- httr2::req_perform(req)
json <- httr2::resp_body_string(resp)

# parse body as RecordBatch
res <- parse_esri_json_raw_geoarrow(resp$body, 2)
res
#> <nanoarrow_array_stream struct<OBJECTID: int64, NAME: string, STATE_NAME: string, STATE_FIPS: string, FIPS: string, SQMI: double, POPULATION: int32, POP_SQMI: double, STATE_ABBR: string, COUNTY_FIPS: string, Shape__Area: double, Shape__Length: double, geometry: geoarrow.polygon{list<rings: list<vertices: fixed_size_list(2)<xy: double>>>}>>
#>  $ get_schema:function ()  
#>  $ get_next  :function (schema = x$get_schema(), validate = TRUE)  
#>  $ release   :function ()  

x <- as.data.frame(res)
#> Warning in warn_unregistered_extension_type(x): geometry: Converting unknown
#> extension geoarrow.polygon{list<rings: list<vertices: fixed_size_list(2)<xy:
#> double>>>} as storage type
#> Warning in warn_unregistered_extension_type(storage): geometry: Converting
#> unknown extension geoarrow.polygon{list<rings: list<vertices:
#> fixed_size_list(2)<xy: double>>>} as storage type
head(x$geometry)
#> <list_of<list_of<list_of<double>>>[6]>
#> [[1]]
#> <list_of<list_of<double>>[1]>
#> [[1]]
#> <list_of<double>[39]>
#> ... truncated it for everyone's sake

paleolimbot commented 9 months ago

Is there any chance that adding a requireNamespace("geoarrow") solves it? I'm wondering if the extension registration just didn't kick in.

JosiahParry commented 9 months ago

Will report back in the morning. I didn't try that!

JosiahParry commented 9 months ago

Loading geoarrow and then using as.data.frame() results in a session crash. I wish those were easier to debug!

paleolimbot commented 9 months ago

If you get serde_esri to a point where I can build it I'm happy to debug! Right now I get

clang -arch arm64 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/opt/R/arm64/lib -o serdesri.so entrypoint.o -L./rust/target/release -lserdesri -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
clang: error: no such file or directory: 'entrypoint.o'

...after a local checkout of arrow-extendr.

I could also share how I debug this kind of thing...I basically have the "CodeLLDB" extension in VSCode and use "Attach to process" from the command palette using Sys.getpid() from an R terminal (not RStudio!). I also have the following in my .Rprofile:

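lldb <- function(pkg = ".") {
  # Build a CodeLLDB "attach" URL that targets the current R process
  url <- sprintf(
    "vscode://vadimcn.vscode-lldb/launch/config?{'request':'attach','pid':%d}",
    Sys.getpid()
  )
  # Ask VS Code to open the URL, which attaches the debugger
  system(sprintf("code --open-url %s", shQuote(url)))

  # Optionally load the package under development
  if (!is.null(pkg)) {
    devtools::load_all(pkg)
  }
}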

...which basically means that I can type lldb() in any R terminal (again, not RStudio!) and then paste a reprex that might crash. I haven't tested that on anything except macOS or Windows, but I'm pretty sure CodeLLDB works on Linux, too.

paleolimbot commented 9 months ago

Something else that may help is doing arrow::as_arrow_table(<nanoarrow_array_stream>)$ValidateFull(). That will tell you if the arrays that you are expecting nanoarrow/geoarrow to convert are valid. (I expect that they are, and that this is a bug with the C/C++ in geoarrow-r).
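For reference, here is a minimal sketch of that validation step. It assumes recent releases of both arrow and nanoarrow are installed (the converter method lives in nanoarrow), and it uses a toy data frame as a stand-in for the real stream, so it only illustrates the call pattern:

```r
# Sketch only: the toy data frame below is a placeholder for the real
# nanoarrow_array_stream returned by the parser.
library(nanoarrow)

stream <- as_nanoarrow_array_stream(
  data.frame(id = 1:3, name = c("a", "b", "c"))
)

# Convert the stream to an arrow Table, then run the full validation pass
tbl <- arrow::as_arrow_table(stream)
valid <- tbl$ValidateFull()
```

If the conversion or validation fails, the error message should point at the offending column or buffer.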

JosiahParry commented 9 months ago

@paleolimbot is this method only available in the development version of arrow? Running this on 14.0.2 results in

arrow::as_arrow_table(res)
#> Error in `arrow::as_arrow_table()`:
#> ! No method for `as_arrow_table()` for object of class nanoarrow_array_stream

JosiahParry commented 9 months ago

The package should be installable if cloned: https://github.com/JosiahParry/serde_esri/tree/main/r. remotes::install_github() isn't working due to relative paths outside of the R package, which I'll have to figure out at a later point.

Edit: should be installable via remotes::install_github("josiahparry/serde_esri", subdir = "r") now

JosiahParry commented 9 months ago

Here we go! Assuming I've done everything correctly, this is valid arrow!

library(httr2)
library(serdesri)
furl <- "https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest/services/USA_Counties_Generalized_Boundaries/FeatureServer/0"
url <- paste0(furl, "/query?where=1=1&outFields=*&f=json&resultRecordCount=100")
req <- httr2::request(url)
resp <- httr2::req_perform(req)
json <- httr2::resp_body_string(resp)

res <- parse_esri_json_raw_geoarrow(resp$body, 2)
rdr <- arrow::as_record_batch_reader(res)
arrow::as_arrow_table(rdr)$Validate()
#> [1] TRUE

Created on 2024-02-04 with reprex v2.0.2

paleolimbot commented 9 months ago

Nice!

I can reproduce the crash, although I also sometimes get:

Error in geoarrow_schema_parse(schema) : 
  GeoArrowMetadataViewInit() failed: Expected valid GeoArrow JSON metadata but got '{"crs":null,"edges":null}'

Technically that is invalid metadata, although geoarrow-c should probably handle "crs": null by just pretending that it was omitted completely. I'm guessing geoarrow-rs is what gave this to you.
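To illustrate what "pretending that it was omitted" could look like, here is a rough R sketch that drops null-valued keys from the metadata JSON before it is parsed. The helper name drop_null_metadata() is hypothetical and is not part of geoarrow-r or geoarrow-c; it assumes the jsonlite package and that an empty JSON object is acceptable metadata:

```r
library(jsonlite)

# Hypothetical helper (not a geoarrow API): remove top-level keys whose
# value is JSON null, so '{"crs":null}' is treated like '{}'.
drop_null_metadata <- function(json) {
  meta <- fromJSON(json, simplifyVector = FALSE)
  meta <- Filter(Negate(is.null), meta)
  if (length(meta) == 0) {
    return("{}")
  }
  as.character(toJSON(meta, auto_unbox = TRUE))
}

cleaned <- drop_null_metadata('{"crs":null,"edges":null}')
```

This only handles top-level keys; a real fix would more likely belong in the metadata parser or in the producer's serializer.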

My guess is that there's something awry in nanoarrow's delegation of extension arrays to other packages (not trivial!) or geoarrow-r, and perhaps something about an error occurring during that process is causing the crash. I'll do some more debugging to see if I can get to the bottom of it!

paleolimbot commented 9 months ago

@paleolimbot is this method only available in the development version of arrow?

It was added in the brand-new version of nanoarrow, along with a number of other array/array_stream converter generics! (install.packages("nanoarrow")).

JosiahParry commented 9 months ago

Much appreciated! I'll take a look and see if I can set the CRS at minimum. I'm unsure how I'd be able to guess if the edges are spherical or not without processing the spatial reference and making that determination that way 🤔

JosiahParry commented 9 months ago

Technically that is invalid metadata, although geoarrow-c should probably handle "crs": null by just pretending that it was omitted completely. I'm guessing geoarrow-rs is what gave this to you.

FWIW, I think this can be resolved in the geoarrow-rs crate by adjusting the serialization method for the ArrayMetadata struct.

kylebarron commented 9 months ago

yeah it's invalid and I just haven't gotten around to fixing it

JosiahParry commented 9 months ago

Wowza!!! Looks great!!!