R-ArcGIS / arcpbf

Rust crate and R package for processing Esri Protocol Buffers
https://r.esri.com/arcpbf/
Apache License 2.0

bug: failure to parse `esriFieldTypeBlob` (arcpbf) #6

Open JWilliamsonArch opened 1 month ago

JWilliamsonArch commented 1 month ago

Hi!

I have noticed a bug in {arcgislayers}: arc_select() fails to load valid ArcGIS REST services.

Description

The {arcgislayers} package cannot load this ArcGIS REST service with arc_select(). The REST service is valid and opens in other software.

Reproducible Example

library(arcgis)
library(tidyverse)
library(sf)

furl <- "https://geonb.snb.ca/arcgis/rest/services/GeoNB_SNB_Municipal_Information/MapServer/1"

Municipal_Roads <- arc_open(furl) %>% 
  arc_select(.)

Version: RStudio 2024.04.2+764 "Chocolate Cosmos". Running under: Windows 11 Pro

Expected behavior

The features should load into R as sf linestring features.

The features fail to load, and the following error message is printed: "Error in multi_resp_process_(resps) :
User function panicked: multi_resp_process_"

JosiahParry commented 1 month ago

From what I can tell, there are 10614 features in this dataset. I cannot be sure how detailed they are either.

library(arcgis)

flayer <- arc_open("https://geonb.snb.ca/arcgis/rest/services/GeoNB_SNB_Municipal_Information/MapServer/1")

arc_select(flayer, returnCountOnly = "true") |> sum()

Let's say you have 1000 vertices on average for each of these features. Each vertex has two coordinates, and each coordinate is a double (a 64-bit float, so 8 bytes). That comes out to roughly 162 MB of memory for the geometry alone: fs::fs_bytes(8 * 2 * 1000 * 10614). That doesn't take into account the JSON that is used to transfer the data, the object IDs, etc.
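A back-of-the-envelope version of this estimate, counting each 64-bit double as 8 bytes. The function names are hypothetical, and the 1000-vertex average is only the guess made above, not a measured value.

```rust
/// Rough size in bytes of the raw coordinate data for a layer.
/// Ignores object IDs, attributes, and any transfer overhead.
fn geometry_bytes(features: u64, vertices_per_feature: u64, dims: u64) -> u64 {
    const BYTES_PER_DOUBLE: u64 = 8; // one 64-bit float
    features * vertices_per_feature * dims * BYTES_PER_DOUBLE
}

/// Convert a byte count to mebibytes for easier reading.
fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

With 10614 features, 1000 vertices each, and 2 dimensions, `geometry_bytes` returns 169_824_000 bytes, which `to_mib` reports as a little under 162 MiB.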

So I think what may be happening is that you are exhausting your memory and possibly also being rate limited by the feature service.

Do you need the entire feature service in memory? Or can you filter it down and limit the fields that you need?

JWilliamsonArch commented 1 week ago

Thanks for your response, and I'm sorry about my late reply. I used QGIS to download and export this dataset instead.

Discussions about the memory required to handle this dataset seem premature, as vectors like this road network are common in many professional GIS situations. This particular vector dataset is not extraordinarily large, and often, it is necessary to view the whole table. My bug report was specific, but the problem is general.

Are there any settings in my R environment that I could change that might allow this package to handle this dataset type? I will also look for other R solutions to this problem.

elipousson commented 1 week ago

It looks like there is a specific issue with the "SE_ANNO_CA" column:

library(arcgis)
#> Attaching core arcgis packages:
#> → arcgisutils v0.3.0
#> → arcgislayers v0.3.0.9000
#> → arcgisgeocode v0.2.1
#> → arcgisplaces v0.1.0
library(tidyverse)
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

furl <- "https://geonb.snb.ca/arcgis/rest/services/GeoNB_SNB_Municipal_Information/MapServer/1"

Municipal_Roads <- arc_open(furl) %>% 
  arc_select(
    fields =  "SE_ANNO_CA"
  )
#> Iterating ■■■■■■                            17% | ETA:  6s
#> Iterating ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100% | ETA:  0s
#> Error in multi_resp_process_(resps): User function panicked: multi_resp_process_

Created on 2024-09-12 with reprex v2.1.1

JosiahParry commented 1 week ago

Oh interesting! This is a blob field! This should actually be a pretty easy fix. I genuinely didn't expect to ever run into blob data. We can see below that the blob type isn't handled. This just needs to be captured as a raw vector.

https://github.com/R-ArcGIS/arcpbf/blob/a9231e25c0c8ace8886f4d58b7339a278b28b3d8/src/rust/arcpbf/src/parse.rs#L147
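A minimal sketch of what "captured as a raw vector" could look like. `FieldValue`, `RValue`, and `to_r_value` are hypothetical stand-ins made up for illustration; they are not the types or functions in arcpbf's parse.rs.

```rust
// Hypothetical decoded attribute value (not arcpbf's real type).
#[derive(Debug, Clone, PartialEq)]
enum FieldValue {
    Null,
    Integer(i64),
    Text(String),
    Blob(Vec<u8>),
}

// Hypothetical R-side representation of a single value.
#[derive(Debug, PartialEq)]
enum RValue {
    Na,
    Integer(i64),
    Character(String),
    Raw(Vec<u8>),
}

// Map each decoded value to its R-side counterpart.
fn to_r_value(value: FieldValue) -> RValue {
    match value {
        FieldValue::Null => RValue::Na,
        FieldValue::Integer(i) => RValue::Integer(i),
        FieldValue::Text(s) => RValue::Character(s),
        // The blob case: pass the bytes through unchanged as a raw vector.
        FieldValue::Blob(bytes) => RValue::Raw(bytes),
    }
}
```

The idea is that a blob needs no decoding at all: the bytes are handed to R unchanged, and a null entry stays a missing value.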

JosiahParry commented 1 week ago

This is partially addressed in this branch: https://github.com/R-ArcGIS/arcpbf/tree/blob.

I am not able to find any public feature services with non-null blob values, so I am unsure how to process them. At present this will detect any non-null blob entries and provide a warning message if they are encountered.
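The detect-and-warn behaviour described here could be sketched roughly as follows. Both helpers and the warning text are hypothetical and are not taken from the branch; a blob column is modelled as a slice of optional byte vectors.

```rust
// Hypothetical helper: count blob entries that actually carry data.
fn count_non_null_blobs(column: &[Option<Vec<u8>>]) -> usize {
    column.iter().filter(|value| value.is_some()).count()
}

// Hypothetical helper: build a warning only when non-null blobs are present.
fn blob_warning(field: &str, column: &[Option<Vec<u8>>]) -> Option<String> {
    match count_non_null_blobs(column) {
        0 => None, // all entries are null, nothing to report
        n => Some(format!(
            "field `{field}` contains {n} non-null blob value(s) that were not parsed"
        )),
    }
}
```

For a column that is entirely null, as in this layer, `blob_warning` returns `None` and the query proceeds silently.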

library(arcgislayers)

furl <- "https://geonb.snb.ca/arcgis/rest/services/GeoNB_SNB_Municipal_Information/MapServer/1"

arc_open(furl) |> 
  arc_select(
    fields =  "SE_ANNO_CA"
  )
#> Iterating ■■■■■■                            17% | ETA:  6s
#> Iterating ■■■■■■■■■■■                       33% | ETA:  3s
#> Iterating ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100% | ETA:  0s
#> Simple feature collection with 10614 features and 1 field
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 2311937 ymin: 7288878 xmax: 2710576 ymax: 7674195
#> Projected CRS: NAD83(CSRS) / New Brunswick Stereographic
#> First 10 features:
#>    SE_ANNO_CA                       geometry
#> 1             MULTILINESTRING ((2537588 7...
#> 2             MULTILINESTRING ((2375380 7...
#> 3             MULTILINESTRING ((2589807 7...
#> 4             MULTILINESTRING ((2493805 7...
#> 5             MULTILINESTRING ((2557756 7...
#> 6             MULTILINESTRING ((2477038 7...
#> 7             MULTILINESTRING ((2627394 7...
#> 8             MULTILINESTRING ((2475265 7...
#> 9             MULTILINESTRING ((2524728 7...
#> 10            MULTILINESTRING ((2425289 7...

JosiahParry commented 1 week ago

Note that {arcpbf} is in its 8th day pending CRAN manual checks. This will only be addressed following a decision from CRAN. Moving this issue to the {arcpbf} repo.

JWilliamsonArch commented 5 days ago

I found that this package could pull down other similar-sized datasets, so the problem does seem to be with the esri blobs as a data type.

I also found that the arcpullr package can download this layer, even with the Esri blob field in place, which is worth mentioning for other people running into this issue.

JosiahParry commented 5 days ago

It may also be worth noting that arcpullr is orders of magnitude slower than arcgislayers. It works because in this feature layer the blobs are entirely null, with no actual binary data. arcpullr uses JSON, whereas arcpbf uses protocol buffers, which have roughly a tenth of the memory footprint. They are also processed in Rust.