R-ArcGIS / arcpbf

Rust crate and R package for processing Esri Protocol Buffers
https://r.esri.com/arcpbf/
Apache License 2.0

arc_select fails on large dataset #2

Closed ryanzomorrodi closed 4 months ago

ryanzomorrodi commented 4 months ago

Describe the bug

arc_select() crashes when downloading a large dataset. I'm not sure whether this is truly connected to the dataset being large or to some other aspect of the data, but I can confirm that I am able to download other, smaller datasets.

To Reproduce

Use arc_open() to open a feature service, then arc_select() to download it and return it as an sf object.

library(magrittr)
library(arcgis)

options(RUST_BACKTRACE=1)

PRCP_pred <- "https://services.arcgis.com/GL0fWlNkwysZaKeV/arcgis/rest/services/TXLA_ZCTA_PRCPpred/FeatureServer/0" %>%
    arc_open() %>%
    arc_select()
#> thread '<unnamed>' panicked at arcpbf\src\parse.rs:11:22:
#> internal error: entered unreachable code
#> thread '<unnamed>' panicked at arcpbf\src\lib.rs:156:1:
#> explicit panic
#> Error in multi_resp_process_(resps) :
#>   User function panicked: multi_resp_process_

I made the feature service public, so you should be able to try downloading it yourself.

Expected behavior

I expected the feature service to be downloaded and stored as an sf object. Annoyingly, the error only happens after it seems to have downloaded the entire feature service.

Additional context

R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64

JosiahParry commented 4 months ago

Well here is the good news. I can reproduce the bug with only one feature! Bad news, there's a bug 😲. I'm sorry about that. I'll work on it.

JosiahParry commented 4 months ago

What is really interesting is that the protocol buffer says this field is a small integer, but the field type is actually a date and it is not being processed as such.

I'm not sure if this is a bug in the feature service or the library to be honest!

Field type is: EsriFieldTypeOid
Field type is: EsriFieldTypeString
Field type is: EsriFieldTypeSmallInteger
Value { value_type: Some(StringValue("2017-08-01")) }
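
To make the mismatch in the log above concrete, here is a minimal R sketch. It is not the arcpbf code (which is written in Rust); `decode_strict()` and `decode_tolerant()` are hypothetical names made up for illustration, and the field type labels are copied from the log output. It shows how a decoder that trusts the declared field type fails on a string payload, and one possible way to tolerate the disagreement.

```r
# Hypothetical strict decoder: trusts the declared field type completely.
decode_strict <- function(declared, value) {
  switch(declared,
    EsriFieldTypeOid = ,
    EsriFieldTypeSmallInteger = {
      if (!is.numeric(value)) {
        stop("declared ", declared, " but got a ", class(value)[1], " value")
      }
      as.integer(value)
    },
    EsriFieldTypeString = as.character(value),
    stop("unsupported field type: ", declared)
  )
}

# Hypothetical tolerant decoder: if an integer field carries a string,
# keep the string instead of aborting the whole response.
decode_tolerant <- function(declared, value) {
  if (declared == "EsriFieldTypeSmallInteger" && is.character(value)) {
    return(value)
  }
  decode_strict(declared, value)
}

# The combination seen in the log: small integer field, string date value.
strict_result <- tryCatch(
  decode_strict("EsriFieldTypeSmallInteger", "2017-08-01"),
  error = function(e) conditionMessage(e)
)
tolerant_result <- decode_tolerant("EsriFieldTypeSmallInteger", "2017-08-01")
```

Whether the real fix falls back like this is not stated in this thread, so treat the tolerant branch as an assumption.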

JosiahParry commented 4 months ago

``` r
library(arcgis)
#> Attaching core arcgis packages:
#> → arcgisutils v0.3.0
#> → arcgislayers v0.3.0
#> → arcgisgeocode v0.1.3
#> → arcgisplaces v0.1.0

PRCP_pred <- "https://services.arcgis.com/GL0fWlNkwysZaKeV/arcgis/rest/services/TXLA_ZCTA_PRCPpred/FeatureServer/0" |> 
    arc_open() |> 
    arc_select(n_max = 100)

PRCP_pred 
#> Simple feature collection with 100 features and 6 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -10171550 ymin: 3365293 xmax: -9906416 ymax: 3576996
#> Projected CRS: WGS 84 / Pseudo-Mercator
#> First 10 features:
#>    fid GEOID                DATE PRCPpred Shape__Area Shape__Length
#> 1    1 70001 2017-07-31 20:00:00        0    20781247      27489.86
#> 2    2 70002 2017-07-31 20:00:00        0    11972609      14513.87
#> 3    3 70003 2017-07-31 20:00:00        0    24840919      31051.86
#> 4    4 70005 2017-07-31 20:00:00        0    15136408      20291.43
#> 5    5 70006 2017-07-31 20:00:00        0     9506517      13857.12
#> 6    6 70030 2017-07-31 20:00:00        0   126821239      72154.08
#> 7    7 70031 2017-07-31 20:00:00        0    12218343      16125.83
#> 8    8 70032 2017-07-31 20:00:00        0     6061464      13245.77
#> 9    9 70036 2017-07-31 20:00:00        0    17300199      30751.63
#> 10  10 70037 2017-07-31 20:00:00        0   233311656     137159.07
#>                          geometry
#> 1  POLYGON ((-10041429 3500168...
#> 2  POLYGON ((-10038862 3506457...
#> 3  POLYGON ((-10045203 3500427...
#> 4  POLYGON ((-10035132 3501549...
#> 5  POLYGON ((-10042042 3507365...
#> 6  POLYGON ((-10077192 3475608...
#> 7  POLYGON ((-10054588 3494937...
#> 8  POLYGON ((-10019921 3497074...
#> 9  POLYGON ((-10035782 3468335...
#> 10 POLYGON ((-10027180 3474258...
```

JosiahParry commented 4 months ago

I've gone ahead and pushed a change to the R package which should be available as a binary within the next hour or so.

https://r-arcgis.r-universe.dev/arcpbf

JosiahParry commented 4 months ago

@ryanzomorrodi a new version hit CRAN this morning. Please let me know if this works for you!

ryanzomorrodi commented 4 months ago

It seems like there is a different error caused by x being null in post_process_single.

library(arcgislayers)

PRCP_pred <- "https://services.arcgis.com/GL0fWlNkwysZaKeV/arcgis/rest/services/TXLA_ZCTA_PRCPpred/FeatureServer/0" |> 
    arc_open() |> 
    arc_select(n_max = 1000)
#> Error in x[[1]]: subscript out of bounds

Created on 2024-07-10 with reprex v2.1.1

JosiahParry commented 4 months ago

hm. I'm not running into this issue.

What versions of arcgislayers and arcgisutils are you running?

packageVersion("arcgislayers")
packageVersion("arcgisutils")

JosiahParry commented 4 months ago

Perhaps you can run sessionInfo() as well

ryanzomorrodi commented 4 months ago

I have version 0.3.0 of both arcgislayers and arcgisutils. It may also help to omit n_max and see if you can reproduce it that way. I tried to reproduce the error again today with n_max = 1000, and everything works. When pulling the entire layer, I eventually encounter the error. I should also mention that the httr2 progress bar seems to have disappeared.

library(arcgislayers)

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/Chicago
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] arcgislayers_0.3.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.36     fastmap_1.2.0     xfun_0.45         glue_1.7.0       
#>  [5] knitr_1.47        htmltools_0.5.8.1 rmarkdown_2.27    lifecycle_1.0.4  
#>  [9] cli_3.6.3         reprex_2.1.1      withr_3.0.0       compiler_4.4.1   
#> [13] tools_4.4.1       evaluate_0.24.0   yaml_2.3.8        arcgisutils_0.3.0
#> [17] rlang_1.1.4       fs_1.6.4

JosiahParry commented 4 months ago

Thanks @ryanzomorrodi for your help here! This helped me identify a regression in arc_select(), which is being fixed (with a test added so it doesn't happen in the future).

When working with detailed and large geometries, it is suggested to drop the page_size argument down to something much smaller. When arc_select() is run, it checks the property x[["maxRecordCount"]] to identify the maximum number of features that can be returned per request.

With detailed geometries, AGOL/Enterprise can actually time out before it has prepared the geometries to be sent. To prevent this, we reduce the number of geometries sent per request. In this case, I ran the code below, which worked quite well and quite fast!

library(arcgislayers)

x <- "https://services.arcgis.com/GL0fWlNkwysZaKeV/arcgis/rest/services/TXLA_ZCTA_PRCPpred/FeatureServer/0" |> 
    arc_open() 

res <- x |> 
    arc_select(n_max = 25000, page_size = 10)

Note that you will need to install the development version of {arcgislayers} for this.
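
For a rough sense of what a smaller page_size means in terms of requests, here is a base R sketch. `page_offsets()` is a hypothetical helper, not part of {arcgislayers}, and the default `max_record_count` of 2000 is an assumed value standing in for whatever x[["maxRecordCount"]] reports. It assumes offset-based paging, where each request asks for `count` features starting at `offset`.

```r
# Hypothetical helper: split `n` features into pages of at most `page_size`.
page_offsets <- function(n, page_size, max_record_count = 2000) {
  stopifnot(n >= 0, page_size >= 1)
  # A page can never be larger than what the service allows per request.
  page_size <- min(page_size, max_record_count)
  if (n == 0) {
    return(data.frame(offset = numeric(), count = numeric()))
  }
  offsets <- seq(0, n - 1, by = page_size)
  # The last page may be shorter than the others.
  counts <- pmin(page_size, n - offsets)
  data.frame(offset = offsets, count = counts)
}

# 25 features in pages of 10 gives three requests: 10, 10, and 5 features.
pages <- page_offsets(n = 25, page_size = 10)
pages

# The call above (n_max = 25000, page_size = 10) would need this many requests.
n_requests <- nrow(page_offsets(n = 25000, page_size = 10))
```

The trade-off is more round trips in exchange for each response being small enough for the server to prepare before it times out.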

That said, looking at this service, there is a good chance that it is far too big to hold entirely in memory, so you may want to limit the scope of your queries using the fields, where, or filter_geom arguments.
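
As a sketch of what a narrower query might look like: the column names below come from the printed output earlier in this thread, while the where clause is only an illustration and assumes GEOID is stored as text. The request itself is guarded so that it only runs in an interactive session with {arcgislayers} installed.

```r
url <- "https://services.arcgis.com/GL0fWlNkwysZaKeV/arcgis/rest/services/TXLA_ZCTA_PRCPpred/FeatureServer/0"

# Only request the columns that are needed.
wanted_fields <- c("GEOID", "DATE", "PRCPpred")

# Assumed filter: ZCTAs whose GEOID starts with "700" (SQL LIKE syntax).
where_clause <- "GEOID LIKE '700%'"

# Only contact the service interactively and when the package is available.
if (interactive() && requireNamespace("arcgislayers", quietly = TRUE)) {
  layer <- arcgislayers::arc_open(url)
  res <- arcgislayers::arc_select(
    layer,
    fields = wanted_fields,
    where = where_clause,
    page_size = 10
  )
}
```

A spatial filter via filter_geom would follow the same pattern; it is left out here to keep the sketch free of extra dependencies.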

ryanzomorrodi commented 4 months ago

Thanks @JosiahParry ! Excited to see where these arcgis packages go.