R-ArcGIS / arcgisgeocode

Utilize public or private ArcGIS Geocoder Services from R. Provides reverse geocoding, candidate search, single address, and batch geocoding.
http://r.esri.com/arcgisgeocode/
Apache License 2.0

Jobs in multiple batches aren't processed correctly #25

Closed aaronkrusniak closed 4 months ago

aaronkrusniak commented 4 months ago

Did some more stress testing today and came across this bug: it appears that jobs exceeding the max batch size for a geocoder will fail in one of two ways, either erroring out with a mismatch in the number of rows, or returning a frame of entirely empty results along with repeated warnings.

Here's an illustration:

library(arcgis)
library(arcgisbinding)
library(arcgisgeocode)
library(tidyverse)

arc.check_portal()
set_arc_token(auth_binding())

# Some dummy data:
music_venues <- tribble(
  ~Name,              ~Address,
  "Aragon Ballroom",  "1106 W. Lawrence Ave.",
  "House of Blues",   "329 N. Dearborn St.",
  "Bottom Lounge",    "1375 W. Lake St.",
  "The Vic",          "3145 N. Sheffield Ave.",
  "Park West",        "322 W. Armitage Ave.",
  "Thalia Hall",      "1807 S. Allport St.",
  "Lincoln Hall",     "2424 N. Lincoln Ave.",
  "Schubas Tavern",   "3159 N. Southport Ave."
)

# Set up to use our geocoder:
service <- "https://maps.chicago.gov/arcgis/rest/services/Chicago_Addresses/GeocodeServer"
chicago_geocoder <- geocode_server(service)

# Check max batch size:
message(paste0("Max batch size is: ",
               chicago_geocoder[["locatorProperties"]][["MaxBatchSize"]]))

#> Max batch size is: 1000

# Create datasets of varying sizes...

# Just under max batch size (992 rows):
short <- music_venues %>% slice(rep(1:n(), each = 124))

# Exactly at max batch size (1000 rows):
exact <- music_venues %>% slice(rep(1:n(), each = 125))

# Exceeding max batch size (1008 rows):
long <- music_venues %>% slice(rep(1:n(), each = 126))

# Way over max batch size, but an even multiple of 1000 (8000 rows):
mega <- music_venues %>% slice(rep(1:n(), each = 1000))

# Try to geocode each size:
#####-----------------------------------------#####

results.short <- geocode_addresses(single_line = short$Address,
                                   geocoder = chicago_geocoder)
results.short
#> Simple feature collection with 992 features and 61 fields...

#####-----------------------------------------#####

results.exact <- geocode_addresses(single_line = exact$Address,
                                   geocoder = chicago_geocoder)
results.exact
#> Simple feature collection with 1000 features and 61 fields...

#####-----------------------------------------#####

results.long  <- geocode_addresses(single_line = long$Address,
                                   geocoder = chicago_geocoder)
#> Error in data.frame(..., check.names = FALSE) : 
#>   arguments imply differing number of rows: 1000, 1008

#####-----------------------------------------#####

results.mega  <- geocode_addresses(single_line = mega$Address,
                                   geocoder = chicago_geocoder)
#> Warning messages:
#> 1: In data.frame(..., check.names = FALSE) :
#>   row names were found from a short variable and have been discarded
#> 2: In data.frame(..., check.names = FALSE) :
#>   row names were found from a short variable and have been discarded
#> 3: In data.frame(..., check.names = FALSE) :
#>   row names were found from a short variable and have been discarded
#> 4: In data.frame(..., check.names = FALSE) :
#>   row names were found from a short variable and have been discarded
#> 5: In data.frame(..., check.names = FALSE) :
#>   row names were found from a short variable and have been discarded
#> 6: In data.frame(..., check.names = FALSE) :
#>   row names were found from a short variable and have been discarded
#> 7: In data.frame(..., check.names = FALSE) :
#>   row names were found from a short variable and have been discarded
#> 8: In data.frame(..., check.names = FALSE) :
#>   row names were found from a short variable and have been discarded

results.mega
#> Simple feature collection with 64000 features and 61 fields...
# (note: all rows are empty in this result)

Additionally, I'm sometimes getting that warning ("In data.frame(..., check.names = FALSE): row names were found from a short variable and have been discarded") on any job, even ones smaller than the max batch size. Whenever this happens, the whole data frame comes back with empty results. I can't seem to create a reprex for that particular issue. I've noticed that re-running arc.check_portal() and set_arc_token(auth_binding()) seems to resolve it, so maybe my portal authorization is just timing out? But sometimes I can go 20 minutes without running into a problem, and other times I can't even go 2 minutes before it happens. Not sure what's going on, but happy to drop an issue in the arcgisbinding repo if it ends up being better suited there.
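
A minimal sketch of that workaround, assuming the intermittent failures really are caused by an expired token (the wrapper function is hypothetical, not part of the package):

# Hypothetical helper: re-authenticate against the portal right before each batch job
geocode_with_fresh_token <- function(addresses, geocoder) {
  arc.check_portal()
  set_arc_token(auth_binding())
  geocode_addresses(single_line = addresses, geocoder = geocoder)
}

results.short <- geocode_with_fresh_token(short$Address, chicago_geocoder)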

Thanks!

JosiahParry commented 4 months ago

This seems to be an issue with either your geocoder or the custom JSON parsing. I can't repro it on the World Geocoder. I'll keep looking!

No errors or warnings with the following:

library(arcgis)
library(arcgisgeocode)
library(dplyr)
library(tibble)

set_arc_token(auth_user())

# Some dummy data:
music_venues <- tribble(
  ~Name,              ~Address,
  "Aragon Ballroom",  "1106 W. Lawrence Ave.",
  "House of Blues",   "329 N. Dearborn St.",
  "Bottom Lounge",    "1375 W. Lake St.",
  "The Vic",          "3145 N. Sheffield Ave.",
  "Park West",        "322 W. Armitage Ave.",
  "Thalia Hall",      "1807 S. Allport St.",
  "Lincoln Hall",     "2424 N. Lincoln Ave.",
  "Schubas Tavern",   "3159 N. Southport Ave."
)

# Just under max batch size (992 rows):
short <- music_venues |> slice(rep(1:n(), each = 124))

long <- sample_n(music_venues, 1008, replace = TRUE)
res_long <- geocode_addresses(long$Address)

mega <- sample_n(music_venues, 4000, replace = TRUE)
res_mega <- geocode_addresses(mega$Address)
res_mega

aaronkrusniak commented 4 months ago

Darn— that does make sense though! Thanks for checking, I'll dig into our internal geocoder from my end and see if I find anything useful.

JosiahParry commented 4 months ago

Ah, I see where it's going wrong.

When parsing the custom JSON we need to pre-allocate vectors. I am pre-allocating based on n (the total number of features) rather than the size of the chunk itself! Instead of passing n, I need to pass the chunk size.
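
To illustrate the idea in R (a simplified sketch of the batching logic, not the package's actual Rust internals): each batch's result should be allocated to the length of that chunk; sizing it by n is what inflated the 8,000-row job into 8 * 8,000 = 64,000 empty features.

# Simplified sketch, reusing music_venues from above (8000 rows, 1000 per batch)
addresses <- rep(music_venues$Address, each = 1000)
n <- length(addresses)
batch_size <- 1000

# Split the job into chunks of at most batch_size addresses
chunk_id <- ceiling(seq_len(n) / batch_size)
chunks <- split(addresses, chunk_id)

results <- vector("list", length(chunks))
for (i in seq_along(chunks)) {
  chunk <- chunks[[i]]
  # Correct: allocate length(chunk) rows for this chunk.
  # Allocating n rows per chunk is what yields 8 * 8000 = 64000 features.
  results[[i]] <- data.frame(single_line = chunk)
}

nrow(do.call(rbind, results))
#> [1] 8000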

JosiahParry commented 4 months ago

Another issue I think we're encountering is that there might actually be an error in the JSON, but since we're parsing any JSON that comes our way, we're not actually capturing the fact that an error is occurring.
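
For example (a hypothetical check, not the package's implementation), ArcGIS REST endpoints can return an error object in the body with an HTTP 200 status, so the response needs to be inspected before it is parsed as geocode results:

library(jsonlite)

# Hypothetical helper: fail loudly if the service returned an error payload
check_geocode_json <- function(json_string) {
  parsed <- fromJSON(json_string, simplifyVector = FALSE)
  if (!is.null(parsed$error)) {
    stop("Geocode request failed (code ", parsed$error$code, "): ",
         parsed$error$message, call. = FALSE)
  }
  parsed
}

check_geocode_json('{"error": {"code": 498, "message": "Invalid token."}}')
#> Error: Geocode request failed (code 498): Invalid token.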

JosiahParry commented 4 months ago

@aaronkrusniak Do you have Rust available on your machine? If so, could you test by installing this branch: https://github.com/R-ArcGIS/arcgisgeocode/tree/custom_loc_batches ?
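
For reference, one way to install from that branch (assuming a Rust toolchain is on the PATH and the remotes package is available):

# install.packages("remotes")
remotes::install_github("R-ArcGIS/arcgisgeocode", ref = "custom_loc_batches")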

aaronkrusniak commented 4 months ago

I do not, but this is a great excuse for me to try to get it; I've been meaning to start dipping a toe into Rust. If I'm able to set it up on my organization PC, I'll give it a shot and let you know!

aaronkrusniak commented 4 months ago

Alright, looks like I'll be able to set Rust up but it's going to require IT approval, which usually takes 2-7 days at my org. If there's a faster way you'd like me to try to get you feedback, let me know @JosiahParry!

JosiahParry commented 4 months ago

Sounds good! I can merge to main and revert it if we need to as well. Nbd

JosiahParry commented 4 months ago

I've merged the branch into main and bumped the version. There should be a new r-universe build shortly!
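
Once the build is up, updating should look something like this (assuming the package is published on the R-ArcGIS r-universe):

install.packages(
  "arcgisgeocode",
  repos = c("https://r-arcgis.r-universe.dev", "https://cloud.r-project.org")
)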

aaronkrusniak commented 4 months ago

Just updated and it's working as expected now, thanks!

JosiahParry commented 4 months ago

@aaronkrusniak wooooot!!! Keep the feedback coming. This is very helpful