bluebonnet-data / bbd_r

bbd package for R

Geocoding #1

Open inh2102 opened 3 years ago

inh2102 commented 3 years ago

I've done a lot of geocoding, so I'm happy to help spearhead developing that, preferably in R first; it can then be adapted for Python!

Anyone else?

inh2102 commented 3 years ago

If we want to make it a nice wrapper around Nominatim, osmdata is a great R package for that. Here's some sample code I recently used and shared with another fellow:

library(tidyverse) #install.packages("tidyverse")
library(osmdata) #install.packages("osmdata")
library(leaflet) #install.packages("leaflet")

df <- read_csv("address_file.csv") # Expects a `location` column of address strings

locations <- df$location # Vector of addresses
longlat <- lapply(locations, function(x) {
  Sys.sleep(1) # Nominatim's usage policy caps requests at 1 per second; exceeding it can get an IP blocked
  osmdata::getbb(x)[, 1] # "min" column of the bounding box: c(x = min longitude, y = min latitude)
}) %>% do.call(rbind, .)
# d holds each address with the longitude (x) and latitude (y) of its bounding box's minimum corner.
d <- data.frame(locations, longlat)
d <- d[!is.na(d$x) & !is.na(d$y), ] # Drop rows with missing coordinates

# Mapping

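# Semi-transparent circle markers, one per geocoded address
leaflet(cbind(d$x, d$y)) %>% # Two-column matrix: longitude first, then latitude
  addTiles() %>%
  addCircleMarkers(
    color = "darkred", fill = TRUE,
    stroke = FALSE, fillOpacity = 0.2)

# Clustered markers with the address as a popup
leaflet(cbind(d$x, d$y)) %>%
  addTiles() %>%
  addMarkers(clusterOptions = markerClusterOptions(), popup = d$locations)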
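
One thing to be aware of: `getbb(x)[, 1]` picks the minimum-longitude, minimum-latitude corner of the bounding box rather than a point inside it. If a central point is preferred, one option is to average the two corners. A small sketch, assuming the default 2x2 matrix layout that `getbb` returns (rows `x` and `y`, columns `min` and `max`); `bbox_center` is a hypothetical helper name and the sample box is made-up illustration data:

```r
# Hypothetical helper: midpoint of a bounding-box matrix with rows "x" (longitude)
# and "y" (latitude) and columns "min" and "max".
bbox_center <- function(bb) {
  rowMeans(bb) # averages min and max within each row -> c(x = lon, y = lat)
}

# Made-up bounding box for illustration only (not a real lookup result).
bb <- matrix(c(-98.0, 30.0, -97.5, 30.5), nrow = 2,
             dimnames = list(c("x", "y"), c("min", "max")))

bb[, 1]          # corner used in the snippet above: x = -98, y = 30
bbox_center(bb)  # midpoint: x = -97.75, y = 30.25
```

Inside the `lapply` call, `osmdata::getbb(x)[, 1]` could then be swapped for `bbox_center(osmdata::getbb(x))`; the rest of the code should not need to change, since the result is still a vector named `x` and `y`.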

nprezant commented 3 years ago

Nice! I think this would be super useful!

I'd be inclined to separate the geocoding from the mapping, and allow it to use cached results (so that the caller doesn't have to wait 1 second per address every time they want to generate this map, and so that we are courteous to the Nominatim service). What do you think of something like this?

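geocode <- function(addresses, cache_file) {

    # If the caller passes in a path to a cached data file and that file already exists,
    # just return the data in that file instead of requesting it from the API again.
    if (!missing(cache_file) && file.exists(cache_file)) {
        # Read in the cached data.
        cached <- read.csv(cache_file, stringsAsFactors = FALSE)

        # If the addresses in the cached file match the addresses requested, simply
        # return the dataframe.
        if (identical(as.character(cached$address), as.character(addresses))) {
            return(cached)
        }

        # Otherwise, the addresses do not match the ones requested, so we'll need to
        # re-geocode. Continue on.
    }

    # Do the geocoding (mapping address --> lat/long like you have above)
    longlat <- do.call(rbind, lapply(addresses, function(x) {
        Sys.sleep(1) # Stay within Nominatim's limit of 1 request per second
        osmdata::getbb(x)[, 1]
    }))
    df <- data.frame(address = addresses,
                     latitude = longlat[, "y"],
                     longitude = longlat[, "x"],
                     row.names = NULL, stringsAsFactors = FALSE)

    # If the caller passed in a place to cache the file, then write it out to a csv
    if (!missing(cache_file)) {
        write.csv(df, cache_file, row.names = FALSE)
    }

    # Return the dataframe of Address, Latitude, and Longitude
    return(df)
}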
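
A possible refinement of the all-or-nothing cache check above is to cache per address, so that only addresses missing from the cache are sent to the API. The following is a minimal sketch under stated assumptions, not a settled design: `geocode_cached`, `lookup`, and `delay` are hypothetical names. `lookup` stands in for a real geocoder call and is assumed to return a numeric vector with elements named `x` (longitude) and `y` (latitude).

```r
# Hypothetical helper: geocode only the addresses that are not already cached.
# `lookup` is an assumed stand-in for a real geocoder call; it takes one address
# and returns a numeric vector with elements named "x" (longitude) and "y" (latitude).
geocode_cached <- function(addresses, cache_file, lookup, delay = 1) {
  empty <- data.frame(address = character(0), latitude = numeric(0),
                      longitude = numeric(0), stringsAsFactors = FALSE)

  # Load previous results if a cache file exists; otherwise start empty.
  cache <- if (file.exists(cache_file)) {
    read.csv(cache_file, stringsAsFactors = FALSE)
  } else {
    empty
  }

  # Only addresses that are not yet cached need a request.
  todo <- setdiff(unique(addresses), cache$address)
  fresh <- lapply(todo, function(a) {
    Sys.sleep(delay) # pause before each request to respect the rate limit
    xy <- lookup(a)
    data.frame(address = a, latitude = xy[["y"]], longitude = xy[["x"]],
               stringsAsFactors = FALSE)
  })

  # Append the new rows and save the cache if anything changed.
  if (length(fresh) > 0) {
    cache <- rbind(cache, do.call(rbind, fresh))
    write.csv(cache, cache_file, row.names = FALSE)
  }

  # Return one row per requested address, in the order requested.
  out <- cache[match(addresses, cache$address), ]
  rownames(out) <- NULL
  out
}
```

With the approach from earlier in the thread, a call might look like `geocode_cached(df$location, "geocode_cache.csv", function(a) osmdata::getbb(a)[, 1])`, where the cache file name is only an example. Passing the lookup in as a function also makes it possible to test the caching logic without touching the network.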

inh2102 commented 3 years ago

@boom-roasted sounds perfect! Agreed on separating mapping + geocoding. I'll have some time to contribute to the actual code next week. If anyone else sees this and wants to help, add a comment!