DOI-USGS / nhdplusTools

See official repository at: https://code.usgs.gov/water/nhdplusTools
https://doi-usgs.github.io/nhdplusTools/
Creative Commons Zero v1.0 Universal

Using `memoise` to enhance performance and reduce network traffic #366

Closed rburghol closed 7 months ago

rburghol commented 12 months ago

This example is based on some code that @dblodgett-usgs shared in an issue on caching NWIS queries: https://github.com/DOI-USGS/dataRetrieval/issues/681

Setup

Essentially, one selects a data directory to store caches and writes wrappers around nhdplusTools queries using the memoise package. I used a 1-year timeout since I figured that this data changes slowly, but whether that assumption holds is less important than the technique:

dir <- "/media/model/usgs/cache"
db <- memoise::cache_filesystem(dir)
one_day <- 24*60^2
one_year <- 365 * one_day
memo_get_nhdplus <- memoise::memoise(nhdplusTools::get_nhdplus, ~memoise::timeout(one_year), cache = db)
memo_get_UT <- memoise::memoise(nhdplusTools::get_UT, ~memoise::timeout(one_year), cache = db)
memo_plot_nhdplus <- memoise::memoise(nhdplusTools::plot_nhdplus, ~memoise::timeout(one_year), cache = db)
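The three wrappers above repeat the same timeout and cache arguments. A minimal sketch of one way to factor that out, assuming a hypothetical helper named `memo_wrap` and a temporary cache directory (both are illustrations, not part of nhdplusTools). A local stand-in function replaces the network query so the sketch runs offline:

```r
library(memoise)

# Hypothetical cache location; a temp dir keeps the sketch self-contained.
cache_dir <- file.path(tempdir(), "nhd_cache_demo")
db <- memoise::cache_filesystem(cache_dir)
one_year <- 365 * 24 * 60^2

# Hypothetical helper: memoise any function with the same cache and timeout.
memo_wrap <- function(f, seconds = one_year, cache = db) {
  memoise::memoise(f, ~memoise::timeout(seconds), cache = cache)
}

# Stand-in for a slow web query; counts how often it really runs.
calls <- 0
slow_lookup <- function(id) {
  calls <<- calls + 1
  Sys.sleep(0.2)
  paste("feature", id)
}

memo_lookup <- memo_wrap(slow_lookup)
first  <- memo_lookup(101)  # runs slow_lookup and stores the result
second <- memo_lookup(101)  # answered from the filesystem cache
```

With nhdplusTools installed, the same helper could presumably be applied as `memo_wrap(nhdplusTools::get_nhdplus)`.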

Retrieving point and basin info

After the initial setup, repeated calls to these functions within the timeout period check the cache before going out to fetch fresh data.
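A note on how the timeout behaves, as far as I understand `memoise::timeout()`: it rounds the current time down to a multiple of `seconds`, so cached entries expire at fixed interval boundaries rather than exactly `seconds` after they were stored. A small sketch using the function's `current` argument with made-up timestamps:

```r
library(memoise)

one_day <- 24 * 60^2

# timeout() returns the start of the interval that `current` falls in.
# The timestamps below are illustrative values, not real clock times.
bucket_start <- memoise::timeout(one_day, current = 0)
bucket_late  <- memoise::timeout(one_day, current = one_day - 1)
bucket_next  <- memoise::timeout(one_day, current = one_day)

# Same bucket means the same cache key; a new bucket forces a fresh call.
same_bucket <- bucket_start == bucket_late
new_bucket  <- bucket_next != bucket_late
```

So a result cached late in an interval may be refreshed sooner than a full year (or day) later.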

The first call takes nearly a full second in my case, since the data is not yet cached:

system.time(nhd <- memo_get_nhdplus(out_point))
Spherical geometry (s2) switched off
Spherical geometry (s2) switched on
   user  system elapsed
   0.13    0.00    0.83

A subsequent query is instantaneous:

nhd <- memo_get_nhdplus(out_point)

Retrieving basin info yields a much larger time savings, about 7 seconds for a relatively large basin (~3,500 sq km):

system.time(nhd <- memo_get_nhdplus(m_cat$basin))
Spherical geometry (s2) switched off
although coordinates are longitude/latitude, st_intersects assumes that they are planar
Spherical geometry (s2) switched on
   user  system elapsed
   3.83    0.14    6.59

And again:

system.time(nhd <- memo_get_nhdplus(m_cat$basin))
   user  system elapsed
      0       0       0

### Testing
- Point and basin data caches work well in my limited testing:
   - Data caches persist and are accessible after restarting my Windows machine.
   - Once got an error retrieving from cache after quitting RStudio, and 
- Plots work OK within a session, but appear to fail after restarts:
   - `memoised` plots successfully re-render within a single RStudio session.
   - Substantial time savings are achieved with `memoised` plots (3-5 seconds in my test).
   - Plots DO NOT re-render after a restart: the cached result is retrieved and the data *seems* intact, but the image does not show up in the plot window.
      - This looks to be a consequence of how plot_nhdplus() plots: it renders the plot but does not return it as part of the function's output list.

m_cat <- memo_plot_nhdplus(list(nhd_out$comid))
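One possible workaround, sketched under the assumption that the drawing is a side effect that the cache does not store: memoise only the data-fetching step and redraw on every call. The functions `fetch_data` and `draw` are hypothetical stand-ins, not nhdplusTools functions:

```r
library(memoise)

# Stand-in for the slow data retrieval; counts real executions.
fetches <- 0
fetch_data <- function(n) {
  fetches <<- fetches + 1
  Sys.sleep(0.1)
  data.frame(x = seq_len(n), y = seq_len(n)^2)
}

# Only the data step is cached (hypothetical temp cache directory).
plot_cache <- memoise::cache_filesystem(file.path(tempdir(), "plot_demo"))
memo_fetch <- memoise::memoise(fetch_data, cache = plot_cache)

# Drawing is left un-memoised, so it runs on every call.
draw <- function(d) {
  plot(d$x, d$y, type = "b")
  invisible(d)
}

grDevices::pdf(NULL)         # null device so the sketch runs headless
d1 <- draw(memo_fetch(10))   # fetches, then draws
d2 <- draw(memo_fetch(10))   # data from cache, plot drawn again
invisible(grDevices::dev.off())
```

The trade-off is that the rendering time is paid on each call, while the (usually slower) retrieval is still cached.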


![image](https://github.com/DOI-USGS/nhdplusTools/assets/4571170/3d24c163-a7da-44b2-8e03-9824dc74dea0)

dblodgett-usgs commented 11 months ago

+1 Thanks for the prompt @rburghol -- I need to think about how to best use this kind of thing in the package. nhdplusTools has undergone a lot of change recently and needs some further clean up.

dblodgett-usgs commented 7 months ago

I've introduced memoise as a dependency for something else and will start working it in over time. Sorry this has been on the back burner for a while.

dblodgett-usgs commented 7 months ago

I roughed in an implementation that I'm pretty happy with. No doubt there'll be issues, but it's a good start. See #364.

You can now set environment variables to control cache location (memory or disk) and duration. I've wrapped functions that make cacheable requests in memoise and use the pattern discussed above to control cache behavior. It defaults to a filesystem cache for one day.
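For readers who want to try the same idea in their own wrappers, here is a rough sketch of the general pattern. The variable names `DEMO_CACHE_MODE` and `DEMO_CACHE_TIMEOUT` and the helper functions are made up for illustration; they are not the names used by nhdplusTools, so check the package documentation for the real ones:

```r
library(memoise)

# Hypothetical environment variables, for illustration only.
cache_from_env <- function() {
  mode <- Sys.getenv("DEMO_CACHE_MODE", unset = "filesystem")
  if (identical(mode, "memory")) {
    memoise::cache_memory()
  } else {
    memoise::cache_filesystem(file.path(tempdir(), "demo_env_cache"))
  }
}

# Timeout in seconds; defaults to one day when the variable is unset.
timeout_from_env <- function() {
  as.numeric(Sys.getenv("DEMO_CACHE_TIMEOUT", unset = "86400"))
}

# Wrap a function using whatever the environment currently specifies.
memo_from_env <- function(f) {
  seconds <- timeout_from_env()
  memoise::memoise(f, ~memoise::timeout(seconds), cache = cache_from_env())
}

# Usage: an in-memory cache with a 60-second interval.
Sys.setenv(DEMO_CACHE_MODE = "memory", DEMO_CACHE_TIMEOUT = "60")
memo_sq <- memo_from_env(function(x) x^2)
result <- memo_sq(4)
```

Reading the settings at wrap time keeps the wrapped function simple, at the cost that changing the variables later has no effect on wrappers that already exist.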

dblodgett-usgs commented 7 months ago

A start is in now -- please test and open follow-up issues if things are not right.