DOI-USGS / dataRetrieval

This R package is designed to obtain USGS or EPA water quality sample data, streamflow data, and metadata directly from web services.
https://doi-usgs.github.io/dataRetrieval/

Enable Caching of data #681

Closed: rburghol closed this issue 7 months ago

rburghol commented 8 months ago

I would propose that a simple local cache facility be enabled for functions like readNWISdv() and readNWISsite(), etc.

I know that my own workflow would be enhanced by caching historical queries, especially during development, when we might use dataRetrieval to hit NWIS many times in a single day for the same station. Obviously (or maybe not so obviously), we don't deliberately hit the services multiple times, but when running code via Rscript, or with data retrieval in a loop of sorts, it can happen that the same data gets pulled several times in a day.

I am guessing that I am not alone in this, but wanted to ask. If I put together some code for a feature to allow caching of data retrieval requests on the requestor's local drive, would that be desirable for others? My thinking is that if this occurrence is widespread, server traffic might see substantial savings with caching added.

library(dataRetrieval)
gageid <- '02024752'
historic <- dataRetrieval::readNWISdv(gageid, '00060', use_cache=TRUE, cache_file=paste0(gageid, '.csv'))
gage_info <- dataRetrieval::readNWISsite(gageid, use_cache=TRUE)

# where:
# use_cache default = FALSE  # retains existing behavior
# cache_file default = function name + "_" + gageid + ".csv", e.g. 'NWISsite_02024752.csv' in the current working directory

So, if one did not explicitly set use_cache = TRUE, all would proceed as normal.
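To make the idea concrete, here is a rough sketch of how those proposed arguments could behave if implemented as a thin wrapper (use_cache and cache_file are not real dataRetrieval arguments today, and caching to a CSV would drop the attributes dataRetrieval attaches to its results):

# Hypothetical wrapper sketching the proposed use_cache / cache_file behavior.
cached_readNWISdv <- function(siteNumbers, parameterCd,
                              use_cache = FALSE,
                              cache_file = paste0("NWISdv_", siteNumbers[1], ".csv"),
                              ...) {
  if (use_cache && file.exists(cache_file)) {
    # keep the site number as text and restore the Date column from the CSV
    return(read.csv(cache_file,
                    colClasses = c(site_no = "character", Date = "Date")))
  }
  result <- dataRetrieval::readNWISdv(siteNumbers, parameterCd, ...)
  if (use_cache) {
    write.csv(result, cache_file, row.names = FALSE)
  }
  result
}

# e.g. historic <- cached_readNWISdv('02024752', '00060', use_cache = TRUE)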

ldecicco-USGS commented 8 months ago

My gut reaction for that functionality is to recommend the targets package: https://books.ropensci.org/targets/. targets is powerful and flexible, with different ways to cache "targets", and I hate to re-invent the wheel.

Here's an example of how I'd use targets.

Let's say this is our basic R script:

library(dataRetrieval)
library(tidyverse)

setup <- list(site_ids = c("05114000", "04024430"),
              pcodes = c("00060", "00010"),
              startDate = "2023-10-01",
              endDate = "")

df_raw <- readNWISuv(siteNumbers = setup$site_ids,
           parameterCd = setup$pcodes,
           startDate = setup$startDate,
           endDate = setup$endDate)

df_long <- df_raw |> 
  renameNWISColumns() |> 
  select(site_no, Wtemp_Inst, Flow_Inst, dateTime) |> 
  pivot_longer(cols = c(Wtemp_Inst, Flow_Inst)) |> 
  filter(!is.na(value))

ggplot(data = df_long) +
  geom_line(aes(x = dateTime, y = value)) +
  facet_grid(name ~ site_no,
             scales = "free") +
  theme_bw()

If I wanted to instead use a targets workflow, I would first save those last commands as functions. Let's say I save this in a file called "functions.R":

longify <- function(x){
  x |> 
    renameNWISColumns() |> 
    select(site_no, Wtemp_Inst, Flow_Inst, dateTime) |> 
    pivot_longer(cols = c(Wtemp_Inst, Flow_Inst)) |> 
    filter(!is.na(value))
}

plot_it <- function(x){
  ggplot(data = x) +
    geom_line(aes(x = dateTime, y = value)) +
    facet_grid(name ~ site_no,
               scales = "free") +
    theme_bw()
}

Then create a file "_targets.R" with the following:

library(targets)
tar_option_set(packages = c("dataRetrieval", "tidyverse"))
source("functions.R")

list(tar_target(setup, list(site_ids = c("05114000", "04024430"),
                            pcodes = c("00060", "00010"),
                            startDate = "2023-10-01",
                            endDate = "")),
     tar_target(rawData, 
                command = readNWISuv(siteNumbers = setup$site_ids,
                                     parameterCd = setup$pcodes,
                                     startDate = setup$startDate,
                                     endDate = setup$endDate)),
     tar_target(df, longify(rawData)),
     tar_target(plot_out, plot_it(df)))

Putting it together I could run this:

tar_make()

▶ start target setup
● built target setup [1.36 seconds]
▶ start target rawData
Initiating curl with CURL_SSL_BACKEND: openssl
● built target rawData [1.34 seconds]
▶ start target df
● built target df [0.04 seconds]
▶ start target plot_out
● built target plot_out [0.02 seconds]
▶ end pipeline [2.99 seconds]
tar_load(plot_out)
plot_out

(plot image omitted)

Let's say I want to fiddle with the plot style. If I edit the functions.R file and change plot_it to:

plot_it <- function(x){
  ggplot(data = x) +
    geom_line(aes(x = dateTime, y = value)) +
    facet_grid(name ~ site_no,
               scales = "free") +
    theme_minimal() +
    theme(axis.title = element_blank())
}

If I run:

tar_visnetwork()

I see the dependency graph (image omitted), which shows which parts of the workflow need updating. Running:

tar_make()
✔ skip target setup
✔ skip target rawData
✔ skip target df
▶ start target plot_out
● built target plot_out [0.01 seconds]
▶ end pipeline [1.66 seconds]
tar_load(plot_out)
plot_out

I get my updated plot, without having re-run the dataRetrieval code (image omitted).

What if I do want to re-run the dataRetrieval code? Use tar_invalidate on the dataRetrieval target (so the same as clearing the cache):

tar_invalidate(rawData)
tar_make()
✔ skip target setup
Initiating curl with CURL_SSL_BACKEND: openssl
▶ start target rawData
● built target rawData [1.33 seconds]
▶ start target df
● built target df [0.03 seconds]
▶ start target plot_out
● built target plot_out [0.01 seconds]
▶ end pipeline [2.49 seconds]

If you didn't want to jump ALL IN on targets, you could pull the df and plot_out parts out of the _targets.R file, and start your script with:

library(targets)
library(dataRetrieval)
library(tidyverse)

tar_make()
tar_load(rawData)

df_long <- rawData |> 
  renameNWISColumns() |> 
  select(site_no, Wtemp_Inst, Flow_Inst, dateTime) |> 
  pivot_longer(cols = c(Wtemp_Inst, Flow_Inst)) |> 
  filter(!is.na(value))

ggplot(data = df_long) +
  geom_line(aes(x = dateTime, y = value)) +
  facet_grid(name ~ site_no,
             scales = "free") +
  theme_bw()

Another package I played around with a few years ago was memoise: https://cran.r-project.org/web/packages/memoise/index.html. It's another option for caching results.

For the record... I've had it in my head to turn something like this into one of the "Articles" on the dataRetrieval doc pages: https://doi-usgs.github.io/dataRetrieval/. This issue might inspire me to finally get to that. We already talk a little about targets in the large data pull article, but it can be useful in almost any workflow.

rburghol commented 8 months ago

Thanks for the thoughts and code. I hear you on the virtue of not altering existing, working data streams, though adding a new argument with optional behavior should make that a non-issue -- assuming the code was perfect from day 1 :). Also, I was not aware of targets; it seems useful for R pipelines, and while I only have surface knowledge of it, it looks like a workable approach for some use cases inside of R. We do most of our pipelining via bash, so my comments may reflect that bias, or just my lack of understanding of persistence in targets.

With that said, here are my thoughts on why that may not be an optimal approach:

Here also are a couple of questions (my assumptions), which may be simply my ignorance of targets:

Now, I don't know what USGS sees in terms of redundant queries as a percentage of overall daily traffic, so my experience may be an edge case, though I would guess it may be a few percent in a day. I frequently get timeouts when doing NWIS queries (maybe 1 out of 2 requests time out at certain times of the day).

ldecicco-USGS commented 8 months ago

targets will work the same way as described above with your example dataRetrieval::readNWISdv(gageid, '00060'). The target workflow will not know that this particular function is out of date each day; it only knows that it ran once and was cached. I'm not sure that's what you mean by being "broken". Personally, that's how I would assume a cache would work as well... that the cache would always be stale if the dates were not included. There are some fancy ways to set up a target to be smarter about when to run or not; see the tarchetypes concepts: https://docs.ropensci.org/tarchetypes/index.html
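For example, a rough _targets.R sketch using tarchetypes::tar_age, which marks a target outdated once it is older than a given age (site and parameter values are reused from the example above; this is only a sketch, not dataRetrieval functionality):

library(targets)
library(tarchetypes)
tar_option_set(packages = "dataRetrieval")

list(
  # rebuild rawData once the cached copy is more than a day old
  tar_age(
    rawData,
    readNWISuv(siteNumbers = c("05114000", "04024430"),
               parameterCd = c("00060", "00010"),
               startDate = "2023-10-01"),
    age = as.difftime(1, units = "days")
  )
)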

With the example I used above, if you changed the start date or added an end date, it would know that the preceding target "setup" changed, and that would trigger the readNWISuv function to be re-run.

You could set up a job that uses a variable (NWIS_use_cache_default = TRUE, for example) to force the job to either re-run or use the cache (I'd again refer to tarchetypes). But yes, it would require users to learn targets. So far, I've only really recommended it to groups that are doing large pulls (and thus NEED a pipeline). But the targets community has grown so large in the past couple of years that I've recently been thinking it's time to start evangelizing targets for small- to mid-sized projects as well. We've studied the server logs in the past and haven't noticed any appreciable repeat queries that affect performance UNLESS the code is being run in parallel and the server gets hit by 100s-1000s of jobs simultaneously. That can very quickly bog down the server and cause everyone problems.
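As a sketch of that idea (NWIS_use_cache_default is just the illustrative variable name from above, not a real setting in dataRetrieval or targets), the rawData target in the earlier _targets.R could get a cue like this:

# top of _targets.R: read the switch from the environment
use_cache <- identical(Sys.getenv("NWIS_use_cache_default", "TRUE"), "TRUE")

# inside the list() from the earlier example, alongside the setup target
tar_target(
  rawData,
  readNWISuv(siteNumbers = setup$site_ids,
             parameterCd = setup$pcodes,
             startDate = setup$startDate,
             endDate = setup$endDate),
  cue = tar_cue(mode = if (use_cache) "thorough" else "always")
)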

You could set up a target workflow that runs some dataRetrieval code, and separate workflows could draw on those caches: https://books.ropensci.org/targets/projects.html#interdependent-projects. The targets community is great, and always growing. I'd check out both the local and cloud sections for other ways to share caches: https://books.ropensci.org/targets/data.html https://books.ropensci.org/targets/cloud-storage.html
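For instance, a second project could read the cached target straight out of the first project's data store without re-running the pull (the path below is a placeholder):

library(targets)
# read the rawData target that the first project already built and cached
rawData <- tar_read(rawData, store = "/path/to/projectA/_targets")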

As you can probably tell, I'm very hesitant to add any caching to dataRetrieval. I know a lot of groups have put a lot of work into making really fantastic pipeline tools, and it's a big lift to do it well. I happen to prefer targets, but I don't knock other solutions. We made a design decision long ago to keep project creep to a minimum, and I know caching would be more than just saving to a csv if an argument is set to TRUE (plus saving to a csv would lose all the attributes, which severely limits the usefulness of the data).

There are many reasons for dataRetrieval timeouts; diving into all of them would take a book. It can depend on which service you are hitting. Some services do really poorly with a lot of query parameters. Often the result is just really big and takes too long to transfer (several sites with a full period of record using the "uv" service will probably cause a timeout). If you are seeing timeouts 50% of the time, it might be worth creating a new issue with some example queries, and we can think about ways to improve those results. For instance, I'd recommend the "readWQPsummary" and "whatNWISdata" functions to trim down the sites/parameters before getting all the data.
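As a small sketch of that kind of pre-check, something like this shows which daily-value streamflow series exist for a site, and over what period, before requesting all of the data:

library(dataRetrieval)
# list every data series NWIS has for this site
available <- whatNWISdata(siteNumber = "05114000")
# keep only the daily-value discharge series and look at the period of record
dv_flow <- available[available$data_type_cd == "dv" & available$parm_cd %in% "00060", ]
dv_flow[, c("site_no", "parm_cd", "begin_date", "end_date", "count_nu")]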

Our articles on "Large Data Pulls" talk about a few ways of dealing with timeouts (most of which I mentioned here already). They both use WQP, but the same ideas would work with NWIS data: https://doi-usgs.github.io/dataRetrieval/articles/wqp_large_pull_targets.html https://doi-usgs.github.io/dataRetrieval/articles/wqp_large_pull_script.html

You also might want to check out this material, which also touches on some of these issues: https://rconnect.usgs.gov/NMC_dataRetrieval_2

dblodgett-usgs commented 8 months ago

Perhaps a lighter-weight, but still non-invasive, approach could be to use memoise?

e.g.

gageid <- '02024752'

dir <- tools::R_user_dir("dataRetrieval", "cache")

unlink(dir, recursive = TRUE, force = TRUE)

db <- memoise::cache_filesystem(tools::R_user_dir("dataRetrieval", "cache"))

my_readNWISdv <- memoise::memoise(dataRetrieval::readNWISdv, cache = db)
my_readNWISsite <- memoise::memoise(dataRetrieval::readNWISsite, cache = db)

system.time(my_readNWISdv(gageid, '00060'))
#>    user  system elapsed 
#>    0.14    0.19    0.73

system.time(my_readNWISdv(gageid, '00060'))
#>    user  system elapsed 
#>    0.01    0.00    0.01

system.time(my_readNWISsite(gageid))
#>    user  system elapsed 
#>    0.15    0.06    0.24

system.time(my_readNWISsite(gageid))
#>    user  system elapsed 
#>       0       0       0

Created on 2023-11-10 with reprex v2.0.2

rburghol commented 7 months ago

Thanks for the dialog @ldecicco-USGS @dblodgett-usgs -- a quick follow-up on my statement about "broken", and a few thoughts about caching. This statement, "The target workflow will not know that this particular function is out of date each day," is the answer to "I'm not sure that's what you mean by being 'broken'."

Now, I don't wish to waste anyone's time. If my own intermittent timeouts (to be clear, the 50% timeouts only occur episodically, at what I suspect are peak traffic times during the day) are unique to me and/or a result of some network latency issue my org suffers from, that is not your problem, and thanks for the thoughts on how I might cache. Though to be sure, the timing of the memoise examples that @dblodgett-usgs provided certainly demonstrates potential performance gains, regardless of whether timeouts are an issue at all. And even if only 1% of queries are potentially cacheable, I bet that would save a ton of CPU/network resources, given what I assume is the popularity of this service.

But, for the sake of discussion about cache functionality, I will share a few thoughts on data streams and caches:

OK, so why not just roll my own cache, or build a structure with another data stream tool?

And of course, the purpose of this issue is to see if there are folks who might want to collaborate on the cache idea, regardless of where it would sit. All the code examples that @ldecicco-USGS and @dblodgett-usgs gave are compelling, and I am gonna keep this issue open for a few days whilst I spin up a couple of code samples for the approach that I favor. Thanks again!

dblodgett-usgs commented 7 months ago

Perhaps something more like this?

gageid <- '02024752'

dir <- tools::R_user_dir("dataRetrieval", "cache")

unlink(dir, recursive = TRUE, force = TRUE)

db <- memoise::cache_filesystem(tools::R_user_dir("dataRetrieval", "cache"))

five_seconds <- 5
fifteen_minutes <- 60*15
one_day <- 24*60^2

# we could cache certain calls for shorter or longer depending on the type of request.

my_readNWISdv <- memoise::memoise(dataRetrieval::readNWISdv, ~memoise::timeout(fifteen_minutes), cache = db)
my_readNWISsite <- memoise::memoise(dataRetrieval::readNWISsite, ~memoise::timeout(one_day), cache = db)

system.time(my_readNWISdv(gageid, '00060'))
#>    user  system elapsed 
#>    0.23    0.20    0.72

system.time(my_readNWISdv(gageid, '00060'))
#>    user  system elapsed 
#>    0.02    0.00    0.02

system.time(my_readNWISsite(gageid))
#>    user  system elapsed 
#>    0.16    0.13    0.28

system.time(my_readNWISsite(gageid))
#>    user  system elapsed 
#>       0       0       0

# Just to show that this does work, let's put in a short timeout.

my_readNWISdv <- memoise::memoise(dataRetrieval::readNWISdv, ~memoise::timeout(five_seconds), cache = db)
my_readNWISsite <- memoise::memoise(dataRetrieval::readNWISsite, ~memoise::timeout(five_seconds), cache = db)

system.time(my_readNWISdv(gageid, '00060'))
#>    user  system elapsed 
#>    0.18    0.01    0.39

Sys.sleep(five_seconds + 1)

system.time(my_readNWISdv(gageid, '00060'))
#>    user  system elapsed 
#>    0.14    0.03    0.46

system.time(my_readNWISsite(gageid))
#>    user  system elapsed 
#>    0.06    0.06    0.12

Sys.sleep(five_seconds + 1)

system.time(my_readNWISsite(gageid))
#>    user  system elapsed 
#>    0.06    0.08    0.22

Created on 2023-11-13 with reprex v2.0.2

rburghol commented 7 months ago

Yes, this is super cool, and also very concise code-wise @dblodgett-usgs -- the use of memoise/cachem::cache_filesystem plus a timeout seems to work for the simplest (and, I assume, therefore the most common) use case. A 36x improvement in speed is no joke!

dblodgett-usgs commented 7 months ago

Good deal. Yeah, I've always wanted an excuse to play with memoise -- I should start using it in nhdplusTools now that I've seen how it works. Very simple implementation of caching.

ldecicco-USGS commented 7 months ago

Yeah, I toyed with memoise a few years ago and even considered putting it in dataRetrieval, but ultimately decided that the workflow Dave showed above was much better than embedding it in the code itself. Thanks for adding those examples - I had meant to circle back to memoise in my 2nd comment, but I got caught up answering the targets-specific questions (and it was a Friday afternoon!).

The blog idea I've had in my head for a while about using targets in everyday workflows (not just "Big Data ™") would include some examples of vanilla memoise and a memoise/targets combo (which can get pretty flexible). As you describe, @rburghol, targets isn't strictly a caching package, but a proper data pipeline system like targets reduces/eliminates the need for caching (and improperly implemented caching has bitten me more times than I'm happy to admit).
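As a rough sketch of that memoise/targets combo (the cache location and the 15-minute timeout are only illustrative), functions.R could define a memoised wrapper and _targets.R could use it as the target command:

# functions.R: a memoised dataRetrieval call used as a target command, so a
# re-run of the target within the timeout window can still hit the local
# filesystem cache instead of the web service.
cache_dir <- tools::R_user_dir("dataRetrieval", "cache")
if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
db <- memoise::cache_filesystem(cache_dir)
memo_readNWISuv <- memoise::memoise(dataRetrieval::readNWISuv,
                                    ~memoise::timeout(15 * 60),
                                    cache = db)

# _targets.R would then use the memoised wrapper as the command:
# tar_target(rawData, memo_readNWISuv(siteNumbers = setup$site_ids,
#                                     parameterCd = setup$pcodes,
#                                     startDate = setup$startDate))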

I can appreciate the view that depending on fewer packages feels safer, but I think there can/should be a healthy balance of not re-inventing the wheel if a reliable solution exists.

rburghol commented 7 months ago

For what it's worth, I had to tweak @dblodgett-usgs's example slightly (I needed to create the cache dir, which was absent from my install) -- and I took the liberty of renaming things and allowing a cache dir in a neutral location (/media/model/usgs, an NFS drive that multiple Linux nodes share for model data), since much of our code/workflows run as various users (ordinary users or the web server):

Code 1: config.R - a globally used config file.

# Bunch of other shared code goes here
# ....
# end other stuff

# set up a caching function for NWIS and other USGS data services
library("memoise")
dir <- Sys.getenv("USGS_cache_dir")
if (dir == "") {
  dir <- tools::R_user_dir(package="dataRetrieval", which="cache")
}
# create dir if it does not exist
if (!dir.exists(dir)) {
  dir.create(dir, recursive=TRUE)
}
# these could be in a module file
db <- memoise::cache_filesystem(dir)
one_day <- 24*60^2
memo_readNWISdv <- memoise::memoise(dataRetrieval::readNWISdv, ~memoise::timeout(one_day), cache = db)
memo_readNWISsite <- memoise::memoise(dataRetrieval::readNWISsite, ~memoise::timeout(one_day), cache = db)

So then I exchanged memo_readNWISdv() for readNWISdv() in a script that runs daily, and voila. I note:

A cool middle path might be adding a memoised function to dataRetrieval, like memo_readNWISdv(), so that it could be accessible to all (which would require memoise as a dependency), or spinning up a DataRetrievalCaches package.
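A rough sketch of what such an exported helper might look like (the memoised_nwis name and its defaults are hypothetical, not anything dataRetrieval or memoise currently provides):

# Hypothetical factory that dataRetrieval (or a companion package) could export,
# so users get memoised variants without repeating the cache setup.
memoised_nwis <- function(fun,
                          timeout_seconds = 24 * 60^2,
                          cache_dir = Sys.getenv("USGS_cache_dir",
                                                 tools::R_user_dir("dataRetrieval", "cache"))) {
  if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
  db <- memoise::cache_filesystem(cache_dir)
  memoise::memoise(fun, ~memoise::timeout(timeout_seconds), cache = db)
}

# e.g. memo_readNWISdv <- memoised_nwis(dataRetrieval::readNWISdv)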

rburghol commented 7 months ago

BTW @dblodgett-usgs this:

I should start using it in nhdplusTools now that I've seen how it works. Very simple implementation of caching.

I think this would be an amazing bandwidth saver. In fact, I have another project I am working with that I will encourage to start playing with memoise this week, as requests to that data source are, predictably, subject to timeouts. If you have the inkling to track your work on that project's page, I'd love a link.

rburghol commented 7 months ago

Closing, as I am guessing this is not to be fixed, but a workaround exists.