AtlasOfLivingAustralia / galah-R

Query living atlases from R
https://galah.ala.org.au

Improve caching #55

Closed matildastevenson closed 2 years ago

matildastevenson commented 3 years ago

Queries to the ALA can take a long time to run when a large amount of data is requested. galah provides caching functionality so that large queries do not need to be run multiple times. By default, caching is turned off. ala_config() turns caching on when caching is set to TRUE and a path is provided to the directory where the cached files should be stored. To keep caching on between sessions, users can also provide a path to a .Rprofile file in the call to ala_config().

ala_config(caching = TRUE, cache_directory = "cached_data/", profile_path = ".Rprofile")

How caching currently works

To retrieve cached data, you currently need to re-run exactly the same ala_ query that was originally run. For example:

# Set caching on
ala_config(caching = TRUE, verbose = TRUE)
# First time running this, the data will be downloaded from the ALA and stored
# in the provided cache directory
occ <- ala_occurrences(
  taxa = select_taxa("Sarcophilus harrisii"),
  filters = select_filters(year = 2020)
)
attributes(occ)
# Second time running this, the function will use the file from the cache
# directory
occ_cached <- ala_occurrences(
  taxa = select_taxa("Sarcophilus harrisii"),
  filters = select_filters(year = 2020)
)

attributes(occ_cached)

Attributes are different because search_url is not currently added to cached files.
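A quick way to see which attributes are missing is to compare the attribute names of the two objects. The sketch below uses small stand-in data frames instead of real ALA downloads, and the URL is a placeholder, not a real query:

```r
# Stand-ins for a fresh download and a cached copy (no ALA request is made)
fresh <- data.frame(scientificName = "Sarcophilus harrisii")
attr(fresh, "search_url") <- "https://example.org/placeholder-query"
cached <- data.frame(scientificName = "Sarcophilus harrisii")

# Attribute names present on the fresh result but absent from the cached one
missing_attrs <- setdiff(names(attributes(fresh)), names(attributes(cached)))
print(missing_attrs)
```

With real data, the same comparison would be run on occ and occ_cached.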

How files could be cached

ala_ functions cache files as .rds files. When the call is made again, the data is loaded from the cache. For data such as that returned from ala_occurrences(), this will include attributes such as the DOI (if generated) and the search URL. The cache filename is a hash of the function called and the arguments passed to it, with the arguments sorted before hashing. This means that providing the same arguments in a different order will not re-download the data, but providing additional arguments (e.g. columns) will.

An additional function, find_cached_files(), could also be created. Running this with no arguments would return a data.frame of the ids of cached files and the function calls used to generate the files.

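find_cached_files <- function(id) {
  cache_dir <- getOption("galah_config")$cache_directory
  if (missing(id)) {
    # Return a data.frame of all cached files and the search URL behind each
    files <- list.files(cache_dir, pattern = "\\.rds$")
    cached_df <- data.table::rbindlist(lapply(files, function(f) {
      df <- readRDS(file.path(cache_dir, f))
      url <- attr(df, "search_url")
      # search_url may be absent on files that were cached without it
      data.frame(file_id = f, search_url = if (is.null(url)) NA_character_ else url)
    }))
    return(as.data.frame(cached_df))
  }
  # Behaviour when `id` is supplied is not yet defined in this proposal
}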
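The cache filename scheme described above could be sketched roughly as follows. This is only an illustration: cache_filename() is a hypothetical helper, not part of galah, and it uses tools::md5sum() on a temporary file so that it runs with base R alone (a hashing package such as digest would be an alternative).

```r
# Hypothetical helper: derive a cache filename from a function name and a
# named list of arguments, independent of the order of the arguments.
cache_filename <- function(fn_name, args) {
  # Sort arguments by name so that argument order does not change the hash
  args <- args[order(names(args))]
  # Write a text representation of the call to a temporary file and hash it
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(c(fn_name, deparse(args)), tmp)
  paste0(unname(tools::md5sum(tmp)), ".rds")
}

a <- cache_filename("ala_occurrences", list(taxa = "Sarcophilus harrisii", year = 2020))
b <- cache_filename("ala_occurrences", list(year = 2020, taxa = "Sarcophilus harrisii"))
identical(a, b)  # same arguments in a different order give the same filename
```

Adding an extra argument, such as columns, changes the text that is hashed and therefore produces a different filename, which would trigger a fresh download.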

Attributes to store on data

Currently, returned occurrence data contains the search_url and the DOI (if generated) as attributes. Given there are a finite number of arguments to ala_ functions, these could also be stored as attributes of the data.frame.
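As a rough sketch of how this could work, the arguments can be attached with attr() before the result is written to the cache. Attributes on a data.frame survive a saveRDS()/readRDS() round trip. The helper name add_call_attributes() and the attribute names "call" and "arguments" are assumptions made for illustration, not existing galah names.

```r
# Hypothetical helper: record the function name and arguments on the result
add_call_attributes <- function(df, fn_name, args) {
  attr(df, "call") <- fn_name
  attr(df, "arguments") <- args
  df
}

# Stand-in for downloaded occurrence data (no ALA request is made)
occ <- data.frame(scientificName = "Sarcophilus harrisii", year = 2020)
occ <- add_call_attributes(occ, "ala_occurrences",
                           list(taxa = "Sarcophilus harrisii", year = 2020))

# Round trip through an .rds file, as the cache would do
cache_file <- tempfile(fileext = ".rds")
saveRDS(occ, cache_file)
occ_cached <- readRDS(cache_file)
attr(occ_cached, "arguments")
```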