Queries to the ALA can take a long time to run when a large amount of data is requested. galah provides caching functionality to reduce the need to run large queries multiple times. By default, caching is turned off. To turn it on, call ala_config() with caching set to TRUE and a path to the directory where the cached files should be stored. To keep caching on between sessions, users can provide a path to a .Rprofile file in the call to ala_config().
How caching currently works
To retrieve cached data, you currently need to run the same ala_ query as was originally run. For example:
# Set caching on
ala_config(caching = TRUE, verbose = TRUE)
# First time running this, the data will be downloaded from the ALA and stored
# in the provided cache directory
occ <- ala_occurrences(
  taxa = select_taxa("Sarcophilus harrisii"),
  filters = select_filters(year = 2020)
)
attributes(occ)
# Second time running this, the function will use the file from the cache
# directory
occ_cached <- ala_occurrences(
  taxa = select_taxa("Sarcophilus harrisii"),
  filters = select_filters(year = 2020)
)
attributes(occ_cached)
Attributes are different because search_url is not currently added to cached files.
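The behaviour shown above follows a read-through pattern: look for a cached file, load it if it exists, otherwise run the query and save the result. A minimal base R sketch of that pattern is given here. The helper `with_cache()` and the stand-in `slow_query()` are hypothetical names used for illustration only; they are not part of galah, and galah's internals may differ.

```r
# Hypothetical helper, not part of galah: run `query_fn` only when no cached
# copy exists at `cache_file`, otherwise load the cached copy.
with_cache <- function(cache_file, query_fn) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  result <- query_fn()
  saveRDS(result, cache_file)
  result
}

# Stand-in for a slow ALA query; counts how often it is actually run
n_downloads <- 0
slow_query <- function() {
  n_downloads <<- n_downloads + 1
  data.frame(scientificName = "Sarcophilus harrisii", year = 2020)
}

cache_file <- file.path(tempdir(), "example_query.rds")
unlink(cache_file)  # start from an empty cache for this example

first <- with_cache(cache_file, slow_query)   # runs the query, writes the file
second <- with_cache(cache_file, slow_query)  # reads the file, no new query
```

After both calls, `n_downloads` has only been incremented once, because the second call is served from the .rds file.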
How files could be cached
ala_ functions cache files as .rds files. When the same call is made again, the data
is loaded from the cache. For data such as that returned from ala_occurrences(),
this will include attributes such as the DOI (if generated) and the search url.
The cache filename is a hash of the function called and the arguments passed to it,
with the arguments sorted before hashing. This means that providing the same
arguments in a different order will not re-download the data, but providing
additional arguments (e.g. columns) will.
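One way such a filename could be derived is sketched below in base R. The helper name `cache_filename()` is hypothetical, and the use of MD5 via `tools::md5sum()` is an assumption made for the sketch; the hash function galah actually uses is not stated here.

```r
# Hypothetical helper, not part of galah: derive a cache filename from the
# function name and its (named) arguments.
cache_filename <- function(fn_name, args) {
  # Sort arguments by name so that argument order does not affect the hash
  args <- args[order(names(args))]
  tmp <- tempfile()
  on.exit(unlink(tmp))
  # Serialise the call description to bytes and hash the resulting file
  writeBin(serialize(list(fn = fn_name, args = args), NULL), tmp)
  paste0(unname(tools::md5sum(tmp)), ".rds")
}

a <- cache_filename("ala_occurrences",
                    list(taxa = "Sarcophilus harrisii", year = 2020))
b <- cache_filename("ala_occurrences",
                    list(year = 2020, taxa = "Sarcophilus harrisii"))
d <- cache_filename("ala_occurrences",
                    list(taxa = "Sarcophilus harrisii", year = 2020,
                         columns = "basic"))

identical(a, b)  # same arguments in a different order give the same filename
identical(a, d)  # an additional argument gives a different filename
```

Sorting by argument name before hashing is what makes the filename independent of argument order; any change to the set of arguments or their values produces a new file.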
An additional function, find_cached_files(), could also be created.
Running it with no arguments would return a data.frame
of the ids of cached files and the function calls used to generate them:
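find_cached_files <- function(id) {
  cache_dir <- getOption("galah_config")$cache_directory
  if (missing(id)) {
    # return a data.frame of all cached files
    files <- list.files(cache_dir)
    cached_df <- data.table::rbindlist(lapply(files, function(f) {
      df <- readRDS(file.path(cache_dir, f))
      url <- attr(df, "search_url")
      # search_url may be missing from cached files, so fall back to NA
      data.frame(file_id = f,
                 search_url = if (is.null(url)) NA_character_ else url)
    }))
    # return the result explicitly as a plain data.frame
    return(as.data.frame(cached_df))
  }
}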
Attributes to store on data
Currently, returned occurrence data contains the search_url and the DOI (if generated)
as attributes. Given there are a finite number of arguments to ala_ functions, these could also be stored as attributes of the data.frame.
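A minimal sketch of how call details could be attached to a result is given below, using base R only. The helper `add_call_attributes()` and the attribute names `function_name` and `arguments` are hypothetical, chosen for illustration; they are not existing galah attributes.

```r
# Hypothetical helper, not part of galah: record which function was called,
# and with which arguments, as attributes of the returned data.frame.
add_call_attributes <- function(df, fn_name, args) {
  attr(df, "function_name") <- fn_name
  attr(df, "arguments") <- args
  df
}

# Stand-in for data returned from an occurrence query
occ <- data.frame(scientificName = "Sarcophilus harrisii", year = 2020)
occ <- add_call_attributes(
  occ,
  fn_name = "ala_occurrences",
  args = list(taxa = "Sarcophilus harrisii", filters = list(year = 2020))
)

# Attributes are kept when the data.frame is written to and read from .rds
cache_file <- tempfile(fileext = ".rds")
saveRDS(occ, cache_file)
occ_cached <- readRDS(cache_file)

attr(occ_cached, "function_name")
attr(occ_cached, "arguments")
```

Because saveRDS() stores the whole R object, attributes set before the file is written are available again when it is read back, so a cached file could describe the call that produced it.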