bio-oracle / biooracler

R package to access Bio-Oracle data via ERDDAP

Check if data has already been downloaded #12

Closed jflowernet closed 4 weeks ago

jflowernet commented 4 months ago

First, thanks for biooracler! It is great to be able to do spatial queries on the Bio-Oracle datasets within R.

I may have missed something, but download_layers() doesn't seem to check whether the data has already been downloaded. Doing so would be a great enhancement and would avoid unnecessary calls to the ERDDAP server. The marmap package does this by building a filename from the query parameters and checking whether that file already exists before downloading the data; see here.
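The filename-based check described above could be sketched roughly as follows. This is only an illustration of the general idea, not marmap's or biooracler's actual code: `cache_filename()` and `cached_call()` are hypothetical helpers, and the `fetch` function is a stand-in for a real call such as `download_layers()`.

```r
# Hypothetical helpers, not part of biooracler, rerddap, or marmap.

# Build a deterministic file path from the query parameters, so that the
# same query always maps to the same file.
cache_filename <- function(dataset_id, variables, constraints, dir = tempdir()) {
  # Flatten each named constraint to "name_value1_value2"
  parts <- vapply(names(constraints), function(n) {
    paste(c(n, constraints[[n]]), collapse = "_")
  }, character(1))
  key <- paste(c(dataset_id, variables, parts), collapse = "__")
  # Replace characters that are awkward in file names (e.g. ":" in timestamps)
  key <- gsub("[^A-Za-z0-9_.-]", "-", key)
  file.path(dir, paste0(key, ".rds"))
}

# Return the saved result if the file exists; otherwise run fetch() and save it.
cached_call <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch()
  saveRDS(result, path)
  result
}

# Usage with a stand-in for the real download
constraints <- list(
  time = c("2001-01-01T00:00:00Z", "2010-01-01T00:00:00Z"),
  latitude = c(0, 89.975),
  longitude = c(-179.975, 179.975)
)
path <- cache_filename("tas_baseline_2000_2020_depthsurf", "tas_mean", constraints)
unlink(path)  # start from a clean state for the demonstration

n_fetches <- 0
fake_fetch <- function() {
  n_fetches <<- n_fetches + 1  # count how often the "server" is hit
  data.frame(tas_mean = c(1.5, 2.5))
}

first <- cached_call(path, fake_fetch)   # runs fake_fetch() and writes the file
second <- cached_call(path, fake_fetch)  # reads the file; fake_fetch() is not run
```

Saving with `saveRDS()` is an assumption that suits plain R objects like the data frame here; objects that hold references to files on disk may need a format-specific writer instead.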

salvafern commented 1 month ago

Hi @jflowernet

Thank you for the kind words, and apologies for the late reply (I didn't get a notification for some reason). Do you have a reproducible example? This is odd, as this package is just a lightweight wrapper around rerddap, which already handles caching nicely.

See reprex below:

library(biooracler)

# Download average air temperature data
dataset_id = "tas_baseline_2000_2020_depthsurf"
variables = c("tas_mean")

# Decade 2000-2010
time = c('2001-01-01T00:00:00Z', '2010-01-01T00:00:00Z')

# Select northern hemisphere
latitude = c(0, 89.975)
longitude = c(-179.975, 179.975)

# Set up constraints
constraints = list(time, latitude, longitude)
names(constraints) = c("time", "latitude", "longitude")

# Make sure cache is purged
rerddap::cache_delete_all()

# Perform download as netcdf ~25 seconds
system.time({
  layer <- download_layers(dataset_id, variables, constraints, fmt = "nc")
})
#> Selected dataset tas_baseline_2000_2020_depthsurf.
#> Dataset info available at: http://erddap.bio-oracle.org/erddap/griddap/tas_baseline_2000_2020_depthsurf.html
#> Selected 1 variables: tas_mean
#>    user  system elapsed 
#>   3.889   1.112  24.851

# Check cached files
rerddap::cache_list()
#> <rerddap cached files>
#>  NetCDF files: 
#>      c31e29ee465de830a8bd07e7512fdf48.nc
#>  CSV files:

# Check cache path
rerddap::cache_info()
#> $path
#> [1] "/tmp/RtmpbyBpQW/R/rerddap"
#> 
#> $no_files
#> [1] 1

# Get path to cached file
layer_path <- rerddap::cache_details(layer)[[1]]$info$filename
file.exists(layer_path)
#> [1] TRUE

# Download one more time - note the execution time drops to ~1 second
system.time({
  layer <- download_layers(dataset_id, variables, constraints, fmt = "nc")
})
#> Selected dataset tas_baseline_2000_2020_depthsurf.
#> Dataset info available at: http://erddap.bio-oracle.org/erddap/griddap/tas_baseline_2000_2020_depthsurf.html
#> Selected 1 variables: tas_mean
#>    user  system elapsed 
#>   0.520   0.418   1.095

Created on 2024-10-18 with reprex v2.1.1

jflowernet commented 4 weeks ago

Thanks for the response. It's good to know that biooracler uses rerddap under the hood.

Running your example and some of my own, I can see that the caching is working. I probably wasn't seeing much difference between the first and second calls to the server because my rerddap calls sit inside a function that also does other time-consuming work.
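One way to make the cache effect visible in a case like this is to time each step separately instead of timing the whole function. A minimal sketch, where `time_steps()` is a hypothetical helper and the `Sys.sleep()` calls are stand-ins for the download and the other work:

```r
# Hypothetical helper: run each step and record its elapsed wall-clock time.
time_steps <- function(steps) {
  vapply(steps, function(step) system.time(step())[["elapsed"]], numeric(1))
}

timings <- time_steps(list(
  download = function() Sys.sleep(0.1),  # stand-in for the download_layers() call
  other    = function() Sys.sleep(0.3)   # stand-in for the rest of the function
))
print(timings)
```

With the real calls in place, only the `download` entry should shrink on a second run if the cache is being hit.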

Thanks again for biooracler!