IQSS / dataverse-client-r

R Client for Dataverse Repositories
https://iqss.github.io/dataverse-client-r

Adding caching to `get_file_by_id` #135

Open beniaminogreen opened 1 month ago

beniaminogreen commented 1 month ago

It would be useful if the package could cache the results of calls to the Dataverse API to disk. If the same dataset is requested twice, the second request can then be served from disk instead of being downloaded again, which would save a lot of time.

Here's a sketch of how I think the behavior could work:

Suggested new behavior for get_file_by_id:

  1. Check whether the request is for a specific version of a file. If it is, proceed to the next step. If it is for the latest version, the result cannot be cached, because the latest version might change between function invocations.
  2. Check whether caching has been disabled through some sort of environment variable. If it has, don't cache the result either.
  3. Otherwise, download the file and cache the result.
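
A minimal sketch of these three steps in base R, to make the proposal concrete. It is not the code in the branch: `cached_fetch`, the `fetch` callback (a stand-in for the real download), the `DATAVERSE_USE_CACHE` variable name, and the `":latest"` / `":draft"` labels for unpinned versions are all assumptions made for illustration.

```r
# Hypothetical helper; `fetch` is a function(file_id, version) that stands in
# for the real download call.
cached_fetch <- function(file_id, version = ":latest", fetch,
                         cache_dir = file.path(tempdir(), "dataverse-cache")) {
  # Step 1: only a pinned version is safe to cache (labels are assumed).
  pinned <- !version %in% c(":latest", ":draft")
  # Step 2: hypothetical opt-out through an environment variable.
  enabled <- !identical(
    tolower(Sys.getenv("DATAVERSE_USE_CACHE", "true")), "false"
  )
  if (!pinned || !enabled) {
    return(fetch(file_id, version))
  }
  # Step 3: download once, then serve later requests from disk.
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(cache_dir, paste0(file_id, "_", version, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch(file_id, version)
  saveRDS(result, path)
  result
}
```

With this shape, a second call for the same file ID and pinned version never reaches `fetch`, while calls for the latest version always do.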

More complex behavior could be added in the future, such as caching the latest version as well: check whether the file metadata has changed since the last download, and only re-download if there is a new version.
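
One way the metadata check could look, again only as a sketch in base R. `cached_latest`, `fetch_meta` (assumed to return a cheap fingerprint such as a checksum or version number) and `fetch` (the full download) are hypothetical names, not functions from the package.

```r
# Hypothetical helper: `fetch_meta` returns a fingerprint for the current
# file, and `fetch` performs the full download.
cached_latest <- function(file_id, fetch_meta, fetch,
                          cache_dir = file.path(tempdir(), "dataverse-cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(cache_dir, paste0(file_id, "_latest.rds"))
  current <- fetch_meta(file_id)
  if (file.exists(path)) {
    entry <- readRDS(path)
    # Reuse the cached copy only if the fingerprint is unchanged.
    if (identical(entry$fingerprint, current)) {
      return(entry$data)
    }
  }
  # First request, or the file changed upstream: download and refresh the cache.
  data <- fetch(file_id)
  saveRDS(list(fingerprint = current, data = data), path)
  data
}
```

This still costs one metadata request per call, so it only pays off when that request is much cheaper than the download itself.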

I am prototyping this in the cache branch and would love feedback on the behavior. Right now, I am caching all calls to get_file_by_id that specify a particular version of the file. If we think this is a good way to go, I will shortly add step 2, which checks whether the user wants to turn off caching.

Best, Ben

mtmorgan commented 1 day ago

Some comments on 5153bb212c316a9c650b1b86cecb9bef610691ce meant to be helpful from my past experience; hope they come across in the right spirit --

I'd also really encourage keeping the housekeeping (e.g., removing trailing whitespace in the DESCRIPTION) in a separate commit (and eventual pull request), and editing the commit history on the branch so that it does not contain extraneous steps (e.g., commit one makes whitespace changes along with other things, and commit two undoes the whitespace changes).

kuriwaki commented 12 hours ago

Much appreciated. Ben or I will consider this soon. Properly tracking the version argument has also been an open issue, and it needs documentation (with an MWE, https://github.com/IQSS/dataverse-client-r/issues/78). I imagine defaulting to "latest" will allow the default to automatically detect new versions, but we should test that together.

beniaminogreen commented 10 hours ago

Thanks for contributing. I will have a look at these suggestions and get back to you soon. @kuriwaki not sure if I mentioned this before, but I had turned on caching only if you specify an exact version of the file. I agree that what you are suggesting is a sensible default: it would be nice to be able to cache the latest version of the file on disk and only re-download it when the file is updated on the Dataverse.