Issue 112/135 caching memoise api get

mtmorgan commented 2 months ago

This builds on a proposal in a #112 comment (by caching API calls, rather than faster JSON parsing) and closes #135 (get_file_by_id() is a particular example of an API call). It also closes #78 with additional documentation.

The first three commits are made redundant by pull request #136 and would be removed before the final merge.

Some comments

caching is enabled by default. Every function that makes a GET call takes a use_cache argument: pass use_cache = FALSE to disable the cache for that call, or run Sys.setenv(DATAVERSE_USE_CACHE = FALSE) to disable the cache for the whole session.
enabling caching by default seems pretty reasonable, since the API and most datasets, etc., only change during the short period when they are 'under development'.
one could implement more complicated logic (e.g., don't use caching for :latest version?) for individual calls (e.g., get_file_by_id() could detect that the user hasn't specified use_cache=, and adjust based on whatever is appropriate for that particular call).
cached elements are stored on-disk for 30 days, so they are not permanent.
perhaps some effort might be invested in more explicit cache management, e.g., providing the user with a transparent way to flush the cache.
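
As a rough illustration of the per-call logic discussed above, here is a minimal base-R sketch of how a GET wrapper might decide whether to use the cache. The use_cache argument and the DATAVERSE_USE_CACHE variable come from the comments above; the helper name resolve_use_cache(), the NULL default, and the ":latest" rule are assumptions made for illustration, not the code in this PR.

```r
# Hypothetical helper: decide whether a GET call should use the cache.
# `use_cache` and DATAVERSE_USE_CACHE are named in the PR description; the
# function name, the NULL default, and the ":latest" rule are assumptions.
resolve_use_cache <- function(use_cache = NULL, version = NULL) {
  # An explicit argument from the caller always wins.
  if (!is.null(use_cache)) {
    return(isTRUE(as.logical(use_cache)))
  }
  # Otherwise consult the session-wide environment variable.
  env <- as.logical(Sys.getenv("DATAVERSE_USE_CACHE", unset = "TRUE"))
  if (is.na(env)) env <- TRUE  # unparseable value: fall back to the default
  if (!env) return(FALSE)
  # Per-call heuristic: a ":latest" version may change, so skip the cache.
  !identical(version, ":latest")
}

resolve_use_cache(use_cache = FALSE)    # cache disabled for this one call
resolve_use_cache(version = ":latest")  # cache skipped by the heuristic
```

Using NULL as the "not specified" marker keeps the decision in one place; a real implementation could equally use missing() inside each function.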

Please ensure the following before submitting a PR:

[x] if suggesting code changes or improvements, open an issue first
[x] for all but trivial changes (e.g., typo fixes), add your name to DESCRIPTION
[x] for all but trivial changes (e.g., typo fixes), document your change in NEWS.md with a parenthetical reference to the issue number being addressed
[x] if changing documentation, edit files in /R not /man and run devtools::document() to update documentation
[x] add code or new test files to /tests for any new functionality or bug fix
[x] make sure R CMD check runs without error before submitting the PR

beniaminogreen commented 2 months ago

Sorry for the late response - I have been absolutely swamped recently. I think all the changes look super; thanks for tidying up the caching logic that I wrote, and for extending it to all GET requests.

When I originally wrote the code I thought it wasn't worth trying to add a cache argument to each API call (especially because some of the API calls are likely to be cheap), but I like this approach more. My only really substantial comments are about the decision of when to clear the cache.

I imagine that a cache with an expiration date could be frustrating to users, as it would make the runtime of their scripts unpredictable. Could we consider having it never clear, and providing a function that wipes the cache at the user's request? On a related note, do we know if the expiration date on cached entries will be pushed back if they are frequently used? That is to say, if I cache a large file and use it every 29 days, will I have to re-download it every other time?
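
To make the expiry question concrete, here is a toy base-R disk cache in which an entry's age is measured from the time it was written, so reads do not push the expiration back. Every name in it (cache_dir, cached_get(), clear_cache(), the now argument) is hypothetical, and it is not the caching backend used in this PR; whether that backend resets the clock on each read is exactly the open question.

```r
# Toy on-disk cache, written only to illustrate the expiry question above.
# All names are hypothetical; this is not the package's implementation.
cache_dir <- file.path(tempdir(), "toy-dataverse-cache")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# `key` is assumed to be safe to use as a file name.
cache_path <- function(key) file.path(cache_dir, paste0(key, ".rds"))

# Return the cached value for `key`, recomputing when the entry is missing
# or was written more than `max_age` seconds before `now`.
cached_get <- function(key, compute,
                       max_age = 30 * 24 * 60 * 60,
                       now = Sys.time()) {
  path <- cache_path(key)
  if (file.exists(path)) {
    age <- as.numeric(difftime(now, file.mtime(path), units = "secs"))
    # Reading leaves the modification time alone, so frequent hits do not
    # extend the lifetime of the entry.
    if (age <= max_age) return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

# Explicit flush, along the lines of the user-facing wipe function
# proposed above.
clear_cache <- function() {
  unlink(list.files(cache_dir, full.names = TRUE))
  invisible(TRUE)
}
```

Under this write-time rule, a file fetched on day 0 is still served from the cache on day 29 but is fetched again on day 58, which is the "every other time" pattern described above. Setting max_age = Inf together with clear_cache() gives the never-expire behaviour suggested instead.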

These are really just quibbles over a setting though. I think the package is greatly improved by these contributions - thanks for your hard work!

Best, Ben Green