Closed mtmorgan closed 1 month ago
Sorry for the late response - have been absolutely swamped recently. I think all the changes look super, thanks for tidying up the caching logic that I wrote, and for extending it to all get requests.
When I originally wrote the code I though it wasn't worth trying to add a cache argument to each API call (especially because some of the API calls are likely to be cheap), but I like this approach more. My only really substantial comments are about the decision of when to clear the cache.
I imagine that a cache with an expiration date could be frustrating to users as it would make the runtime of their scripts unpredictable. Could we consider having it never clear, and providing a function that wipes the cache at the users request? On a related note, do we know if the expiration date on cached entries will be pushed back if they are frequently used? This is to say, if I cache a large file and use it every 29 days, will I have to re-download it every other time?
These are really just quibbles over a setting though. I think the package is greatly improved by these contributions - thanks for your hard work!
Best, Ben Green
Re:
I imagine that a cache with an expiration date could be frustrating to users as it would make the runtime of their scripts unpredictable. Could we consider having it never clear, and providing a function that wipes the cache at the users request?
"never" sounds good to me. 30 days (or a year) is a decent default too, but I find myself returing to scripts that about a year old or more. Would be good to have a message in the beginning that explains this + how to cler the cache
I'll update the pull request, but perhaps not until early next week
Actually, after chatting with Ben I'd suggest we limit caching to when there is a dataset version is specified. (The default is not to specify). Say V1 is the latest version at time 1 and I cache that dataset with the default version=latest
, and at time 2 there is a new version V2. We wouldn't want the function to keep reading the cache silently.
The main
branch now has some new updates from #136 (including fixes to the actions check yaml) but it seems like I cannot synchronize someone else's fork with that easily. (I tried with rebase and push but I seem to have accidentally created a duplicate branch here instead of update this forked branch). So @mtmorgan, I'll let you synch with main if that's still doable.
I think I can force push to the branch & PR; it would likely cause havoc for anyone with uncommitted changes. Shall I do that?
I can also push a commit that changes cache behavior
get_dataset_*()
have an explicit use_cache =
argument, default to TRUE except when version = ":latest"
etc; disk cache never expires.All that sounds perfect. And yes please feel free to force push.
I've force-pushed the branch, but am still updating the caching stuff.
Sorry for the delay; I've added cache management functionality and implemented a (permanent) disk cache for explicitly versioned datasets, a session cache for other API calls, and the ability to turn off cache use entirely.
Thanks! This looks great and I've added more documentation on the version argument, as well as mention this in the vignette. I've given this a new package version number too. I'll plan on merging once tests pass.
This builds on a proposal in a #112 comment (by caching API calls, rather than faster JSON parsing) and closes #135 (
get_file_by_id()
is a particular example of an API call). It also closes #78 with additional documentation.The first three commits are made redundant with pull request #136, and would be removed before final merge.
Some comments
use_cache = FALSE
to disable the cache for that call, orSys.setenv(DATAVERSE_USE_CACHE = FALSE)
to disable use of the cache across the session.:latest
version?) for individual calls (e.g.,get_file_by_id()
could detect that the user hasn't specifieduse_cache=
, and adjust based on whatever is appropriate for that particular call).Please ensure the following before submitting a PR:
/R
not/man
and rundevtools::document()
to update documentation/tests
for any new functionality or bug fixR CMD check
runs without error before submitting the PR