IQSS / dataverse-client-r

R Client for Dataverse Repositories
https://iqss.github.io/dataverse-client-r

Issue 112/135 caching memoise api get #137

Closed: mtmorgan closed 1 month ago

mtmorgan commented 2 months ago

This builds on a proposal in a comment on #112 (caching API calls, rather than speeding up JSON parsing) and closes #135 (get_file_by_id() is a particular example of an API call). It also closes #78 with additional documentation.
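For readers following along, here is a minimal sketch of the caching approach, assuming the {memoise} and {cachem} packages and an {httr}-based GET helper; the function names are illustrative, not the package's actual internals:

```r
library(memoise)
library(httr)

# Illustrative uncached helper: perform one Dataverse API GET
.api_get <- function(url, key = Sys.getenv("DATAVERSE_KEY")) {
    resp <- GET(url, add_headers("X-Dataverse-key" = key))
    stop_for_status(resp)
    content(resp, as = "raw")
}

# Memoise it so repeated identical calls are served from a disk cache
# instead of hitting the network again
api_get <- memoise(.api_get, cache = cachem::cache_disk())
```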

The first three commits are made redundant by pull request #136 and would be removed before the final merge.


beniaminogreen commented 2 months ago

Sorry for the late response; I have been absolutely swamped recently. I think all the changes look super. Thanks for tidying up the caching logic that I wrote, and for extending it to all GET requests.

When I originally wrote the code I thought it wasn't worth trying to add a cache argument to each API call (especially because some of the API calls are likely to be cheap), but I like this approach more. My only really substantial comments concern the decision of when to clear the cache.
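As an aside, a per-call toggle could reduce to something like this hypothetical sketch; `dv_get_example()` and `use_cache` are made-up names, reusing the helpers sketched above:

```r
# Hypothetical: route a request through the memoised helper only on demand
dv_get_example <- function(url, use_cache = TRUE) {
    if (use_cache) api_get(url) else .api_get(url)
}
```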

I imagine that a cache with an expiration date could be frustrating to users, as it would make the runtime of their scripts unpredictable. Could we consider having it never clear, and providing a function that wipes the cache at the user's request? On a related note, do we know whether the expiration date on cached entries is pushed back when they are frequently used? That is to say, if I cache a large file and use it every 29 days, will I have to re-download it every other time?
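A sketch of the "never expire, clear on demand" idea, assuming {cachem}; the cache location and `cache_clear()` are illustrative names, not the package's API:

```r
# Persistent disk cache with no max_age, so entries never expire on their own
.cache <- cachem::cache_disk(tools::R_user_dir("dataverse", "cache"))

# Wipe the cache only when the user explicitly asks for it
cache_clear <- function() .cache$reset()
```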

These are really just quibbles over a setting, though. I think the package is greatly improved by these contributions; thanks for your hard work!

Best, Ben Green

kuriwaki commented 1 month ago

Re:

> I imagine that a cache with an expiration date could be frustrating to users as it would make the runtime of their scripts unpredictable. Could we consider having it never clear, and providing a function that wipes the cache at the user's request?

"never" sounds good to me. 30 days (or a year) is a decent default too, but I find myself returing to scripts that about a year old or more. Would be good to have a message in the beginning that explains this + how to cler the cache

mtmorgan commented 1 month ago

I'll update the pull request, but perhaps not until early next week.

kuriwaki commented 1 month ago

Actually, after chatting with Ben, I'd suggest we limit caching to cases where a dataset version is specified. (The default is not to specify one.) Say V1 is the latest version at time 1 and I cache that dataset with the default version = latest; if at time 2 there is a new version V2, we wouldn't want the function to keep silently reading the stale cache.
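In code, that rule reduces to a predicate along these lines; `cache_eligible()` is an illustrative name, and ":latest" is the Dataverse API's identifier for the newest version:

```r
# Only explicitly versioned requests are safe to cache: ":latest" can change
# on the server, so it must always be re-fetched
cache_eligible <- function(version = ":latest") {
    !identical(version, ":latest")
}

cache_eligible("1.0")   # TRUE:  version 1.0 is immutable
cache_eligible()        # FALSE: the latest version may move on to V2
```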

kuriwaki commented 1 month ago

The main branch now has some new updates from #136 (including fixes to the Actions check YAML), but it seems I cannot easily synchronize someone else's fork with it. (I tried rebasing and pushing, but I seem to have accidentally created a duplicate branch here instead of updating this forked branch.) So @mtmorgan, I'll let you sync with main if that's still doable.

mtmorgan commented 1 month ago

I think I can force-push to the branch & PR; it would likely cause havoc for anyone with unpushed local changes. Shall I do that?

I can also push a commit that changes the cache behavior.

kuriwaki commented 1 month ago

All that sounds perfect. And yes please feel free to force push.

mtmorgan commented 1 month ago

I've force-pushed the branch, but am still updating the caching stuff.

mtmorgan commented 1 month ago

Sorry for the delay; I've added cache management functionality and implemented a (permanent) disk cache for explicitly versioned datasets, a session cache for other API calls, and the ability to turn off cache use entirely.
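A sketch of how those three modes might fit together, again assuming {cachem}; the helper names are illustrative rather than the committed implementation:

```r
.disk_cache    <- cachem::cache_disk(tools::R_user_dir("dataverse", "cache"))
.session_cache <- cachem::cache_mem()   # dropped when the R session ends

# Pick a cache for a request: permanent for explicitly versioned datasets,
# per-session otherwise, or none at all when caching is turned off
choose_cache <- function(versioned, use_cache = TRUE) {
    if (!use_cache) return(NULL)
    if (versioned) .disk_cache else .session_cache
}
```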

kuriwaki commented 1 month ago

Thanks! This looks great. I've added more documentation on the version argument and mentioned it in the vignette, and I've given this a new package version number too. I'll plan on merging once the tests pass.