Open beniaminogreen opened 1 month ago
Some comments on 5153bb212c316a9c650b1b86cecb9bef610691ce meant to be helpful from my past experience; hope they come across in the right spirit --
tools::R_user_dir()
instead of rappdirs to avoid the need for an additional package installationget_file_by_id()
/ get_file_by_id_memoized()
so that, for instance with auto completion during code development, it's clear that there are different versions of get_file_by_id*()
.minimize the 'fancy stuff' that goes on in .onLoad()
to be just, e.g., get_file_by_id_meomoized <<- memoised(get_file_by_id_non_memoised)
. Put everything else in R/get_file_by_id.R where some future developer is most likely to see the code, and where standard tooling (e.g., code coverage, use of variables) will definitely work
#' @export
get_file_by_id <- # the code to execute cached-or-not
## comment indicating what's going on here
get_file_by_id_non_memoized <- # current get_file_by_id
get_file_by_id_memoized <- NULL
get_file_by_id()
seen by the user now just has arguments ...
(see the man page diff), but one really needs to go to the effort of exposing named arguments to users.get_file_by_id
; this argument could be added to the function definition (after ...
?), which would be informative to the user and simplify the code (no need for names(list(...))
just if (identical(version, ...)) ...
or whatever; also one can easily validate that version
is an appropriate value when invoked by the user). But actually it seems like a specific value of version
would be required ("latest"
?) and maybe it is better to have an argument like use_cache = TRUE
instead. I would suggest putting use_cache = TRUE
after the ...
in the function definition, so that the user is given a hint that this is somehow a control argument rather than an argument to get_file_by_id
per se. A reasonable default might be Sys.getenv("DATAVERSE_USE_CACHE", TRUE)
allowing an environment variable override.I'd also really encourage keeping the housekeeping (e.g., removing trailing whitespace in the DESCRIPTION) to a separate commit (and eventual pull request), and editing the commit history on the branch so that it does not contain extraneous (commit one does whitespace changes and other things, commit two undoes the whitespace changes) steps.
Much appreciated. Ben or I will consider soon. Properly tracking the version argument has also been an open issue and it needs documentation (with mwe, https://github.com/IQSS/dataverse-client-r/issues/78). I imagine defaulting to "latest" will allow the default to automatically detect new versions, but we should test together.
Thanks for contributing. I will have a look at these suggestions and get back to you soon. @kuriwaki not sure if I mentioned this before, but I had turned on caching only if you specified an exact version of the file. I agree that what you are suggesting is a sensible default - it would be nice to be able to cache the lastest version of the file on disk and only re-download when the file is updated on the Dataverse.
It would be useful if the package were able to cache results of calls to the DataVerse API to disk. If the same dataset is requested twice, the result can then be served from the disk instead of re-downloading which would save a lot of time.
Here's a sketch of how I think the behavior could work:
Suggested new behavior for
get_file_by_id
:More complex behavior could be added on in the future such as caching the latest version by checking if the file metadata has changed since the last download, and only re-downloading if there is a new version.
I am prototyping the behavior in the cache branch, and would love feedback on the behavior. Right now, I am caching all calls to
get_file_id
which have a specific version of the file specified. I will shortly add step 2 which checks if the user wants to turn off caching if we think this is a good way to go.Best, Ben