maelle opened this issue 4 years ago
the registries vignette is broken. I'd recommend creating a pkgdown website :wink:
Somewhere at the beginning, underline the fact that the content could be any file, e.g. CSV but also image, video, shapefile, etc., since the hashing can work on any of them?
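For instance (a minimal sketch; the file names below are just placeholders), `content_id()` should work the same on any of these:

```r
library(contentid)

# content_id() hashes the raw bytes of a file, so the format doesn't matter.
# These paths are placeholders for whatever files you have locally.
content_id("observations.csv")
content_id("satellite-image.tif")
content_id("boundaries.shp")
```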
can the persistent copy be in a project folder (under version control for instance) rather than the local cache?
the local caching of something from a URL makes me think of another WIP package, https://github.com/ropenscilabs/webmiddens/
End of my comments for today!
Thanks so much for this @maelle :tada: :100: It's really helpful to see the kind of questions that arise for you when reading this, and they are all great.
One thing I'm struggling with is the balance between keeping things simple and concise but not making them seem opaque -- I think your suggestion of adding usethis-like interactive messaging might help somewhat, but I would also love any suggestions you have about which of these things should be addressed in a README, and which are not so essential and could be put off into a separate vignette (none of which have been written yet, hence those broken links!).
Quick comment on app dir & caching, which I'd love your feedback on. Yes, we use the standard app dir (via `rappdirs`) by default, which can be overridden by setting the environment variable `CONTENTID_HOME`. The helper function `content_dir()` returns the default location (i.e. the rappdirs location unless the env variable has been set), and is passed to all functions by default. So you can: set `CONTENTID_HOME` in the project or system's `.Renviron` file, or `register("private_data.tsv", registries = ".")`.
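As a minimal sketch of those two project-local options (the `"data"` path and file name are placeholders):

```r
library(contentid)

# Option 1: point the default registry/cache at a project folder
# (equivalently, put CONTENTID_HOME=data in the project's .Renviron)
Sys.setenv(CONTENTID_HOME = "data")
content_dir()  # now returns "data" instead of the rappdirs location

# Option 2: keep the default location and pass a project-local registry explicitly
register("private_data.tsv", registries = ".")
```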
By default, `registries = list(content_dir(), "https://hash-archive.org")` -- i.e. register locally, using the default location, and register on http://hash-archive.org. So the above command says to register only locally (because we omitted hash-archive.org), and using a registry in the current working directory, `"."`, instead of the default. (Not sure why you would want to, but you can technically register to multiple local dirs too: `register("data.tsv", registries = c("/path/A", "/path/B", "https://hash-archive.org"))`.)
Currently, the `content_dir()` location is used for both the registry and the optional file cache. Good question about `webmiddens` and versioning -- I probably need to point out that the local cache is "content-based": literally, the store puts an object with id `hash://sha256/efab3a7gh...` in the location `ef/ab/efab3a7gh...`. That means that a new version goes to a new location, because it has a new hash. We're only caching actual data files, not caching web requests, because `resolve()` isn't necessarily making any web requests to begin with, so we don't care at all about stuff like `expiry-time`, `cache-control`, `eTag`, etc. -- by using the sha256 hash, `resolve()` knows with cryptographic surety that it is getting the requested content.
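Roughly, a content-based layout looks like this (an illustrative sketch only, not the package's actual internals):

```r
# The storage path is derived from the hash itself, so a new version
# (new hash) automatically lands in a new location.
hash_to_path <- function(id) {
  hash <- sub("^hash://sha256/", "", id)                     # strip the URI prefix
  file.path(substr(hash, 1, 2), substr(hash, 3, 4), hash)    # ef/ab/<full hash>
}

hash_to_path("hash://sha256/efab3a7gh...")
#> "ef/ab/efab3a7gh..."
```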
I'm not sure if this belongs in the README, but it might be worth spelling out that `resolve()` isn't like `httr::GET`, `download.file`, `webmiddens`, or `pins::pin()`, in that it doesn't operate on URLs: you cannot `resolve("http://example.com")`; you have to give it a `hash://` id. Starting from a URL is technically `register() %>% resolve()`. In fact, if you do:
register("https://example.com", registries = "https://hash-archive.org") %>%
resolve(store=TRUE)
you will then have the behavior of `webmiddens` or `pins`, but with cryptographically solid (though much slower!) cache control. The above code would cause hash-archive.org to download the URL content and hash it, and give you the identifier. `resolve()` would then get the URL back from hash-archive.org the first time you run this, leading to a second download and hash, but would in future be able to load the version from the local store. Unlike the example I show in the README, this would get the "latest" version at the URL every time, because the id just gets piped through. It could still provide some benefit of a local cache though, since hash-archive.org may be much faster at downloading and hashing than your local machine, but it will never be as fast as checking the eTag (assuming an eTag exists). We expose this pattern as `contentid::pin()` (a hat-tip that it is a drop-in replacement for `pins::pin()`).
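A guess at the drop-in usage, assuming `contentid::pin()` accepts a URL the way `pins::pin()` does (the URL below is purely illustrative):

```r
library(contentid)

# Roughly register(url) %>% resolve(store = TRUE) in one call:
# returns a local path to the cached content, verified against its
# sha256 hash rather than eTag/cache-control headers.
path <- pin("https://example.com/data.csv")
```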
A diagram, even if not pretty, would be great.
Where does the local cache live? In an app dir? (I guess so, and wonder whether it should be made clearer)
Regarding `register()`, since it's meant to be run interactively, I wonder whether it should behave like usethis' functions, with cli messages: e.g. check-mark "registered dataset", check-mark "registered with Hash Archive", todo "save the hash and use it in e.g. `resolve()`".
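For example, a rough sketch of what such messages could look like (using the cli package; the wording and file name are invented):

```r
# Hypothetical messaging inside register(), in the spirit of usethis:
cli::cli_alert_success("Registered {.file private_data.tsv} in the local registry")
cli::cli_alert_success("Registered with Hash Archive (hash-archive.org)")
cli::cli_alert_info("Save the returned hash and pass it to {.fun resolve} later")
```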
should `query_sources()` have some custom printing, e.g. "A registered content with hash and X sources"?
should `query_sources()` have a verbose mode where it says what it's querying, like what we have in dev codemetar? e.g. "querying hash archive... querying local cache at... querying Software Heritage"
if the sources live in an app dir, it means that if you move your Rproj somewhere else, you lose the sources -- or can the sources be cached in a project directory too?
" This is because in addition to maintaining a local registry of sources, contentid registers online sources with the Hash Archive, hash-archive.org. " very cool but should a warning be added that it means you shouldn't e.g. use the raw URL to a GitHub private repo because that thing contains a token?
to be continued...