cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

Other README comments #48

Open maelle opened 4 years ago

maelle commented 4 years ago

to be continued...

maelle commented 4 years ago

End of my comments for today!

cboettig commented 4 years ago

Thanks so much for this @maelle :tada: :100: It's really helpful to see the kind of questions that arise for you when reading this, and they are all great.

One thing I'm struggling with is the balance between keeping things simple and concise and not making them seem opaque. I think your suggestion of adding usethis-like interactive messaging might help somewhat, but I would also love any suggestions you have about which of these things should be addressed in the README, and which are less essential and could be moved to a separate vignette (none of which have been written yet, hence those broken links!).

Quick comment on app dir & caching, which I'd love your feedback on. Yes, we use the standard app dir (via rappdirs) by default, which can be overridden by setting the environment variable CONTENTID_HOME. The helper function content_dir() returns the default location (i.e. the rappdirs location unless the env variable has been set), and is passed to all functions by default. So you can:

register("private_data.tsv", registries = ".")

By default, registries = list(content_dir(), "https://hash-archive.org") -- i.e. register locally, using the default location, and register on http://hash-archive.org. So the above command says to register only locally (because we omitted hash-archive.org), using a registry in the current working directory, ".", instead of the default. (Not sure why you would want to, but you can technically register to multiple local dirs too: register("data.tsv", registries = c("/path/A", "/path/B", "https://hash-archive.org")).)
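To make the default-location lookup concrete, here is a minimal base-R sketch of the rule described above (use CONTENTID_HOME if it is set, otherwise fall back to the standard app dir). The function name default_content_dir() is hypothetical, and tools::R_user_dir() is only a stand-in for the rappdirs location; this is not the package's actual source.

```r
# Hypothetical helper illustrating the lookup rule; NOT contentid's own code.
default_content_dir <- function() {
  # Assumption: tools::R_user_dir() (R >= 4.0) stands in for the rappdirs location.
  fallback <- tools::R_user_dir("contentid", which = "data")
  # The environment variable wins whenever it is set.
  Sys.getenv("CONTENTID_HOME", unset = fallback)
}

Sys.setenv(CONTENTID_HOME = file.path(tempdir(), "my-registry"))
default_content_dir()   # the overridden location

Sys.unsetenv("CONTENTID_HOME")
default_content_dir()   # back to the standard app dir
```

Because every function takes this location as a default argument, setting the variable once (for example in .Renviron) would redirect the registry for a whole session without changing any calls.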

Currently, the content_dir() location is used for both the registry and the optional file cache. Good question about webmiddens and versioning -- I probably need to point out that the local cache is "content-based": literally, the store puts an object with id hash://sha256/efab3a7gh.... in the location ef/ab/efab3a7gh.... That means that a new version goes to a new location, because it has a new hash. We're only caching actual data files, not web requests, because resolve() isn't necessarily making any web requests to begin with. So we don't care at all about things like expiry time, cache-control, ETag, etc. By using the sha256 hash, resolve() knows with cryptographic certainty that it is getting the requested content.
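As a rough illustration of that layout, the sketch below maps a hash:// identifier to a relative path of the form ef/ab/efab.... The helper name content_path() is hypothetical and only mirrors the scheme described in this comment; the package's real store may arrange things differently. The two identifiers are the well-known sha256 digests of zero-byte content and of the three bytes "abc".

```r
# Hypothetical helper: derive the relative store location from a hash:// id.
content_path <- function(id) {
  # Only accept well-formed sha256 identifiers (64 lowercase hex characters).
  stopifnot(grepl("^hash://sha256/[0-9a-f]{64}$", id))
  hash <- sub("^hash://sha256/", "", id)
  # First two characters, next two characters, then the full hash.
  file.path(substr(hash, 1, 2), substr(hash, 3, 4), hash)
}

id_empty <- "hash://sha256/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
id_abc   <- "hash://sha256/ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

content_path(id_empty)  # "e3/b0/e3b0c442...": one location per distinct content
content_path(id_abc)    # "ba/78/ba7816bf...": different content, different location
```

Since the path is a pure function of the hash, two versions of a file can never overwrite each other, which is why expiry-style cache invalidation is unnecessary here.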

I'm not sure if this belongs in the README, but it might be worth spelling out that resolve() isn't like httr::GET, download.file, webmiddens or pins::pin() in that it doesn't operate on URLs: you cannot resolve("http://example.com"), you have to give it a hash:// id. Starting from a URL is technically register() %>% resolve(). In fact, if you do:

register("https://example.com", registries = "https://hash-archive.org") %>%
 resolve(store=TRUE)

You will then have the behavior of webmiddens or pins, but with cryptographically solid (but much slower!) cache-control. The above code would cause hash-archive.org to download the URL content, hash it, and give you the identifier. The first time you run this, resolve() would then get the URL back from hash-archive.org, leading to a second download and hash, but in the future it would be able to load the version from the local store. Unlike the example I show in the README, this would get the "latest" version at the URL every time, because the id just gets piped through. It could still provide some benefit of a local cache though, since hash-archive.org may be much faster at downloading and hashing than your local machine, but it will never be as fast as checking the ETag (assuming an ETag exists). We expose this pattern as contentid::pin() (a hat-tip that it is a drop-in replacement for pins::pin()).
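The difference between piping the id through and holding on to a fixed id can be seen in a toy, fully offline model. Everything below is a stand-in written for illustration: toy_register() and toy_resolve() are hypothetical and are not the contentid functions, a local temp file plays the role of the URL, and md5 (via base R's tools::md5sum) replaces sha256 only to avoid extra dependencies.

```r
# Toy model only; NOT the contentid API. md5 stands in for sha256.
store_dir <- file.path(tempdir(), "toy-store")
dir.create(store_dir, showWarnings = FALSE)
sources <- new.env()   # toy registry: id -> location that served it

toy_id <- function(path) paste0("hash://md5/", unname(tools::md5sum(path)))

toy_register <- function(path) {
  id <- toy_id(path)
  assign(id, path, envir = sources)
  id
}

toy_resolve <- function(id, store = FALSE) {
  cached <- file.path(store_dir, sub("^hash://md5/", "", id))
  if (file.exists(cached)) return(cached)          # local store hit
  src <- get(id, envir = sources)
  # Verify that the source still serves exactly the requested content.
  if (!identical(toy_id(src), id)) stop("no source currently serves ", id)
  if (store) { file.copy(src, cached); return(cached) }
  src
}

src <- tempfile(fileext = ".txt")                  # plays the role of the URL
writeLines("version 1", src)
id_v1 <- toy_register(src)
invisible(toy_resolve(id_v1, store = TRUE))        # version 1 enters the store

writeLines("version 2", src)                       # content at the "URL" changes
latest <- toy_resolve(toy_register(src), store = TRUE)  # piped: follows the source
readLines(latest)                                  # "version 2"
readLines(toy_resolve(id_v1))                      # "version 1", served from the store
```

The piped form re-registers on every call and therefore tracks whatever the source currently serves, while the saved id keeps returning the original bytes for as long as the store (or any registered source) still holds them.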