cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

Basic implementations of core verbs #9

Closed cboettig closed 4 years ago

cboettig commented 4 years ago

@jhpoelen curious what you think of this draft.

I've created basic implementations of the four verbs we've discussed: store, retrieve, query, and register. Most uses would just need store and retrieve.

The examples below should all run if you install the branch, e.g. remotes::install_github("cboettig/contenturi@names")

register

register("http://cdiac.ornl.gov/ftp/trends/co2/vostok.icecore.co2")
  1. Does this register to both remote and local registries by default? (Yes)
  2. Does register() only work with URIs? (Yes) (i.e., store can add entries to the local registry that point to the local store / cache, but users can't register local paths manually)
  3. Does register return all registered metadata (source, identifier, date, type) in a data.frame (like query) or just the content identifier? (Currently it's just the content identifier.)
  4. Do we expose the sub-routines, register_local() and register_remote()?

query

query("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
query("http://cdiac.ornl.gov/ftp/trends/co2/vostok.icecore.co2")
  1. is the return data.frame structure okay? (data from all registries listed, fields are identifier, source, date, type)
  2. Currently echoes the interface used by register() to support multiple registries, see below.
  3. Like register, the subroutines query_local() and query_remote() are also exposed.

store

store("http://cdiac.ornl.gov/ftp/trends/co2/vostok.icecore.co2")
f <- system.file("extdata", "vostok.icecore.co2", package="contenturi", mustWork = TRUE)
store(f)
  1. Does store("http://cdiac.ornl.gov/ftp/trends/co2/vostok.icecore.co2") register both the URL and the local storage location? (Yes)
  2. Should store("http://cdiac.ornl.gov/ftp/trends/co2/vostok.icecore.co2") have an option to also call to the remote registry (when storing a URL)? (No)
  3. Does store allow for the user to indicate which (local) registry &/or store to use? What does the interface for this look like? (Currently: dir = app_dir() -- you can set the directory used for store+registry, but can't mix & match them)

retrieve

I think retrieve may need the most input. It's essentially a thin wrapper around query that returns a single location instead of a data.frame of all matching identifiers / URLs. It lets the user specify prefer = c("local", "remote"), indicating whether local storage entries should be tried before or after URLs. If there are multiple entries, it tries the most recent first. prefer can be given in either order, or can insist on remote or local only. (Perhaps this should be more general, allowing preferences among different repositories, but I'm not sure; that could complicate the user interface too much.)
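To make the ordering rule concrete, here is a minimal base-R sketch of how a query()-style data.frame could be ranked by prefer and then by date. It is not the package's code: order_sources() is a hypothetical helper, the local paths are made up, and treating any source with a URL scheme as "remote" is my assumption.

```r
# Hypothetical helper (not part of the package): order the rows of a
# query()-style data.frame according to `prefer`, newest entries first.
order_sources <- function(df, prefer = c("local", "remote")) {
  # Assumption: a source starting with a URL scheme is remote,
  # anything else is treated as a local path.
  kind <- ifelse(grepl("^(https?|ftp)://", df$source), "remote", "local")
  # Drop kinds the caller did not ask for (e.g. prefer = "remote")
  keep <- kind %in% prefer
  df <- df[keep, , drop = FALSE]
  kind <- kind[keep]
  # Rank by position in `prefer`, then by most recent date
  rank <- match(kind, prefer)
  df[order(rank, -as.numeric(df$date)), , drop = FALSE]
}

# Made-up query result: one URL and two hypothetical local copies
hits <- data.frame(
  identifier = "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37",
  source = c("http://cdiac.ornl.gov/ftp/trends/co2/vostok.icecore.co2",
             "/tmp/store/old-copy",
             "/tmp/store/new-copy"),
  date = as.POSIXct(c("2020-02-01", "2020-01-01", "2020-03-01"), tz = "UTC"),
  stringsAsFactors = FALSE
)

local_first <- order_sources(hits)                      # local before remote
remote_only <- order_sources(hits, prefer = "remote")   # insist on remote
```

With this ranking, a retrieve-like function would simply take the first row that can actually be read.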

For URL content, it will download the object and by default (verify = TRUE) validate that the downloaded object matches the registered hash. If not, it will proceed to try the next entry (though currently without any messages or warnings). Currently, for speed, verify_local = FALSE, meaning locally stored content is not validated (since it is stored by hash).

retrieve("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
retrieve("http://cdiac.ornl.gov/ftp/trends/co2/vostok.icecore.co2")
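The verify-then-fall-through behaviour described above can be sketched roughly as follows. This is only an illustration under assumptions: first_verified() is a hypothetical helper, the temp files stand in for downloaded copies, and base R's tools::md5sum is used as a stand-in hasher rather than the sha256 the package registers. Unlike the current draft, the sketch emits a warning for each skipped entry.

```r
# Hypothetical helper: return the first candidate file whose hash matches
# `expected`, warning about each candidate that is skipped.
# `hash_fn` is any function mapping a file path to a hash string.
first_verified <- function(paths, expected, hash_fn) {
  for (p in paths) {
    if (file.exists(p) && identical(hash_fn(p), expected)) {
      return(p)
    }
    warning("skipping ", p, ": missing or hash mismatch", call. = FALSE)
  }
  NA_character_  # nothing matched
}

# Stand-in hasher from base R; NOT the sha256 used for hash:// identifiers.
md5 <- function(path) unname(tools::md5sum(path))

# Two temp files standing in for downloaded copies of the same content
good <- tempfile(); writeLines("co2 data", good)
bad  <- tempfile(); writeLines("corrupted", bad)
expected <- md5(good)

# The corrupted copy is tried first, fails verification, and is skipped
found <- suppressWarnings(first_verified(c(bad, good), expected, md5))
```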

multiple registries model:

retrieve, register, and query take an argument called registries. This is a list of known registries, with defaults provided by default_registries(), which can also be set as a comma-separated list with the env var CONTENTURI_REGISTRIES. Local registries are specified merely by a (writeable) local path, remote registries by a URL (currently https://hash-archive.org is the only URL recognized). I'm not sure why you'd want to record in multiple local registries at once, but it's possible under this model. Note that store does not take registries as an argument, but instead takes a single path to a local dir (since store has no operations for hash-archive.org remotes). Feedback welcome.
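As a rough illustration of the registries model, the sketch below reads a comma-separated list from CONTENTURI_REGISTRIES and separates local paths from remote URLs. parse_registries() and is_remote() are hypothetical names, the fallback value and /tmp/registry path are invented for the example, and the rule "starts with http(s):// means remote" is an assumption rather than the actual logic of default_registries().

```r
# Hypothetical helper: read registries from an environment variable,
# falling back to a default vector when the variable is unset or empty.
parse_registries <- function(env = "CONTENTURI_REGISTRIES",
                             fallback = c(tempdir(), "https://hash-archive.org")) {
  raw <- Sys.getenv(env, unset = "")
  if (!nzchar(raw)) return(fallback)
  # Split on commas, trim whitespace, and drop empty entries
  regs <- trimws(strsplit(raw, ",", fixed = TRUE)[[1]])
  regs[nzchar(regs)]
}

# Assumption: anything starting with http(s):// is a remote registry,
# everything else is a local directory path.
is_remote <- function(registries) grepl("^https?://", registries)

Sys.setenv(CONTENTURI_REGISTRIES = "/tmp/registry, https://hash-archive.org")
regs <- parse_registries()
by_kind <- split(regs, ifelse(is_remote(regs), "remote", "local"))
Sys.unsetenv("CONTENTURI_REGISTRIES")
```

A function taking registries could then dispatch each entry to the local or remote subroutine depending on which group it falls in.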

Not implemented

jhpoelen commented 4 years ago

@cboettig thanks for sharing - Very cool to see the implementation coming along and to see that the store/registry abstractions work nicely.

Initially, I was trying to comment on the individual commits, but GitHub gave HTTP 500 codes ;(. Before going into details on the specific implementation, I am merging the changes into main and enabling Travis to check that the tests work on a non-developer system. Then, I'll comment more specifically on the implementation / abstractions etc.