Closed cboettig closed 4 years ago
@cboettig thanks for sharing - Very cool to see the implementation coming along and to see that the store/registry abstractions work nicely.
Initially, I was trying to comment on the individual commits, but GitHub returned HTTP 500 errors ;(. Before going into details on the specific implementation, I am merging the changes into main and enabling Travis to check that the tests work on a non-developer system. Then I'll comment more specifically on the implementation, abstractions, etc.
@jhpoelen curious what you think of this draft.
I've created basic implementations of the four verbs we've discussed: `store`, `retrieve`, `query`, and `register`. Most uses would just need `store` and `retrieve`.

The examples below should all run if you install the branch, e.g. `remotes::install_github("cboettig/contenturi@names")`
register
- Should `register()` only work with URIs? (Yes) (i.e. `store` can add entries to the local registry that point to the local store / cache, but users can't register local paths manually.)
- Return a `data.frame` (like `query`) or just the content identifier? (Currently it's just the content identifier.)
- `register_local()` and `register_remote()`?

query
- Returns a `data.frame` with columns (`identifier`, `source`, `date`, `type`)
- `register()` to support multiple registries, see below.
- Like `register`, the subroutines `query_local()` and `query_remote()` are also exposed.

store
- Should `store("http://cdiac.ornl.gov/ftp/trends/co2/vostok.icecore.co2")` register both the URL and the local storage location? (Yes)
- Should `store("http://cdiac.ornl.gov/ftp/trends/co2/vostok.icecore.co2")` have an option to also call the remote registry (when storing a URL)? (No)
- Should `store` allow the user to indicate which (local) registry and/or store to use? What does the interface for this look like? (Currently: `dir = app_dir()`; you can set the directory used for store + registry, but can't mix and match them.)

retrieve
I think `retrieve` may need the most input. It's essentially a thin wrapper around `query` that returns a single location instead of a `data.frame` of all matching identifiers / URLs. It lets the user specify `prefer = c("local", "remote")`, indicating whether local storage entries should be tried before or after URLs. If there are multiple entries, it goes by most recent first. `prefer` can be given in a different order, or can insist on `remote` or `local` only. (Perhaps this should be more general, allowing preferences among different repositories, but I'm not sure; that could complicate the user interface too much?)

For URL content, it will download the object and by default (`verify = TRUE`) validate that the downloaded object matches the registered hash. If not, it will proceed to try the next entry (though currently without any message / warning). Currently, for speed, `verify_local = FALSE`, meaning locally stored content is not validated (since it is stored by hash).

multiple registries model:
`retrieve`, `register`, and `query` take an argument called `registries`. This is a list of known registries, with defaults provided by `default_registries()`, which can also be set as a `,`-separated list with the env var `CONTENTURI_REGISTRIES`. Local registries are specified merely by a (writeable) local path, remote registries by a URL (currently `https://hash-archive.org` is the only URL recognized). I'm not sure why you'd want to record in multiple local registries at once, but it's possible under this model. Note that `store` does not take `registries` as an argument, but instead takes a single path to a local `dir` (since `store` has no operations for `hash-archive.org` remotes). Feedback welcome.

Not implemented
Provenance metadata and higher-level, metadata / prov-based `store` and `retrieve` operators. The current model is obviously so thin on metadata / provenance as to be nearly useless by itself, because we have little idea what content we are getting. Clearly the idea is that the user would also generate a richer metadata / prov file accompanying every stored 'data' content object, and would then use a query system that queries against this rich prov metadata to determine which content is desired. (This metadata could also be adjusted to the needs of different scientific repos, e.g. DataONE vs Zenodo vs a custom application, etc.) This functionality may fall outside the scope of this package, though, and be better implemented in separate (domain- / application-specific?) packages?

Abstractions for storage and registry. Currently the registry is a `registry.tsv.gz` file in `app_dir()` and the `store` is files named by hash using subdirs. It would be nice to have better abstractions for these that allow any pluggable key-value store method for the registry (and store). (See the `storr` package.)
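
To make the current model concrete (a `registry.tsv.gz` file plus files named by hash in subdirectories, with `retrieve` trying entries newest-first and skipping ones that fail verification), here is a minimal base-R sketch. Everything in it is an assumption for illustration, not the package's code: the function names (`store_sketch()`, `register_sketch()`, `query_sketch()`, `retrieve_sketch()`), the `data/` subdirectory, the two-level directory split, the omission of the `type` column, and the use of md5 via `tools::md5sum()` as a stand-in for whatever hash the package actually uses.

```r
# Illustrative sketch only: names, layout and hash choice are assumptions.

# md5 from base R's 'tools' package stands in for the real content hash
hash_file <- function(path) unname(tools::md5sum(path))

# Return registry rows (optionally only those matching `id`) as a data.frame
query_sketch <- function(id = NULL, dir) {
  registry <- file.path(dir, "registry.tsv.gz")
  if (!file.exists(registry)) {
    return(data.frame(identifier = character(), source = character(),
                      date = character(), stringsAsFactors = FALSE))
  }
  df <- read.delim(gzfile(registry), colClasses = "character")
  if (is.null(id)) df else df[df$identifier == id, , drop = FALSE]
}

# Append one (identifier, source, date) row to the gzipped TSV registry
register_sketch <- function(id, source, dir) {
  entry <- data.frame(identifier = id, source = source,
                      date = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
                      stringsAsFactors = FALSE)
  all_entries <- rbind(query_sketch(dir = dir), entry)  # read before overwriting
  con <- gzfile(file.path(dir, "registry.tsv.gz"), "w")
  on.exit(close(con))
  write.table(all_entries, con, sep = "\t", row.names = FALSE, quote = FALSE)
  invisible(id)
}

# Copy a local file into <dir>/data/<aa>/<bb>/<hash> and register that location
store_sketch <- function(file, dir) {
  id <- hash_file(file)
  dest_dir <- file.path(dir, "data", substr(id, 1, 2), substr(id, 3, 4))
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dest_dir, id)
  file.copy(file, dest, overwrite = TRUE)
  register_sketch(id, dest, dir)
  id
}

# Try sources in `prefer` order, newest first; skip entries that fail the check
retrieve_sketch <- function(id, dir, prefer = c("local", "remote"),
                            verify = TRUE, verify_local = FALSE) {
  hits <- query_sketch(id, dir)
  hits <- hits[order(hits$date, decreasing = TRUE), , drop = FALSE]
  kind <- ifelse(grepl("^(https?|ftp)://", hits$source), "remote", "local")
  for (k in prefer) {
    for (src in hits$source[kind == k]) {
      path <- src
      if (k == "remote") {
        path <- tempfile()
        ok <- tryCatch(download.file(src, path, quiet = TRUE, mode = "wb") == 0,
                       error = function(e) FALSE, warning = function(w) FALSE)
        if (!ok) next
      } else if (!file.exists(path)) next
      check <- if (k == "remote") verify else verify_local
      if (!check || identical(hash_file(path), id)) return(path)
      message("hash mismatch for ", src, "; trying next entry")
    }
  }
  stop("no usable source found for ", id)
}

# Usage with throwaway temp paths
demo_dir <- tempfile("contenturi-sketch")
demo_file <- tempfile(fileext = ".txt")
writeLines("example content", demo_file)
demo_id <- store_sketch(demo_file, demo_dir)
retrieve_sketch(demo_id, demo_dir)
```

Unlike the behaviour described above, this sketch emits a message when a hash mismatch causes it to move on to the next entry. Swapping the TSV file or the directory layout for another backend would only require replacing `query_sketch()` / `register_sketch()` or the path construction in `store_sketch()`, which is roughly the kind of pluggable abstraction the last paragraph asks for.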