cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

Support for multiple hashes in the local registry #46

Closed cboettig closed 4 years ago

cboettig commented 4 years ago

This PR adds support for using multiple hashes in the local registry (md5, sha1, sha256, sha384, sha512). hash-archive.org already computes all of these hashes when registering any content.

By default, the local registry will still only compute the sha256 hash, as will the basic content_id() function. This should preserve performance, particularly in cases involving very large files.

Still, as discussed in #38, computation of multiple hashes is implemented to be as resource-efficient as possible -- all hashes are computed on a single data stream, meaning that we do not have to download the object more than once and we never have to store the whole object in memory. This would allow a contentid client to compute all 5 hashes on a data file that is much larger than available disk space in a single download.
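To make the single-stream idea concrete, here is a minimal sketch. It assumes the openssl R package and its multihash() function are available, and it is not necessarily how this PR implements things internally. The temp file and the to_hex helper are only for illustration.

```r
# Sketch only: relies on the openssl package, not on contentid internals.
library(openssl)

# Write a small file so the example is self-contained.
path <- tempfile()
writeBin(charToRaw("hello"), path)

# Open one binary connection; multihash() reads it in chunks and feeds
# every requested algorithm from that same stream.
con <- file(path, "rb")
hashes <- multihash(con, algos = c("md5", "sha256"))
close(con)

# Illustrative helper: turn the raw digest bytes into a hex string.
to_hex <- function(h) paste(as.character(unclass(h)), collapse = "")
hex <- vapply(hashes, to_hex, character(1))

# Format each digest as a hash:// URI.
uris <- paste0("hash://", names(hex), "/", hex)
unlink(path)
```

Because the connection is read chunk by chunk, memory use should stay roughly constant regardless of file size; the same pattern would apply to a url() connection opened in binary mode.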

Here's an example that computes all hashes:

content_id("https://zenodo.org/record/3678928/files/vostok.icecore.co2",
           algos = c("md5", "sha1", "sha256", "sha384", "sha512"))

The return type is a data.frame, with additional rows when the input is a vector. The default algorithms can also be set through the environment variable CONTENTID_ALGOS, given as a comma-delimited string.
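As a rough illustration of the comma-delimited env var, the sketch below shows one way such a value could be parsed in base R. The helper algos_from_env is hypothetical and is not a contentid function; the sha256 fallback mirrors the default described above.

```r
# Hypothetical helper (not part of contentid): read the algorithm list
# from CONTENTID_ALGOS, falling back to sha256 when the variable is unset.
algos_from_env <- function(default = "sha256") {
  value <- Sys.getenv("CONTENTID_ALGOS", unset = default)
  # Split on commas and drop any surrounding whitespace.
  trimws(strsplit(value, ",", fixed = TRUE)[[1]])
}

# With the variable set, all listed algorithms are returned.
Sys.setenv(CONTENTID_ALGOS = "md5,sha1,sha256")
algos <- algos_from_env()

# With the variable unset, only the default is returned.
Sys.unsetenv("CONTENTID_ALGOS")
fallback <- algos_from_env()
```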

~Hashes are currently stored in the internal registry as additional columns, in SRI's base64 format. Since the registry is stored with gz compression, it's possible we'd be better off keeping these in hashuri format for consistency. However, the SRI format is also the internal format returned by hash-archive.org, though of course we can translate between the two.~ Edit: hashes are always stored and displayed in hash:// format.
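For reference, translating between the two formats is mostly a matter of re-encoding the digest bytes. The sketch below assumes the openssl package for base64 handling; sri_to_hashuri and hashuri_to_sri are hypothetical names for illustration, not functions exported by contentid.

```r
library(openssl)

# Hypothetical helper: "sha256-<base64>" to "hash://sha256/<hex>".
sri_to_hashuri <- function(sri) {
  parts <- regmatches(sri, regexec("^([a-z0-9]+)-(.+)$", sri))[[1]]
  raw_digest <- base64_decode(parts[3])
  hex <- paste(as.character(raw_digest), collapse = "")
  paste0("hash://", parts[2], "/", hex)
}

# Hypothetical helper: "hash://sha256/<hex>" to "sha256-<base64>".
hashuri_to_sri <- function(uri) {
  parts <- regmatches(uri, regexec("^hash://([a-z0-9]+)/([0-9a-f]+)$", uri))[[1]]
  hex <- parts[3]
  # Split the hex string into byte pairs and convert each to a raw byte.
  starts <- seq(1, nchar(hex), by = 2)
  bytes <- substring(hex, starts, starts + 1)
  raw_digest <- as.raw(strtoi(bytes, base = 16L))
  paste0(parts[2], "-", base64_encode(raw_digest))
}

# Round trip on the sha256 digest of a short test string.
digest <- sha256(charToRaw("hello"))
sri <- paste0("sha256-", base64_encode(unclass(digest)))
uri <- sri_to_hashuri(sri)
```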

Currently query_history() will show these columns by default for a given source. query_sources() does not show them by default, but they can be specified in the column list to get them back.

Examples are in the unit tests, but otherwise this isn't yet clearly documented. It will probably be covered in a separate vignette.

cboettig commented 4 years ago

cc @mbjones re support for multiple hash algorithms