cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

registry could be a block chain #6

Closed cboettig closed 4 years ago

cboettig commented 4 years ago

The registry file can itself be hashed. Then start a fresh registry that says "it continues from registry with hash xxxx". A registry query would then load the chain of registry files.
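A rough sketch of what that could look like for a plain tsv registry (the "continues-from" comment convention, the file names, and load_registry_chain() are hypothetical, not part of contentid; only content_id() and resolve() are existing contentid functions):

```r
# Sketch only: chaining plain tsv registries by content hash.
library(contentid)

old_registry <- "registry.tsv"                 # illustrative path
old_id <- content_id(old_registry)             # e.g. "hash://sha256/xxxx..."

# Start a fresh registry that records which registry it continues from.
new_registry <- "registry-2.tsv"
writeLines(paste("# continues-from:", old_id), new_registry)

# A query would then walk the chain, resolving each earlier registry by its hash.
load_registry_chain <- function(path) {
  chain <- list()
  while (!is.null(path)) {
    lines <- readLines(path)
    chain[[length(chain) + 1]] <- lines
    prev <- grep("^# continues-from: ", lines, value = TRUE)
    path <- if (length(prev) > 0) {
      resolve(sub("^# continues-from: ", "", prev[[1]]))
    } else {
      NULL
    }
  }
  chain
}
```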

jhpoelen commented 4 years ago

preston is using a similar approach - for example, see https://github.com/bio-guoda/preston-amazon/blob/master/data/1a/a3/1aa34112ade084ccc8707388fbc329dcb8fae5f895cb266e3ad943f7495740b3 .

jhpoelen commented 4 years ago

more specifically, for the reference to previous version, see https://github.com/bio-guoda/preston-amazon/blob/aa12382fef51cdea8e49651132deb9a78aa71488/data/1a/a3/1aa34112ade084ccc8707388fbc329dcb8fae5f895cb266e3ad943f7495740b3#L10

cboettig commented 4 years ago

Just noting here that while I do think the chain idea you outlined is very compelling and I'd love an R implementation, I'm also not sure it should be part of the core spec (at least in nquads serialization). Ideally I'd like the registry to be pluggable to any key-value store (i.e. use Redis if you want), but I also think tsv makes a nice default. At least from the R end of things, we have much better tooling to support tabular serializations than we do for most RDF serializations, and on pure performance it would be hard to beat a SQL-style database or an optimized pure key-value store. The tsv serialization also helps demonstrate the conceptual simplicity of the whole thing, with 'no magic attached'.

I do like the idea you mentioned of being able to exchange registries directly, e.g. by serializing preston's registry out into a registry.tsv.gz. That also helps underscore the point that the approach is independent of details such as how a specific tool serializes its registry.

cboettig commented 4 years ago

see #25

jhpoelen commented 4 years ago

I still think that the ability to reliably reference, and exchange, content registries is pretty central to the functionality of the contentid package.

Without being able to keep track of, and reliably reference, the reported associations between locations and their associated content ids, the contentid package continues to rely on location-based identifiers and centralized architectures (specific, unreliable URLs used to access local/remote registries).

Perhaps some instructive use cases can help guide ideas on how to make it easier to reliably exchange and reference content registries.

cboettig commented 4 years ago

@jhpoelen yup, I think as per #25 this should be straightforward though? A registry in contentid is just a tsv file (or lmdb database), and it can be referenced, exchanged, etc. just like any other data (i.e. I can compute the content id of the .tsv file and post it on some website or data repository, etc.).

For example, the registry that I computed locally for a large chunk of DataONE objects has identifier hash://sha256/7a62443df4472c1c340ef6e60f3949e9e79be73d3d7e60897107fb25d9bb3552 (which can be resolved to a URL at https://hash-archive.org or at https://hash-archive.carlboettiger.info).

As noted in #25, a user could contentid::resolve() this hash to download it, then uncompress the tsv and list its path in the registries() list.

We could block-chain these tsv files, but it adds complexity. Simply listing multiple registries in the argument registries=c("/path/to/reg1.tsv", "/path/to/reg2.tsv"), etc., has, I think, an identical effect as far as the functioning of any command in contentid is concerned. Chaining registries in this manner also easily extends across different registry types -- i.e. contentid already allows you to chain hash-archive.org registries and the software-heritage registry, as well as both local tsv-backed and lmdb-backed registries, in this manner.
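For concreteness, a rough sketch of that workflow (the local file names and the placeholder data id are illustrative; whether the downloaded registry first needs uncompressing depends on how it was serialized):

```r
library(contentid)

# resolve() fetches a verified copy of the shared registry by its content id
reg_gz <- resolve("hash://sha256/7a62443df4472c1c340ef6e60f3949e9e79be73d3d7e60897107fb25d9bb3552")

# uncompress it to a plain tsv so it can be used as a local registry
local_reg <- "dataone-registry.tsv"
writeLines(readLines(gzfile(reg_gz)), local_reg)

# any contentid call can then consult several registries at once,
# mixing the downloaded tsv with other local or remote registries
resolve("hash://sha256/xxxx...",                  # placeholder id of some data of interest
        registries = c(local_reg,
                       "https://hash-archive.org",
                       "https://hash-archive.carlboettiger.info"))
```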

What do you think? Is treating the registry like other data sufficient for reliable reference and exchange?

jhpoelen commented 4 years ago

Stepping back a bit, the contentid implementation helped me realize that:

  1. contentids without a registry are like a fish without water: you need the associations documented in the registries to provide a meaningful context for the opaque content id in order to use the registered content.
  2. current implementations of registries (e.g., hash-archive, local tsv, dataone) help associate provenance information with a specific location-agnostic content id.
  3. for some reason, the registries (and their provenance data) still depend on location-based identifiers (e.g., url endpoints, local file paths).

To continue the fish metaphor: in the contentid package, we succeeded in moving location-based identifiers for datasets (i.e. the fish) into the content-based domain to allow for reliable dataset references, but we have not yet done the same with registry/provenance data (the water in which the fish swam).

While having the provenance records (tsv, dataone meta-data, prov-o nquads) link to prior versions or related records using reliable references (to establish a blockchain) would help establish a history or knowledge graph around the referenced content id, existing non-linked meta-data records can also be reliably referenced by their contentids (meta-data is data too). In other words, putting the provenance/registry into the content domain (instead of the location domain) would enable fully offline, decentralized workflows that can be archived, moved, and replicated without having to rely on DNS or some other location-based resolver (e.g., a local file system).

So, translating this back into pseudo-code, I'd expect:

some_registry <- contentid::import_registry("hash://sha256/adf") # where the hash is a state or version of the registry
some_data <- c(1, 2, 3, 4)
some_prov_data <- list(method = "...") # placeholder; prov_data describes the process / method / associations relevant to the data
contentid::register(some_data, some_prov_data, some_registry) # this might fail for read-only registries

where register() registers both the data and the related prov_data with the registry. If the registry implements some sort of linked list of registered content (e.g., a blockchain), the hash of the added prov_data (the "head" of the blockchain or linked list) and the hash of the registry itself can be the same, because the tail can be traversed through the back links in the head.

So, this would leave the registration action with two reliable references: (1) the content id of the registered content and (2) the content id of the context/registry/provenance that gives a clue to how the content came about.

This approach would also work with existing external registries like hash-archive and dataone by chaining the provenance records (history records and eml-like records, respectively) in some way.

So, yes, I'd say treating the registry like any other data would be sufficient for reliable reference and exchange. However, I'd be curious to hear your thoughts on the ideas above. I think it is really important to have this, but perhaps I am seeing bears on the road.

cboettig commented 4 years ago

I think the issue is that our fish can swim in many different lakes?

Right, a content-based identifier cannot be resolve'd without a registry, so a registry is pretty important, though to me your analogy overstates the issue. (I see at least some value in knowing the content hash of the data being used, even without any registry that can resolve said id and give me back the data.)

I think it would be useful to discuss precisely what provenance we're talking about and why. I completely agree that merely being able to resolve a content hash into some bits and bytes is often not sufficient. In the language of FAIR data publishing, it makes the data accessible but not findable or reusable, which requires metadata, ideally including data provenance. To me, that means information about where the data came from, how it was collected/transformed/etc. You have rightly argued that this is all just 'more data'. The data can be any bits and bytes, but metadata tries to follow some standard that lets repositories or search engines interpret & index it so it is searchable. e.g. I need to know that the bytes with hash://sha256/adf are a tsv file and not a jpeg in order to do the right thing with them. But, to me at least, this is all beyond the scope of contentid and addressed by other existing standards.

I think we are talking about the provenance of URL locations used by some registries, but not others. For content accessed by identifier through the DataONE API, or for content accessed through the SoftwareHeritage API, the provenance of the source, as captured by the discussion in #2, doesn't seem to serve the same purpose as the local registry or the hash-archive.org registry, which both map URLs to the hashes seen at those URLs on a given date, etc.

jhpoelen commented 4 years ago

Thanks for taking the time to reply. At risk of over-communicating, I've included comments to your reply below.

I think the issue is that our fish can swim in many different lakes?

In the ideas described earlier, I don't see any issue with a content id having multiple pieces of provenance/meta-data associated with it across different registries. For instance, a dataset associated with some contentid can be registered with dataone, hash-archive, and a local registry. Perhaps we are not talking about the same thing - perhaps something to discuss in real time?

Right, a content-based identifier cannot be resolve'd without a registry, so a registry is pretty important, though to me your analogy overstates the issue. (I see at least some value in knowing the content hash of the data being used, even without any registry that can resolve said id and give me back the data.)

I'd argue that doing a search by content id to find a dataset is a registry operation. So, you always need a registry of sorts (e.g., a general-purpose search engine, a local file system, a file naming convention, hash-archive) to associate a content id (the fish reference) with its content (the actual fish, or one of many clones) at a known location (the water, or context in which the fish exists). I consider the association of a content id with a location to be part of the provenance of the content.

I think it would be useful to discuss precisely what provenance we're talking about and why.

To me, content provenance is any information associated with the content via a (reliable) reference. This includes a URL the content may have been downloaded from, a description of an activity that may have downloaded, used, or produced the content, a date-time string describing when the content was last generated, the size of the content in bytes, the mime-type of the content, the author of the content, etc. This metadata or content provenance may adhere to some standard (e.g., EML) or might just be some cooked-up custom format (e.g., a local tsv file, or an lmdb db).

So, in my mind, an EML document that references the content id qualifies as content provenance. Also, a three-column, single-row tsv file with a content id, a date stamp, and a URL is content provenance.
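To make that concrete, a minimal example of such a record (the column names, placeholder hash, date, and URL are purely illustrative, not an exact format that contentid uses):

```r
# one row of minimal content provenance: content id, date stamp, source url
prov <- data.frame(
  id   = "hash://sha256/adf",                  # placeholder content id
  date = "2020-07-01T00:00:00Z",
  url  = "https://example.org/some-dataset.tsv"
)
write.table(prov, "minimal-provenance.tsv", sep = "\t",
            row.names = FALSE, quote = FALSE)
```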

So, when you say:

I completely agree that merely being able to resolve an content hash into some bits and bytes is often not sufficient.

I'd say that you need provenance (e.g., a last known location) to be able to find/retrieve/access the content associated with some content id/hash. The content hash itself is just that: a string of characters that makes for a reliable content reference.

This is why, when you say,

But, to me at least, this is all beyond the scope of contentid and addressed by other existing standards.

I'd argue that the current contentid package already includes support for recording, querying, and retrieving content provenance/meta-data records through the content registry operations. Minimal provenance, yes, but provenance nonetheless. Removing the registry functionality (incl. using a naming convention to associate content with its hash on a local file system) would turn contentid into a hash calculator / validator with one method, calculate_contentid(), and no way to resolve ids to their associated content.

Perhaps we should rename the package from contentid to a more generic content to reflect that it does more than just hashing bits 'n bytes.

Curious to hear your thoughts on what you consider provenance and how my notes above relate to that.

Thanks for being patient. I realize that scope and definitions are important to help describe and adopt ways to work with datasets more reliably.