registry format / schema

cboettig commented 4 years ago

What schema should we use in a local registry?

We could mimic the schema of hash-archive.org, with 6 fields: url, timestamp, status, type, length, and hashes like so:

[
    {
        "url": "http://cdiac.ornl.gov/ftp/trends/co2/vostok.icecore.co2",
        "timestamp": 1581637415,
        "status": 200,
        "type": "text/plain; charset=UTF-8",
        "length": 11036,
        "hashes": [
            "md5-4nyZp/cB2rl7fQnEZ6z0aA==",
            "sha1-hwnE6Ui6XbPGJT0NATTG8jXV6p4=",
            "sha256-lBIyWDHasiruvdZ0tutTumt73QS7maTbsh3f9kYofjc=",
            "sha384-YlYXQFFqJ+MMfAylc0kWWlj66Jhzm1b1dndnPzFgNMaFqH7b/2FhRfZrN1b1STu9",
            "sha512-86drV5lnde61R+GJxwcgm6ig5Jrnq+jE24NWx0FsT05dwvuJj6tdkMjyXaDNxEl2dN7VtbJlVlI0XGz3csEl"
        ]
    }
]

Obviously we can represent this in tabular form as well, or whatnot. (We could also define this as a simple S3 class with an associated print method that would just show, say, the non-base64 version of the sha-256 string in hash://sha-256/ format...)
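As a rough illustration of that display idea, here is a minimal Python sketch that turns one of the `algorithm-base64` strings from the record above into a hex-encoded hash URI. The function names `to_hash_uri` and `preferred_id` are hypothetical, and the `hash://sha256/` spelling follows the examples later in this thread; neither is an API of hash-archive.org or of this package.

```python
import base64


def to_hash_uri(entry: str) -> str:
    """Convert 'sha256-<base64 digest>' into 'hash://sha256/<hex digest>'."""
    # Standard base64 never contains '-', so the first '-' separates
    # the algorithm name from the encoded digest.
    algorithm, _, encoded = entry.partition("-")
    digest = base64.b64decode(encoded, validate=True)
    return f"hash://{algorithm}/{digest.hex()}"


def preferred_id(record: dict, algorithm: str = "sha256") -> str:
    """Pick one hash from a hash-archive style record and show it as a hash URI."""
    for entry in record["hashes"]:
        if entry.startswith(algorithm + "-"):
            return to_hash_uri(entry)
    raise KeyError(f"no {algorithm} hash in record")


# Usage with the sha256 entry from the record shown above
record = {"hashes": ["sha256-lBIyWDHasiruvdZ0tutTumt73QS7maTbsh3f9kYofjc="]}
print(preferred_id(record))
```

A print method could call something like `preferred_id()` per record, so the table shows a single identifier while the full list of hashes stays in storage.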

I'm aware that we will want to think more about a richer provenance record too, but for the moment I think of that as a higher layer, and want to get this layer right first.

For instance, if we define this schema, then I can better define user-facing functions (like pin()) that use this registry as a backend. The functions I want right now are those that would take a content URI as input and return a location (most recent location, most recent on-disk location, online location, etc). Of course, that assumes that for any registry I know where the 'location' (i.e. url in the above schema) and timestamp fields can be found.
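To make the lookup idea concrete, here is a small Python sketch under stated assumptions: the registry is a flat list of rows with `identifier`, `url`, and `timestamp` fields, and anything without a web scheme counts as an on-disk location. The row layout, the sample data, and the names `locations`, `most_recent`, and `is_local` are all hypothetical, not the package's actual functions.

```python
from typing import List, Optional


def is_local(location: str) -> bool:
    # Assumption: anything without a web/ftp scheme is an on-disk path.
    return not location.startswith(("http://", "https://", "ftp://"))


def locations(registry: List[dict], identifier: str) -> List[dict]:
    """All rows registered for a content identifier, newest first."""
    rows = [row for row in registry if row["identifier"] == identifier]
    return sorted(rows, key=lambda row: row["timestamp"], reverse=True)


def most_recent(registry: List[dict], identifier: str,
                local: Optional[bool] = None) -> Optional[str]:
    """Most recent location; local=True/False restricts to on-disk/online."""
    for row in locations(registry, identifier):
        if local is None or is_local(row["url"]) == local:
            return row["url"]
    return None


# Hypothetical registry contents, for illustration only
registry = [
    {"identifier": "hash://sha256/aaa", "url": "https://example.org/data.csv", "timestamp": 1581637415},
    {"identifier": "hash://sha256/aaa", "url": "/tmp/store/aaa", "timestamp": 1581637500},
    {"identifier": "hash://sha256/aaa", "url": "https://mirror.example.org/data.csv", "timestamp": 1581637600},
    {"identifier": "hash://sha256/bbb", "url": "https://example.org/other.csv", "timestamp": 1581637700},
]

print(most_recent(registry, "hash://sha256/aaa"))
print(most_recent(registry, "hash://sha256/aaa", local=True))
```

The only things these functions need from the schema are the identifier, location, and timestamp columns, which is why pinning down those three comes first.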

It would be nice if we used some valid RDF namespace for this too, instead of the arbitrary names used by hash-archive.org -- maybe PROV? (Then I can easily map the hash-archive.org terms into our preferred namespace before displaying them).

cboettig commented 4 years ago

@jhpoelen Can you point me to the vocab you are using for the above?

Couldn't find an obvious mapping all within PROV, probably because PROV intends us to draw on other namespaces. I'd propose borrowing terms directly from Dublin Core here, since that seems like the most widely recognized (and cross-mapped) vocabulary:

timestamp <--> dc:date
type <--> dc:type
length <--> dc:extent
hashes <--> dc:identifier
url <--> dc:source
status <--> ??

Thoughts on this?

I'm actually not sure how to treat status anyway -- should the local registry be recording 404's on URLs that don't resolve (and thus for which we can't get a content hash)? What about 301? Others? What does status mean for local resources? Anyway, I'm inclined to ignore 'status' for now, but maybe it's something to revisit when we talk about a richer provenance model. For the moment I'm just focused on getting core registry behavior right.
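The proposed mapping can be sketched as a simple key rename, with status dropped because it has no agreed term yet. This is only an illustration in Python; `DC_TERMS` and `to_dublin_core` are hypothetical names.

```python
# Mapping proposed above: hash-archive.org field -> Dublin Core term.
# 'status' is intentionally absent because no mapping was settled on.
DC_TERMS = {
    "timestamp": "dc:date",
    "type": "dc:type",
    "length": "dc:extent",
    "hashes": "dc:identifier",
    "url": "dc:source",
}


def to_dublin_core(record: dict) -> dict:
    """Rename mapped fields and silently drop unmapped ones (e.g. status)."""
    return {DC_TERMS[key]: value for key, value in record.items() if key in DC_TERMS}


record = {
    "url": "http://cdiac.ornl.gov/ftp/trends/co2/vostok.icecore.co2",
    "timestamp": 1581637415,
    "status": 200,
    "type": "text/plain; charset=UTF-8",
    "length": 11036,
}
print(to_dublin_core(record))
```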

jhpoelen commented 4 years ago

I do have thoughts on this and my current thinking is reflected in the preston implementation. Generally, they follow a process-oriented description, rather than a resource-centered description. I'd very much like to hear your thoughts on this and figure out a way to align these implementations.

Here goes . . .

Traditionally, with a url resource oriented description, you'd say stuff like:

<some:url> <dc:date> "2019-04-23T19:45:58.388Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<some:url> <dc:identifier> <hash://sha256/123...> .

However, the provenance of these entries is not clear. Who determined that the url had an identifier? What does the date mean? This is where prov / pav come in.

<hash://sha256/94a4fc4824d951c0155860e3e5c4c662afcfae55d52a25105bca1cb6a7b3a062> <http://www.w3.org/ns/prov#generatedAtTime> "2019-04-23T19:45:58.388Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<hash://sha256/94a4fc4824d951c0155860e3e5c4c662afcfae55d52a25105bca1cb6a7b3a062> <http://www.w3.org/ns/prov#wasGeneratedBy> <de8b4109-5ca7-4844-b486-56d3a434fc45> .
<http://tb.plazi.org/GgServer/dwca/FFF76241FFB3B773A711FF98DF6BFFF3.zip> <http://purl.org/pav/hasVersion> <hash://sha256/94a4fc4824d951c0155860e3e5c4c662afcfae55d52a25105bca1cb6a7b3a062> .

where <de8b4109-5ca7-4844-b486-56d3a434fc45> is a download process and hash://sha256/94a4... is the retrieved content.

Coming back to your specific examples:

timestamp <--> dc:date - preston uses http://www.w3.org/ns/prov#generatedAtTime .
type <--> dc:type - preston uses http://purl.org/dc/elements/1.1/format
length <--> dc:extent - preston does not record length, but I did need this and was able to derive the length from the stored content hashes. So, technically, you don't need the data, but it might be helpful to add this anyway.
url <--> dc:source - preston uses the http://www.w3.org/ns/prov#used to relate a process to a resource/entity that was used. See example of preston track https://ropensci.org below.
status - preston uses skolemized blanks (see https://www.w3.org/TR/rdf11-concepts/#section-skolemization) to record "holes" in the biodiversity dataset graph caused by 404s or any other error code. There's probably a more elegant way to do this, and I didn't go down the rabbit hole and attempt to model http request / response in rdf. . . yet ; )

Here's an example of preston track https://ropensci.org, line-by-line:

<hash://sha256/7dbc4e6ac915ab8acfea97a82b2252c8c56ac6acfb0e4f89b972d552fa75538d> <http://www.w3.org/ns/prov#wasGeneratedBy> <124b8bdb-00f4-4633-8f16-1d4108a67a63> .

some content was generated by <124b...>

<hash://sha256/7dbc4e6ac915ab8acfea97a82b2252c8c56ac6acfb0e4f89b972d552fa75538d> <http://www.w3.org/ns/prov#qualifiedGeneration> <285cef08-fcfc-4acd-bf51-c1f73d283f72> .

this content (hash://sha256/7dbc...) has a qualified generation, <285c...>, which is related to the download event that produced it. This is a way to assign qualities to a generation "event".

<285cef08-fcfc-4acd-bf51-c1f73d283f72> <http://www.w3.org/ns/prov#generatedAtTime> "2020-02-17T15:37:38.214Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

here, the qualified generation <285...> is used to describe at what time the generation occurred.

<285cef08-fcfc-4acd-bf51-c1f73d283f72> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .

This line states that <285...> is, in fact, a generation (event).

<285cef08-fcfc-4acd-bf51-c1f73d283f72> <http://www.w3.org/ns/prov#activity> <124b8bdb-00f4-4633-8f16-1d4108a67a63> .

Now, the generation event is related to the specific download event.

<285cef08-fcfc-4acd-bf51-c1f73d283f72> <http://www.w3.org/ns/prov#used> <https://ropensci.org> .

The generation event used resource https://ropensci.org .

<https://ropensci.org> <http://purl.org/pav/hasVersion> <hash://sha256/7dbc4e6ac915ab8acfea97a82b2252c8c56ac6acfb0e4f89b972d552fa75538d> .

With this, the claim can be supported that the resource accessed via https://ropensci.org has version hash://sha256/7dbc... .

The last line is more of a shortcut and can be derived from the qualified generation event.

jhpoelen commented 4 years ago

Note that with the "preston" approach, you can still create familiar tables like:

url	version	timestamp	type
https://example.org	hash://sha256/123...	2020-01-01T...	application/dwca

without being locked into that schema. The table becomes a user interface on top of the download (or generation) event in the provenance log.
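As a hedged sketch of that "table as a view" idea, the Python below reads three of the statements from the preston track https://ropensci.org example above and derives url / version / timestamp rows from them. The regular expression is a simplification that only handles statements shaped like the ones in this thread, not full N-Triples or N-Quads, and the names `parse` and `table` are hypothetical.

```python
import re

PROV = "http://www.w3.org/ns/prov#"

# Simplified statement pattern: <s> <p> then either <iri> or "literal"^^<datatype>, then '.'
STATEMENT = re.compile(
    r'<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)"(?:\^\^<[^>]*>)?)\s*\.'
)

LOG = '''
<hash://sha256/7dbc4e6ac915ab8acfea97a82b2252c8c56ac6acfb0e4f89b972d552fa75538d> <http://www.w3.org/ns/prov#qualifiedGeneration> <285cef08-fcfc-4acd-bf51-c1f73d283f72> .
<285cef08-fcfc-4acd-bf51-c1f73d283f72> <http://www.w3.org/ns/prov#generatedAtTime> "2020-02-17T15:37:38.214Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<285cef08-fcfc-4acd-bf51-c1f73d283f72> <http://www.w3.org/ns/prov#used> <https://ropensci.org> .
'''


def parse(log: str):
    """Yield (subject, predicate, object) tuples from the log text."""
    for match in STATEMENT.finditer(log):
        subject, predicate, iri, literal = match.groups()
        obj = iri if iri is not None else literal
        yield subject, predicate, obj


def table(log: str):
    """One row per qualified generation: url, version (content hash), timestamp."""
    statements = list(parse(log))
    properties = {}
    for subject, predicate, obj in statements:
        properties.setdefault(subject, {})[predicate] = obj
    rows = []
    for subject, predicate, obj in statements:
        if predicate == PROV + "qualifiedGeneration":
            generation = properties.get(obj, {})
            rows.append({
                "url": generation.get(PROV + "used"),
                "version": subject,
                "timestamp": generation.get(PROV + "generatedAtTime"),
            })
    return rows


for row in table(LOG):
    print(row["url"], row["version"], row["timestamp"], sep="\t")
```

The column names are chosen at display time, so they can change without touching the underlying log.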

cboettig commented 4 years ago

@jhpoelen Thanks very much for this.

Yup, I completely agree that we can create familiar tables and alter the schema later (i.e. have tools that translate those terms into schema terms, just like we map from one schema to another). And obviously that's what hash-archive.org has already done under the hood. For the moment that's what I'm doing: I'm not worried about having full URIs or actual RDF for the 'internal' registry; I just need to call these fields something. I guess I could just stick with the terms hash-archive.org is using for the 'internal' table schema, and kick this discussion down the road until I'm actually implementing this as provenance and storing this as quads (more on that for a separate thread).

However, I'd still prefer terms in a table match a standardized schema (and preferably just one namespace). As you know, from a tooling perspective it's nice if these terms are somewhat consistent (I know one of the main points of RDF is that I can call it schema:name and you can call it dc:title and the computer can read the OWL file and know these are the same).

So, in concrete terms:

the local registry is storing both urls and local paths in the relevant column, so URL doesn't seem like a great heading for this. I think of source as pretty close to the right thing (that is, dc:source, though I'll just go with the bare name in the implementation -- after all, I really like the way JSON-LD lets us qualify namespaces with a context instead of prefixing everything!). But I could be convinced there's a better term.
timestamp is nice, more specific than date (though a dateTime is still technically a dc:date). Maybe I should call it generatedAtTime though, since I'm being fussy (or maybe just silly) here and trying not to define more informal terms.
filesize: Ok, dc:extent is a terrible term for this. And like you say, this isn't really core registry information, even if it is quite useful metadata. I'm thinking I should drop this from the registry functions for now. Does that sound reasonable?
type: I'm not clear if dc:format or dc:type is preferable here. I've always been on the fence about recording this, because right now I have the registry just guessing this information using the mime package (basically a file-extension lookup), which seems unwise. Still, knowing what function to use to read/parse a content blob seems pretty fundamental, so I guess we should keep the column. (I know there are other ways, like 'magic numbers', to sometimes infer application type, but I'm not sure there's a robust solution here).
hasVersion: That's definitely an interesting column name for the hash://sha-256/xx strings. Version is a concept the scientific research audience would be familiar with, though of course it's usually used in a looser context. I'm a bit uncomfortable with the implication that the URL is the 'subject' of the sentence though (in the RDF sense). I really want to think of the content hash as the subject (the @id in JSON-LD notation). This was also discussed in https://github.com/schemaorg/schemaorg/issues/1831, which seems to lean toward using identifier over version for the content hash identifier.

Related to the last one, I've struggled a bit on what to call these things in the documentation ("hash URIs"? "content hash"? "content hash URI"? ....) I think most researchers actually aren't familiar with the term URI, but I kinda like calling it an identifier, with the pitch being that this can play many of the same roles that the research community has come to associate with a DOI. Maybe that's a discussion for a separate issue.

jhpoelen commented 4 years ago

Thanks for sharing your desire to use a single, well-defined schema as the basis for a table view of a registry.

Here is the most basic table I can come up with using the PROV schema. The idea is that registries record the process of generating content-based identifiers in the form of hash uris. So, each creation of a content uri is unique and gets a uuid (we may want to hide this uuid for a simplified view). Then, each row says that the activity used the url, generated the content uri, and ended at the provided time.

You can rename columns to use friendlier labels, but they are well defined.

hash generation activity uuid	http://www.w3.org/ns/prov#used	http://www.w3.org/ns/prov#generated	http://www.w3.org/ns/prov#endedAtTime
some activity uuid	https://example.org	hash://sha256/...	"2020-02-17T15:37:38.214Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>

cboettig commented 4 years ago

Based on feedback from #9, now implemented in #11, I agree that the core registry concept doesn't need a type column, and details of that (mime type, compression type, and other information) can be left to separate content metadata (DCAT2 seems reasonably apt). So our core columns (predicates) for a registry are then identifier, source, and date. (It's probably fine for terms to differ across implementations; clearly these aren't the same ones hash-archive.org already uses, but they have nice meanings by themselves or as Dublin Core.)

You're probably right that each row in the registry.tsv.gz should get a uuid, though I'm not wild about it. If users aren't used to that notion, it may be confusing -- is the uuid a row identifier, or an identifier for the content? Or the source location? Why a uuid? If I wanted to refer to a row uniquely, wouldn't it be philosophically consistent to use the 'content hash' of that row? (Yeah, I realize it's serialization dependent.) A uuid adds a dependency and some computational overhead, though no doubt negligible. Also, I haven't entirely wrapped my head around the use case of being able to refer to a generation in the registry by a uuid or global identifier instead of something that's left as internal (i.e. these are just blank nodes in the registry, which can already have their own identifiers).
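To illustrate the trade-off between the two kinds of row identifier, here is a minimal Python sketch. It assumes a row is serialized as a tab-separated line of the identifier, source, and date columns; that serialization, the sample values, and the names `row_hash` and `row_uuid` are assumptions for illustration, not how the registry is actually written.

```python
import hashlib
import uuid


def row_hash(row: dict, columns=("identifier", "source", "date")) -> str:
    """Content hash of a row; depends entirely on the chosen serialization."""
    line = "\t".join(str(row[column]) for column in columns) + "\n"
    return "hash://sha256/" + hashlib.sha256(line.encode("utf-8")).hexdigest()


def row_uuid() -> str:
    """Random identifier; unrelated to the row's content or serialization."""
    return "urn:uuid:" + str(uuid.uuid4())


# Hypothetical row, for illustration only
row = {
    "identifier": "hash://sha256/aaa",
    "source": "https://example.org/data.csv",
    "date": "2020-02-17T15:37:38Z",
}

# Same row, same serialization: the hash is reproducible by anyone.
print(row_hash(row) == row_hash(dict(row)))
# Same row, different column order: a different identifier.
print(row_hash(row) == row_hash(row, columns=("source", "identifier", "date")))
# Two uuids for the same row never match, but need no serialization rule.
print(row_uuid() == row_uuid())
```

In this sketch the hash is reproducible but only once the serialization is fixed, while the uuid needs no such agreement but cannot be recomputed from the row.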

cboettig commented 4 years ago

I'm slowly coming around to the realization that you are again correct! We probably do want these uuids, even if we hide them from casual display....

cboettig commented 4 years ago

@jhpoelen I think we still want to think more about registry semantics. I like https://github.com/cboettig/contenturi/issues/5#issuecomment-587164324, but I'm not sure it's quite accurate.

@mbjones suggests we consider a prov:qualifiedUsage term instead. Example concerns:

In the proposed triple,

uuid prov:used http://example.org

http://example.org is used as an @id (an entity identified by that URI), which is not the sense we have in mind. (In the most literal sense, http://example.org is not a URI of a prov:Entity, but just a string of input text we supplied to the curl function at a particular time.) Likewise, prov:generated is ambiguous (did the action 'generate' a text string or an entity?). The semantics imply that it generated the entity identified by this URI.

On the flip side, I'm still not sure how to write this out with qualifiedUsage and qualifiedGeneration, and I'm not sure that those PROV semantics are really proving fit-for-purpose here. I like the intent of a semantically meaningful registry, but am now tempted to say it's just a key-value store used by a software application, and any attempt to infer semantics is in the eye of the beholder.

jhpoelen commented 4 years ago

The qualifiedGeneration can already express the used relation:

from preston track https://example.org

<hash://sha256/ea8fac7c65fb589b0d53560f5251f74f9e9b243478dcb6b3ea79b5e36449c8d9> <http://www.w3.org/ns/prov#qualifiedGeneration> <23ee3afa-8b78-4782-87c8-4f21272a36e4> .
<23ee3afa-8b78-4782-87c8-4f21272a36e4> <http://www.w3.org/ns/prov#generatedAtTime> "2020-02-21T22:35:11.970Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<23ee3afa-8b78-4782-87c8-4f21272a36e4> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<23ee3afa-8b78-4782-87c8-4f21272a36e4> <http://www.w3.org/ns/prov#activity> <2d3aa8f5-28ac-48c1-9e53-7cf46b9bd757> .
<23ee3afa-8b78-4782-87c8-4f21272a36e4> <http://www.w3.org/ns/prov#used> <https://example.com> .

But perhaps there are better ways to express the relation between the generation of the hash and the resource used to generate it.

jhpoelen commented 4 years ago

So, the generation event uses a resource (e.g., https://example.org) and generates a content hash uri.
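Going the other way, a registry row can be expanded into the same five-statement pattern. The Python sketch below is only an approximation of the shape shown in the preston output above; it is not preston's implementation, and `generation_statements` and its parameters are hypothetical.

```python
import uuid
from datetime import datetime, timezone
from typing import List, Optional

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def generation_statements(url: str, content_id: str,
                          when: Optional[datetime] = None,
                          generation_id: Optional[str] = None,
                          activity_id: Optional[str] = None) -> List[str]:
    """Describe 'url was used to generate content_id' as a qualified generation."""
    when = when or datetime.now(timezone.utc)
    generation_id = generation_id or str(uuid.uuid4())
    activity_id = activity_id or str(uuid.uuid4())
    # Millisecond-precision UTC timestamp, e.g. 2020-02-21T22:35:11.970Z.
    # Note: a naive datetime is interpreted as local time by astimezone().
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    return [
        f"<{content_id}> <{PROV}qualifiedGeneration> <{generation_id}> .",
        f'<{generation_id}> <{PROV}generatedAtTime> "{stamp}"^^<{XSD_DATETIME}> .',
        f"<{generation_id}> <{RDF_TYPE}> <{PROV}Generation> .",
        f"<{generation_id}> <{PROV}activity> <{activity_id}> .",
        f"<{generation_id}> <{PROV}used> <{url}> .",
    ]


# Usage with a placeholder content identifier
for statement in generation_statements("https://example.org", "hash://sha256/abc"):
    print(statement)
```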

cboettig commented 4 years ago

@jhpoelen ah nice, that looks pretty reasonable to me at least

cboettig / contentid

registry format / schema #5