IRI/URI Policy

stephenhart8 commented 4 years ago

Guidelines for "cool URIs"

As stated by Tim Berners-Lee and the W3C, URIs should be sustainable identifiers for entities, which means the form of the URI should be chosen carefully.

Real world vs. web

First, the real-world thing must be distinguished from the web document that describes it. For example, the homepage of an artist on the CHIN website should have a different URI from the artist; otherwise, statements about the person would also apply to the homepage (and homepages don't get born).

http://www.chin-rcip.ca/person/bob - Homepage of the artist Bob on the CHIN website
http://www.chin-rcip.ca/uri/person/bob - Bob, the real-world person

Two kinds of URIs: Hash or Slash

According to the W3C, there are two options for URIs: hash URIs and 303 URIs.

Hash URI

Hash URIs are fragments of an RDF document. When a client wants to retrieve a hash URI, the HTTP protocol requires the fragment part to be stripped off before the URI is requested from the server. This means a URI that includes a hash cannot be retrieved directly, and therefore does not necessarily identify a Web document. But we can use such URIs to identify other, non-document resources without creating ambiguity. They take the following form:

http://www.chin-rcip.ca/person#bob - Bob the real-life person
http://www.chin-rcip.ca/person#alfred - Alfred the real-life person
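To make the fragment stripping concrete, here is a minimal Python sketch using the standard library's `urllib.parse.urldefrag`. The helper name `split_hash_uri` is only illustrative; the URIs are the examples from this comment.

```python
from urllib.parse import urldefrag


def split_hash_uri(uri: str) -> tuple[str, str]:
    """Return (URL actually requested from the server, fragment kept by the client)."""
    document_url, fragment = urldefrag(uri)
    return document_url, fragment


# Both hash URIs resolve to the same document; only the fragment differs.
bob_doc, bob_frag = split_hash_uri("http://www.chin-rcip.ca/person#bob")
alfred_doc, alfred_frag = split_hash_uri("http://www.chin-rcip.ca/person#alfred")
```

Because every hash URI in the set shares one document, a client always downloads the whole document, which is why this style suits small vocabularies better than large collections.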

Hash URIs should be preferred for rather small and stable sets of resources that evolve together. The ideal cases are RDF Schema vocabularies and OWL ontologies.

303 URIs

This kind of URI has the shape of an ordinary URL, but the web server is configured to answer requests for all such URIs with a 303 status code (a "See Other" redirect) and a Location HTTP header that provides the URL of a document representing the resource. For example, http://www.chin-rcip.ca/uri/person/bob would redirect to http://www.chin-rcip.ca/person/bob. Because CHIN will handle large and growing sets of resources, we should use the 303 URI format.
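A rough sketch of such a redirect, using only Python's standard `http.server`, might look like the following. The `/uri/` prefix follows the example above; the function and class names, the port, and the 404 behaviour are assumptions for illustration, not a proposed implementation.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional

URI_PREFIX = "/uri/"  # assumption: resource identifiers live under /uri/


def document_path(request_path: str) -> Optional[str]:
    """Map a resource path such as /uri/person/bob to its document path /person/bob."""
    if request_path.startswith(URI_PREFIX) and len(request_path) > len(URI_PREFIX):
        return "/" + request_path[len(URI_PREFIX):]
    return None


class SeeOtherHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        target = document_path(self.path)
        if target is None:
            self.send_error(404, "Not a resource identifier")
            return
        self.send_response(303)  # 303 See Other
        self.send_header("Location", target)  # relative reference to the document
        self.end_headers()


def run(port: int = 8000) -> None:
    """Start a local test server (hypothetical port); blocks until interrupted."""
    HTTPServer(("localhost", port), SeeOtherHandler).serve_forever()
```

Calling `run()` and requesting `/uri/person/bob` should return a 303 response pointing at `/person/bob`; in production this mapping would more likely live in the web server configuration.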

Principles of Cool URIs

Simplicity: Short, mnemonic URIs will not break as easily when sent in emails and are in general easier to remember, e.g. when debugging your Semantic Web server.

Stability: Once you set up a URI to identify a certain resource, it should remain this way as long as possible. Think about the next ten years. Maybe twenty. Keep implementation-specific bits and pieces such as .php and .asp out of your URIs; you may want to change technologies later.

Manageability: Issue your URIs in a way that you can manage. One good practice is to include the current year in the URI path, so that you can change the URI schema each year without breaking older URIs. Keeping all 303 URIs on a dedicated subdomain, e.g. http://id.chin-rcip.com/bob, eases later migration of the URI-handling subsystem.

Proposition for CHIN's URIs

With all those guidelines in mind, I would propose some kind of URI close to this:

[Image: URIformat-2]

This embedded URI format, with the provider's ID first and then the actor's ID, would let us see relatively easily what an entity is related to.
My examples here work, but I'm not sure that linking every entity to an actor would work everywhere. We should have a template for each node.
Some URIs would not be numeric (for instance those of the vocabulary) and would have a simpler format, like: http://www.chin-rcip.ca/uri/type/artist
I don't like the UUID format because it is too long and unreadable for humans.

What are your ideas about the URI template?

illip commented 4 years ago

France's Ministry of Culture published this interesting document on URI management (in French): https://www.culture.gouv.fr/Espace-documentation/Publications-revues/Identifiants-perennes-pour-les-ressources-numeriques

VladimirAlexiev commented 4 years ago

Hi @stephenhart8, this is a good overview, but I'd propose slightly different IRI patterns. (I use the word IRI because a URI is not necessarily resolvable, and a URL doesn't allow Unicode chars.)

Consider the base very carefully. It should use https: (more modern and secure; NOM are switching to https) and should not conflict with any existing website (https://www.chin-rcip.ca/ seems to be available).
Use only slash. Hash should be used only in small term collections (ontologies) where you want the whole collection to be returned at once.
Put "owned sub-objects" at the end, simulating a folder hierarchy. So /person/1234 is a Person, /person/1234/birth is his Birth event, /person/1234/birth/date is its time-span.
The semantic IRI (against which triples are recorded) should be the purest and shortest. So don't add uri/ in there; add something to the web page URL instead. E.g. http://vocab.getty.edu/aat/3001234 is the semantic IRI, whereas http://vocab.getty.edu/page/aat/3001234 is the web page.
You can also consider RDF documents in various formats. One nice way to distinguish them is by extension, which conforms to "put things at the end" and allows a convenient way to download (in addition to content negotiation). For example:
- /person/1234: semantic URL
- /person/1234.html: web page
- /person/1234.ttl: Turtle, includes all "entity data" incl sub-objects like birth, birth/date
- /person/1234.jsonld: JSON-LD, includes all "entity data"
- See http://vocab.getty.edu/doc/#Semantic_Resolution and the next 2 sections for Getty URL design, serving by extension, conneg, and semantic formats. NOM will conform to this.
Including "provider id" in the IRI depends a lot on the following questions (which merit opening a new issue)
- How many Persons will come from different providers thus are subject to Entity Matching? (i.e. creating person Clusters)
- Are there clearly dominant providers for certain Person populations, so they can seed each cluster?
- How stable are the clusters: if the seed provider drops from a cluster (for a variety of reasons), what do you do?
- Overall, my experience has been that an aggregator like CHIN is better off minting its own cluster IDs and not reusing provider IDs. Witness ULAN and especially VIAF: their IDs are sequential, don't reuse any provider ID, and reasonably stable. (I may describe later what it takes to maintain IRI stability of aggregated records)
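As a small illustration of the "serve by extension" idea from the list above, the following Python sketch splits a requested path into the semantic IRI path and a media type. The table and the function name `split_semantic_iri` are assumptions for illustration; only the three extensions come from the list.

```python
from typing import Optional, Tuple

# Assumed extension-to-media-type table for the extensions listed above.
MEDIA_TYPES = {
    "html": "text/html",
    "ttl": "text/turtle",
    "jsonld": "application/ld+json",
}


def split_semantic_iri(path: str) -> Tuple[str, Optional[str]]:
    """Split /person/1234.ttl into ('/person/1234', 'text/turtle').

    A path without a known extension is treated as the semantic IRI itself,
    so the media type is None and content negotiation would decide.
    """
    head, dot, ext = path.rpartition(".")
    if dot and ext in MEDIA_TYPES:
        return head, MEDIA_TYPES[ext]
    return path, None
```

Under this scheme the extension is always the last thing in the path, so sub-object paths such as /person/1234/birth stay untouched.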

VladimirAlexiev commented 4 years ago

A better name for the issue is "IRI Policy"

stephenhart8 commented 4 years ago

@VladimirAlexiev Thank you very much for your comment! You have made some very good points that will be very useful for our next meeting on URI Policy.

Consider the base very carefully

I agree that the base is really important, but I don't know how much flexibility the government has on this. Is that something we could investigate, @illip?

Put "owned sub-objects" at the end, simulating a folder hierarchy.

The "owned sub-objects" at the end is indeed a good option, more readable for humans and more structured. It seems better than putting all the births in the same "folder".

RDF documents in various formats

I don't know which option would be best: adding something for the webpage, or having the format within the URL. Maybe a combination of both (adding the webpage element in the URL and adding the formats for the datadumps)?

Including "provider id" in the IRI depends a lot on the following questions

Indeed, you are right that we should not base the person URI on the provider's ID, since multiple providers may document the same person. There should be another way of generating it. If we use the named graph at the record level, as discussed in issue #45, we would have both the URI of the person based on the provider's ID and a clustered URI for the person that is created by CHIN, if I understood correctly.

VladimirAlexiev commented 4 years ago

having the format within the URL?

The important thing is to have a different URL for the RDF document/record, to separate the business data from the bookkeeping data (as you said: people get born, records get created, and decidedly at different dates). Getty does NOT make this distinction, which was a mistake.

Whether to have different URLs per RDF format is a second question. It allows you to describe those different representations more sharply (eg void:uriRegexPattern, dc:format, dcat:byteSize)

datadumps

Note: I call "datadump" a big gob of data, e.g. the whole CIC dataset. It certainly needs a semantic description: VOID + DCAT2 + maybe ADMS. Do you have an issue for that? http://vocab.getty.edu/doc/#Descriptive_Information is very comprehensive but is old (Mar 2014) and misses important DCAT2 developments.

I call the Person records "semantic entities" or "entity RDF".

URI of the person based on the provider's ID and a clustered URI for the person that is created by CHIN, if I understood correctly.

Yes. With an open-ended set of providers, CHIN must take the neutral/universal approach and mint its own IRIs. Best if they are sequential; UUIDs are disliked.
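A minimal Python sketch of what minting sequential, provider-neutral IRIs could look like is given below. Everything here is hypothetical: the base, the class name `ClusterMinter`, the provider names, and the in-memory storage are assumptions, and real entity matching is far more involved.

```python
from itertools import count
from typing import Dict, Tuple

BASE = "https://www.chin-rcip.ca/person/"  # assumed base, not a decided policy


class ClusterMinter:
    """Mint sequential cluster IRIs and remember which provider records map to them."""

    def __init__(self) -> None:
        self._next_id = count(1)
        self._by_record: Dict[Tuple[str, str], str] = {}

    def iri_for(self, provider: str, local_id: str) -> str:
        """Return the cluster IRI for a provider record, minting one if needed."""
        key = (provider, local_id)
        if key not in self._by_record:
            self._by_record[key] = f"{BASE}{next(self._next_id)}"
        return self._by_record[key]

    def merge(self, provider: str, local_id: str, cluster_iri: str) -> None:
        """Attach another provider's record to an existing cluster (entity matching)."""
        self._by_record[(provider, local_id)] = cluster_iri


minter = ClusterMinter()
first = minter.iri_for("museumA", "x1")  # hypothetical provider and local ID
minter.merge("museumB", "77", first)     # a second provider documents the same person
```

The cluster IRI never embeds a provider ID, so a provider dropping out of a cluster does not force the IRI to change.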

Habennin commented 4 years ago

The 'owned sub-objects' question is a tricky one in CIDOC CRM.

Recently it has been quite heavily discussed in the Linked.Art meetings regarding dimensions:

https://github.com/linked-art/linked.art/issues/270

You may have to find the actual notes to get the entire gist of it, but the basic problem is this: from a curatorial point of view, you only need to consider the dimension as the object's 'own' property, because curators don't really care who measured it, when, or with what instrument. Considered from an event point of view, however, the dimension is not a proper attribute of the object (its own) but a produced attribute of a measurement.

This leads to a conflict around when to represent such entities in their (apparent) simplicity and when in their full form.

In fact, in CIDOC CRM very little is conceived as 'owned'. This is because of the basic 'event orientation' of the model and modelling strategy. Typically, things like 'dimension' or 'birth' are treated as intrinsic properties of the subject being described. This is a convenient way of looking at things and perfectly functional in many implementation scenarios, but when fully developed into the event logic, these entities actually lose their dependence on the individual.

You can see the argument fully developed by M. Doerr here: http://cidoc-crm.org/sites/default/files/KR_and_CM.ppt

So even a birth is not actually just 'my birth' although it is unique to my being birthed, since obviously my mother had to be there and potentially so was my twin.

In practice, I think the solution @VladimirAlexiev points to is probably a good one, but it is good to be aware of that real complication.

VladimirAlexiev commented 4 years ago

We discussed with @stephenhart8 in #45 what to do about "border" nodes that are shared between several "business objects". The most prominent example is person relations (PC14). If I have Person1-PC14-Person2, then I had better emit the PC14 and its triples in the semantic entity (DESCRIBE payload) of both Person1 and Person2.

@Habennin you give a good example with the Birth event, which may be shared between mother, father, and twins: in that case Birth had better be in the DESCRIBE payload of all those persons (though somewhat paradoxically, CRM does not consider the father to be directly involved in Birth).
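To illustrate emitting a shared "border" node in the payload of every entity it links, here is a toy Python sketch over plain tuples. The predicate names, the `relation/9` node, and the function `describe_payload` are made up for illustration and are not CRM properties or any store's actual DESCRIBE algorithm.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def describe_payload(entity: str, triples: List[Triple], shared_nodes: Set[str]) -> List[Triple]:
    """Collect triples about `entity`, its owned sub-objects, and shared nodes linked to it."""
    # Shared ("border") nodes that are directly connected to the entity.
    linked = {
        node
        for node in shared_nodes
        if any((s == node and o == entity) or (s == entity and o == node) for s, _, o in triples)
    }

    def belongs(subject: str) -> bool:
        return subject == entity or subject.startswith(entity + "/") or subject in linked

    return [t for t in triples if belongs(t[0])]


# Hypothetical data: relation/9 plays the role of a PC14-style node between two persons.
triples = [
    ("person/1", "name", "Ann"),
    ("person/1/birth", "date", "1900"),
    ("person/2", "name", "Bea"),
    ("relation/9", "from", "person/1"),
    ("relation/9", "to", "person/2"),
]
payload_1 = describe_payload("person/1", triples, {"relation/9"})
payload_2 = describe_payload("person/2", triples, {"relation/9"})
```

Both payloads contain the two `relation/9` triples, while the owned sub-object `person/1/birth` appears only in the payload of `person/1`.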

Such decisions should be pragmatic although they are subjective. Eg I'd emit Production events only with the payload of Artwork but not the Creator because the link to Artwork is much stronger, and otherwise the payload of Creator could become too large.

Such pragmatic considerations led us to cut off narrower and narrowerTransitive in Getty TGN: otherwise the payload of a top TGN node like World becomes just too big. I'm not talking about including any data of the narrower subjects: the narrowerTransitive statements alone number over 2.2M. We keep broader, and of course narrower is redundant with it.

Back to measurements and births: most museum data has NO important event info but just the final literals (eg length or date). In such typical cases it's much better to use URLs like <person/123/birth/date>, <object/123/height> instead of eg <measurement/unknownActor/unknownDate/unknownInstrument/object/123/height>
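The folder-like pattern can be sketched in a few lines of Python. The base and both function names are assumptions for illustration; the path segments follow the examples above.

```python
BASE = "https://www.chin-rcip.ca/"  # assumed base, not a decided policy


def sub_object_iri(kind: str, local_id: str, *segments: str) -> str:
    """Append owned sub-object segments after the owning entity's path."""
    return BASE + "/".join((kind, local_id) + segments)


def owning_entity(iri: str) -> str:
    """Recover the owning entity from a sub-object IRI by keeping the first two segments."""
    if not iri.startswith(BASE):
        raise ValueError(f"not under {BASE}: {iri}")
    parts = iri[len(BASE):].split("/")
    return BASE + "/".join(parts[:2])


birth_date = sub_object_iri("person", "123", "birth", "date")
height = sub_object_iri("object", "123", "height")
```

One practical benefit of this design is that the owner of any sub-object can be found by simple string handling, without querying the data.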

chin-rcip / collections-model

IRI/URI Policy #43
