Investigate named graphs

philbarker commented 2 years ago

Use cases:

We need to handle data about a resource that does not come from an authoritative source of data for that resource, in order to avoid publishing unvalidated claims about a resource. Examples:
- additional information about Credentials provided as part of an approval list;
- contextualizing information about competencies in an official framework.
We need to handle data that has no authoritative source, so that such data do not overwhelm data from authoritative sources in results displays. For example, skills created on an ad hoc basis rather than as part of well-managed frameworks.
We need to provide the source and licensing of data that we publish (#741) so that it can be acknowledged and used appropriately.
We need to record provenance for data in the registry, such as who created them and when they were last checked, in order to ensure that the data meet our currency policy.

Named graphs have potential to meet these use cases. Further discussion in a Google Doc, focusing on the first example of the first use case, but showing how the others might be addressed.

Proposed initial step is to investigate how the Registry software and associated tools might handle named graphs.

stuartasutton commented 2 years ago

This describes a simple but elegantly sophisticated way of using named graphs. Our data exports have been expressed as named graphs for a long time, so I would not expect this to be a big issue. As for RDF*, I would be very surprised if there weren't people working on a JSON-LD expression. That's probably not a simple task since any current JSON parser can handle JSON-LD. However, RDF* may take it to a level that goes beyond conformant JSON, which will lengthen the adoption curve.

siuc-nate commented 2 years ago

Thank you for your work in putting together that example, it is pretty thorough and I appreciate the lengths it/you went to to explain things. However, I have some reservations:

We already have a functional implementation in place that we previously agreed to using while we were developing handling for this exact problem back during the development of Transfer Value Profile. That same approach works here as well.
I do not believe it is worth the cost/time/effort to make the changes you describe, as:
- It increases the complexity of our data, needing an additional layer of @graphs
- Using a graph to contain two more graphs that contain two objects that describe the same thing is what we're already doing with bnodes, minus the outer extra layer of graph and the additional overhead described below.
- It breaks the way data is published, where one publisher = 1 envelope = 1 graph = 1 main resource + bnodes (and in the special cases of competency frameworks, concept schemes, and pathways, other top-level resources that are parts of those things)
- It breaks the way data is consumed, which is similarly dependent on those details
- It breaks the rule we have about a CTID being the unique part of both the /resources/ and /graph/ URI
- This, in turn, would require recoding a lot of functionality within the registry dealing with publishing, search indexing, getting data via APIs, etc.
- It would also require recoding all of our systems to handle data where we can't know what one URI or CTID is going to be based on another URI and/or CTID, which would also break things related to publishing and consuming
- It would require rewriting a lot of our documentation, powerpoints, diagrams, etc.
- We would have to retrain all of our partners on the new approach and explain to them why we made the change
- Any partner whose systems are reliant on the current structure of the data and rules about CTIDs/URIs (which we have emphasized in our presentations about them) would have their systems break for no apparent (in their eyes) benefit
- We have many other priorities that are much more pressing, and breaking all of our systems in the middle of trying to handle those would not be good
- Even if we managed to pull all of that off, it would be a significant investment of time/effort/etc. that is only relevant to a tiny percentage of the data we have or anticipate having
If I retrieve the URL (from your example) res:ce-01234567-abcd-abcd-abcd-00000000007, what would I get back? There are two JSON documents with that URI, which seems to do the opposite of what we want.
The idea that something can have multiple different identifiers is valid, and we don't need to engineer a workaround to a problem that isn't really a problem.
For better or worse, our URIs are the URLs of JSON documents, and our systems (and our partners' systems) rely on that fact. Those URIs/URLs are in turn based on CTIDs, which are the URIs for the intangible credential described by those documents. In the vast majority of cases, this is useful, and I don't think we should change it just to accommodate an edge case.
- As such, would it satisfy the semantics you're aiming for if we used the bnode approach we're currently using, but instead of (or perhaps in addition to) the use of sameAs, we indicated that the bnode needs to contain the same CTID as the "real" record? That would still require some changes in both our code and the registry, but I don't think they would be nearly as substantial.
- Alternatively, perhaps some other type of URI instead of a bnode URI?
- Or maybe this kind of "non-authoritative" data could be placed in a different "community" within the registry itself, with full-fledged URIs and such that would clearly indicate they come from a non-authoritative data source?

If we were starting from scratch, then perhaps we could do something different, but the system we have works. I understand and respect the semantic purity (for lack of a better term) that you're describing, but I think the practicalities of implementation (or perhaps more accurately, the impracticalities of changing the implementation) need to win out in this one. That being said, I am open to ideas for improving that implementation that don't involve changing the entire foundation of the way the registry is set up. I am opposed to going to so much effort to reinvent the solution to a problem that is already solved, particularly with everything else we have on our plate.

philbarker commented 2 years ago

Thanks for the reply, Nate; and I hear you on the first two top-level points you make (I did tag this as system-wide impact).

On the middle points (not in order):

The idea that something can have multiple different identifiers is valid, and we don't need to engineer a workaround to a problem that isn't really a problem.

Yes, it's valid. But what you say about something using one identifier has to apply whichever identifier you use. You can't attach different semantics just because you use a different identifier.

If I retrieve the URL (from your example) res:ce-01234567-abcd-abcd-abcd-00000000007, what would I get back?

That would depend on policy choices made by the service you asked. It might vary by context, for example: ask for information about a credential and you get what was provided by the owner of that credential; ask for information about a credential in a CAL and you get what was provided by the creator of the CAL and the owner of the credential; ask for everything and you could get everything. The user interface may make it clear what information comes from where.

For better or worse, our URIs are the URLs of JSON documents, and our systems (and our partners' systems) rely on that fact.... based on CTIDs, which are the URIs for the intangible credential described by those documents.

If the CTID were the id of the data stored and the URI the identifier of the intangible I would have no problem. One day it might be worth helping me get to the bottom of what the systems rely on (if it's simply that they need to resolve the URI to get a JSON doc then HTTP 303 redirects are your friend).

The problem is that it is the URI that is used for the @id in a JSON-LD node object. Another problem is that the JSON-LD spec gets tricky when trying to work out what it means by a "node object". On the one hand it says "@id [is] Used to uniquely identify node objects that are being described in the document..." (my emphasis), which suggests it is the identifier of the thing being described; on the other hand it says "A node object represents zero or more properties of a node in the graph" (node & graph are referenced to the RDF spec), which sounds more like a node object is the description, but then again the word "represents" is not the same as the word "is". FWIW, most RDF tools will take the @id as the identifier of the thing being described. At the root of this is a long debate in the RDF community that I know many people think is extremely unhelpful, so I try not to worry about it...

...but I do think it helps to be clear about what is data and what is the thing the data is about if you want to say that the data is owned by one person and the thing owned by another; or the data and the thing have different creation dates, and so on. It also matters if we want to say that someone has been awarded a credential and its identifier is ... (no they haven't been awarded a description of the credential). That's something we need to sort out for when working with other standards like VC-Edu.

Of the options you present,

"non-authoritative" data could be placed in a different "community" within the registry itself

is the one I would favor. Indeed, I would call all the stuff in that "community" a graph and give it a name.

siuc-nate commented 2 years ago

I think part of the confusion you mention within the spec is that while humans can understand the idea of "a thing" and "a description of that thing" being two separate entities with their own identifiers and so on, the computer can only comprehend the latter. An intangible, notional credential can't be stored in RAM, can't be sent over internet wires, can't be read from or written to, etc. The only "thing" that can is some representation of that thing (ie a JSON document). As such, as far as the code is concerned, the intangible "thing" does not exist, because it cannot exist, because the computer has no way to conceive of it (like trying to use a ruler to describe what something smells like). That leads to JSON, code, and eventually the thinking of developers being oriented around the representation of the thing, because that's all the computer can touch. Then you wind up with somewhat ambiguous documentation that tries to make sense of RDF in a way that works for the developer who is reading the spec in order to make the computer obey it.

What I'm trying to say is: In practical terms, all we can deal with are the representations of things (ie JSON documents). Those are, as far as the computer is concerned, what the URIs truly identify, and what the URLs truly resolve. So it's not that I disagree with the notion of an intangible credential as a separate entity from data; it's just that I can't do anything with that notion, so I focus on the data instead.

That would depend on policy choices made by the service you asked.

For a plain HTTP GET, the service just needs to return the matching document.

It might vary by context, for example: ask for information about a credential and you get what was provided by the owner of that credential; ask for information about a credential in a CAL and you get what was provided by the creator of the CAL and the owner of the credential; ask for everything and you could get everything.

A simple HTTP GET by URI has no way to distinguish between these, unless you're suggesting that the request needs to have some extra headers, or there should be some custom API involved.

The user interface may make it clear what information comes from where.

For the sake of this example, I'm not talking about an interface - just a "I put the URL in my browser's URL bar and hit enter" situation.

philbarker commented 2 years ago

Computers comprehend nothing; people program them to process data. We use standards so that different people know how to program different systems so that they process data in compatible ways. According to the RDF standard (and linked data defers to this, so if we say we use linked data it is what we should do):

Any IRI or literal denotes something in the world (the "universe of discourse"). These things are called resources. Anything can be a resource, including physical things, documents, abstract concepts, numbers and strings;

https://www.w3.org/TR/rdf11-concepts/#resources-and-statements

When an IRI/URI/URL that represents something that cannot be transmitted over HTTP is resolved, the HTTP standard says that the service handling the request may respond with a 303 redirect, which "indicates that the server is redirecting the user agent to a different resource".

This is how, in practice, in real large scale implementations, an identifier for a non-information resource can be used to retrieve a JSON doc about that resource. For example:

shuttle:$ curl -I https://www.wikidata.org/entity/Q42
HTTP/2 303 
date: Tue, 30 Nov 2021 14:24:43 GMT
server: mw1366.eqiad.wmnet
location: https://www.wikidata.org/wiki/Special:EntityData/Q42
content-type: text/html; charset=iso-8859-1

For a plain HTTP GET, the service just needs to return the matching document.

shuttle:$ wget https://www.wikidata.org/entity/Q42
--2021-11-30 14:27:20--  https://www.wikidata.org/entity/Q42
Resolving www.wikidata.org (www.wikidata.org)... 91.198.174.192, 2620:0:862:ed1a::1
Connecting to www.wikidata.org (www.wikidata.org)|91.198.174.192|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://www.wikidata.org/wiki/Special:EntityData/Q42 [following]
--2021-11-30 14:27:20--  https://www.wikidata.org/wiki/Special:EntityData/Q42
Reusing existing connection to www.wikidata.org:443.
HTTP request sent, awaiting response... 303 See Other
Location: https://www.wikidata.org/wiki/Special:EntityData/Q42.json [following]
[...]
2021-11-30 14:27:21 (1.94 MB/s) - ‘Q42’ saved [264805]

Q42 is a nice big JSON file of data about Douglas Adams.

So you can do something with the notion that URIs identify resources. The HTTP GET does return the JSON doc, albeit via two redirects: the first to redirect to a data object, the second to redirect to the default representation of the data object. The second redirect can be made to point to a different representation (e.g. HTML) by setting the Accept header in the GET request. We wouldn't need the Credential Registry to do a second redirect if we always wanted to return JSON-LD on a request for a URI.
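A minimal sketch of that single-redirect behaviour, in Python. The path layout is an assumption for illustration only: it treats /resources/{ctid} as the identifier of the credential itself and /graph/{ctid} as the identifier of the data about it, which is not how the Registry currently resolves these paths. The function name and the in-memory document store are hypothetical.

```python
# Hypothetical URL layout: /resources/{ctid} names the thing,
# /graph/{ctid} names the data that describes it.
RESOURCE_PREFIX = "/resources/"
GRAPH_PREFIX = "/graph/"


def respond(path, documents):
    """Return (status, headers, body) for a GET on `path`.

    `documents` maps a CTID to the JSON-LD text that describes it.
    """
    if path.startswith(RESOURCE_PREFIX):
        ctid = path[len(RESOURCE_PREFIX):]
        if ctid in documents:
            # The URI identifies the credential, which cannot be sent
            # over HTTP, so redirect to the data about it.
            return 303, {"Location": GRAPH_PREFIX + ctid}, ""
    elif path.startswith(GRAPH_PREFIX):
        ctid = path[len(GRAPH_PREFIX):]
        if ctid in documents:
            # The URI identifies a document, so return it directly.
            return 200, {"Content-Type": "application/ld+json"}, documents[ctid]
    return 404, {}, ""
```

A client that follows redirects, as wget does in the example above, would still end up with the JSON document after one extra round trip.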

A simple HTTP GET by URI has no way to distinguish between these [data from the owner of a Credential and all data about a credential], unless you're suggesting that the request needs to have some extra headers, or there should be some custom API involved.

Imagine we have two sets of data: "trusted" and "everything". Which you serve by default would be a policy decision. We would probably serve the "trusted" data. Whichever data we serve could include a link to retrieve data for the other option. One way to implement this would be for the link to be to the API and to allow a parameter for which graph(s) to query.
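As a rough sketch of that idea in Python: statements are held as quads (subject, predicate, object, graph), and a scope parameter decides which graphs are consulted. The data, the graph names, the `describe` function and the `scope` parameter are all made up for illustration; a real implementation would query the datastore rather than a list.

```python
# Each quad is (subject, predicate, object, graph). All values are made-up examples.
QUADS = [
    ("cred:1", "name", "Certificate in Welding", "graph:owner"),
    ("cred:1", "name", "Welding Cert (approved list entry)", "graph:cal"),
]

# Hypothetical metadata about each named graph, e.g. taken from its envelope.
FROM_PRIMARY_SOURCE = {"graph:owner": True, "graph:cal": False}


def describe(subject, scope="trusted"):
    """Return (predicate, object, graph) statements about `subject`.

    scope="trusted" keeps only graphs flagged as coming from a primary source;
    scope="everything" returns statements from every graph.
    """
    if scope not in ("trusted", "everything"):
        raise ValueError(f"unknown scope: {scope}")
    return [
        (p, o, g)
        for s, p, o, g in QUADS
        if s == subject
        and (scope == "everything" or FROM_PRIMARY_SOURCE.get(g, False))
    ]
```

Because each result keeps its graph name, a user interface can still show where every statement came from.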

Anyway, I am still mindful of your response that you don't have time to spend on this; I just want to be sure that you understand that this is a practical, implementable solution. It's not a paper-only solution; it is used at scale. It has advantages over ad hoc inventions that work for the Registry but don't travel when we want people working on other systems to link to our data.

philbarker commented 2 years ago

@stuartasutton wrote

I would be very surprised if there weren't people working on a JSON-LD expression. That's probably not a simple task since any current JSON parser can handle JSON-LD. However, RDF* may take it to a level that goes beyond conformant JSON which will lengthen the adoption curve.

Correct on all counts: https://json-ld.github.io/json-ld-star/

siuc-nate commented 2 years ago

Isn't this effectively just a way to do bnodes without giving them identifiers? Seems like it adds a lot of complexity for a problem that can already be solved relatively easily.

siuc-nate commented 2 years ago

Regarding cases where we're using/proposing to use bnodes to "append" data to "real" records: Would using schema:about be more appropriate than ceterms:sameAs?
Saying "The stuff in this bnode is about this other resource" might be more semantically acceptable than "This bnode is the same as this other resource". It would fall in line with our use of schema:about to connect a DataSetProfile to a Credential/etc.

philbarker commented 2 years ago

@siuc-nate there need be no bnodes. It's a general solution for having data about things and data about the data about things. BNodes do not solve that problem.

philbarker commented 2 years ago

:resource1 sdo:about :resource2 would be saying that resource1 was about resource2. Saying that you have a description set profile about a course is fine; saying that you have a course about a course is odd. I don't think you can get away from the fact that you have two people providing different data about the same thing, so you need to be able to describe the data separately from the resource it is about.

philbarker commented 2 years ago

Building on our discussions about identifying the source of published data in the envelope, and how that relates to named graphs: putting this context into the envelope will turn it into a reasonable first shot at a named graph:

  "@context": [
    "https://credreg.net/ctdl/schema/context/json",
    {
      "@vocab": "http://credreg.net/meta/terms/",
      "envelope_type": {
        "@id": "@type"
      },
      "decoded_resource": "@nest",
      "@base": "https://credentialengineregistry.org/ce-registry/resources/",
      "owned_by": {
        "@type": "@id"
      },
      "published_by": {
        "@type": "@id"
      }
    }
  ],

[In fact just the lines up to and including "decoded_resource" are necessary, but the others usefully link the publisher & owner to the data about them in the registry]

You can see it working in the JSON-LD playground -- choose the "compacted view" in the options at the bottom to see it as reasonably readable JSON-LD, with the data graph first followed by all the information about it, or "N-Quads" to see triples + the URI of the graph they come from.

Notes: This seems really straightforward to me. Nothing in the JSON envelope need change except for injecting that context block. I'm confident from the N-Quads output in the playground that it is working, though I haven't yet had the chance to try importing a lot of data into some other RDF datastore to check how it plays (I'll get to that soon I hope).
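For illustration, injecting the context could be as small as the following Python sketch. The context is copied from the block above; the `add_context` helper and the example envelope (with only two keys and made-up values) are hypothetical and not part of any Registry tooling.

```python
import json

# The context block from the post above, as a Python structure.
ENVELOPE_CONTEXT = [
    "https://credreg.net/ctdl/schema/context/json",
    {
        "@vocab": "http://credreg.net/meta/terms/",
        "envelope_type": {"@id": "@type"},
        "decoded_resource": "@nest",
        "@base": "https://credentialengineregistry.org/ce-registry/resources/",
        "owned_by": {"@type": "@id"},
        "published_by": {"@type": "@id"},
    },
]


def add_context(envelope):
    """Return a new dict with "@context" first, followed by the envelope's own keys.

    The input envelope is left unmodified.
    """
    return {"@context": ENVELOPE_CONTEXT, **envelope}


# Made-up envelope with only the keys needed for illustration.
example = {"envelope_type": "resource_data", "decoded_resource": {"@graph": []}}
print(json.dumps(add_context(example), indent=2))
```

The output can be pasted into the JSON-LD playground to inspect the resulting N-Quads.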

I've effectively minted URIs for a whole load of terms in the meta namespace, so there will be some work needs doing to define them, but I hope we know what they mean.

The /graph/ URIs effectively become the URIs that identify the data provided about a resource, though we can also point to the /envelope/ URIs when we want the data plus its metadata.
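A small Python sketch of how the three kinds of URI relate to one CTID under that reading. The registry base is taken from the @base in the context above; the exact /graph/ and /envelope/ path segments and the `uris_for` helper are assumptions for illustration.

```python
# Base taken from the @base value in the context; the path segments below are assumed.
REGISTRY = "https://credentialengineregistry.org/ce-registry"


def uris_for(ctid):
    """Map one CTID to the three URIs discussed in this thread."""
    return {
        "resource": f"{REGISTRY}/resources/{ctid}",  # the thing being described
        "graph": f"{REGISTRY}/graph/{ctid}",  # the data provided about it
        "envelope": f"{REGISTRY}/envelope/{ctid}",  # the data plus its metadata
    }
```

This keeps the existing rule that the CTID is the only part that varies between the URIs, so one can still be derived from another.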

philbarker commented 2 years ago

Update: that works in GraphDB, with a slight update to the @context from what I originally posted (I've edited the original post to correct the error).

I took a few records from the registry, added that context and the line "from_primary_source": true, or "from_primary_source": false, to the envelope, imported them into GraphDB and can now run queries like:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ceterms: <https://purl.org/ctdl/terms/>
PREFIX cemeta: <http://credreg.net/meta/terms/>
# Names of resources, taken only from graphs flagged as primary-source data
select ?s ?name where {
    ?g cemeta:from_primary_source true .
    GRAPH ?g {
        ?s ceterms:name ?name
    }
}

CredentialEngine / Schema-Development

Investigate named graphs #805