do we use the decentralized identifiers spec and/or parts of it for our needs, or something simpler?

schristley commented 3 years ago

@bussec found some discussion on Mozilla that criticizes the DID spec.

I agree with one point by the reviewer (and haven't really investigated the other ones) in that the DID still requires a centralized registry. Specifically, with an id like did:NAME:SOMETHING, the NAME part has to be put in the central registry to reserve it and avoid conflicts. In a way, this is somewhat unavoidable. Take ontologies, for example: there's no central registry, but there is still an (informal/formal?) mechanism so that new ontologies don't reuse an existing name like "DOID" for disease ontology, "GO" for gene ontology, etc.

However, let's step back and consider what we need. In my opinion, what we primarily need is a way for an "issuer" of identifiers to be able to "control" the resolving of the identifier. That is something that CURIE does not support, or at least I don't think it does according to how we implement it. Let me know if I'm missing something?

Let me pose a simple example, given:

an identifier
client program that wishes to resolve the identifier into data
server program that issued the identifier and controls the data for that identifier.

Currently, with CURIE, the client program controls resolving the identifier. It does it with 4 simple steps:

Extract the prefix, e.g. "DOID"
Look up the prefix in CURIEResolution in the AIRR spec.
Construct a URL from the CURIEResolution and the identifier.
Issue request to URL to retrieve data.
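
To make the four steps concrete, here is a minimal sketch of what the client does today. The CURIE_RESOLUTION table and its URL templates are made up for illustration (they stand in for the CURIEResolution entries in the AIRR spec, whose actual layout may differ), and the function names are hypothetical.

```python
from urllib.request import urlopen

# Hypothetical stand-in for CURIEResolution in the AIRR spec.
# The example.org URL templates are placeholders, not real endpoints.
CURIE_RESOLUTION = {
    "DOID": "https://example.org/ontology/doid/{local_id}",
    "VDJSERVER_STUDY": "https://example.org/vdjserver/study/{local_id}",
}


def curie_to_url(curie, table=CURIE_RESOLUTION):
    """Steps 1-3: extract the prefix, look it up, construct the URL."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in table:
        raise KeyError(f"prefix {prefix!r} is not listed in the resolution table")
    return table[prefix].format(local_id=local_id)


def resolve_curie(curie):
    """Step 4: issue the request and return the raw response body."""
    with urlopen(curie_to_url(curie)) as response:
        return response.read()
```

Note that everything needed to build the URL lives on the client side, in the table. The server has no say in it.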

This works well for simple services like ontologies, but isn't very flexible or scalable for complex servers. For example, VDJServer issues and controls identifiers for many different types of data, e.g. studies, repertoires, germline sets, etc. With CURIEs, VDJServer has these requirements:

There needs to be a bunch of prefixes for the different data types: VDJSERVER_STUDY, VDJSERVER_THIS, VDJSERVER_THAT, which all need to be listed in CURIEResolution.
Every time VDJServer decides to add another data type, the AIRR spec needs to be updated (and published) to be usable.
VDJServer cannot control how the ID is turned into a resolvable URL. It has to organize its APIs to conform to the CURIE spec, in order to be resolvable according to the simple steps above.

This isn't very scalable or flexible. The solution, which is one thing I like about DID, is that it shifts control of resolving from the client to the server. It does this by introducing an additional step in the process. In particular, it operates like CURIE but the constructed URL is for the "DID Method", and issuing a request to that URL doesn't return the data but instead returns a "DID Document". That "DID Document" contains within it the URL to retrieve that data but can also contain additional information (actually this part is a little vague in the DID spec as I mention at the bottom).

Let's take the same example but now with a specific identifier.

a decentralized identifier: did:VDJSERVER:SOMETHING

The process is similar to CURIE for the client, except it recognizes did in the identifier and performs a URL lookup to the "DID Controller".

Recognize did and the DID Method as VDJSERVER
Look up the URL for the DID Method in CURIEResolution (or maybe DIDResolution?) for VDJSERVER
Send request to URL with the decentralized identifier.

Now VDJServer has to be running a service which can receive the incoming requests.

VDJServer receives the request, interprets the SOMETHING part of the identifier, and returns the appropriate "DID Document"

Finally, the client has a "DID Document" specifically constructed for that identifier.

Client receives "DID Document"
Send a request using the information in "DID Document" to get the actual data.
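
The whole flow above could be sketched roughly as follows. This is only an illustration under several assumptions: the DID_RESOLUTION table, the resolver URL, the "did" request parameter and the "data_url" field in the returned document are all invented here (the DID spec does not define how the data is retrieved), and fake_vdjserver stands in for both the client's HTTP layer and the VDJServer service so the example runs without a network.

```python
import json

# Hypothetical single entry for VDJServer in the AIRR spec.
DID_RESOLUTION = {"VDJSERVER": "https://example.org/vdjserver/did/resolve"}


def parse_did(did):
    """Split did:METHOD:SOMETHING; the client only needs METHOD."""
    parts = did.split(":", 2)
    if len(parts) != 3 or parts[0] != "did" or not parts[1] or not parts[2]:
        raise ValueError(f"not a decentralized identifier: {did!r}")
    return parts[1], parts[2]


def resolve_did(did, http_get, table=DID_RESOLUTION):
    """Client side: ask the issuer for a document, then follow it to the data."""
    method, _ = parse_did(did)  # SOMETHING is not interpreted by the client
    document = json.loads(http_get(table[method], {"did": did}))
    # "data_url" is an assumed field name, not something the DID spec defines.
    return http_get(document["data_url"], {})


def fake_vdjserver(url, params):
    """Stand-in for the server: it alone decides where the data lives."""
    one_time_url = "https://example.org/vdjserver/data/once/abc123"
    if url == DID_RESOLUTION["VDJSERVER"]:
        return json.dumps({"id": params["did"], "data_url": one_time_url})
    if url == one_time_url:
        return json.dumps({"study": "SOMETHING"})
    raise KeyError(url)
```

Calling resolve_did("did:VDJSERVER:SOMETHING", fake_vdjserver) walks both hops. A real client would pass a function that performs actual HTTP requests instead of fake_vdjserver.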

Some observations:

The client doesn't interpret the identifier beyond recognizing VDJSERVER
The process is the same for the client for all decentralized identifiers.
There's a single entry point provided by VDJServer to "resolve" all identifiers. That's easy to manage with just a single persistent entry in the AIRR spec. New data types could be added at any time without changing the AIRR spec.
The actual URL to get the data can be created dynamically if desired. For example, VDJServer could generate single-use URLs to retrieve data.
The "DID Document" can contain additional information. For example, it can describe what data formats are available so the client could request JSON or TSV formats.
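
As an illustration of the last point, a document that advertises several formats might look something like the sketch below. The "service" and "serviceEndpoint" names are loosely borrowed from the DID spec, but the "format" key, the URLs and the pick_url helper are assumptions made up for this example.

```python
# Hypothetical "DID Document" returned by VDJServer for one identifier.
did_document = {
    "id": "did:VDJSERVER:SOMETHING",
    "service": [
        {"format": "json", "serviceEndpoint": "https://example.org/vdjserver/data/SOMETHING.json"},
        {"format": "tsv", "serviceEndpoint": "https://example.org/vdjserver/data/SOMETHING.tsv"},
    ],
}


def pick_url(document, preferred_formats):
    """Return the endpoint for the first preferred format the server offers."""
    available = {s["format"]: s["serviceEndpoint"] for s in document["service"]}
    for fmt in preferred_formats:
        if fmt in available:
            return available[fmt]
    raise LookupError(f"none of {preferred_formats} offered for {document['id']}")
```

A client that prefers TSV but accepts JSON would call pick_url(did_document, ["tsv", "json"]). The server can add or drop formats without any change to the AIRR spec.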

After all this, I find some parts of the DID spec to be opaque and not easy to understand. In particular, the latter step of retrieving the actual data is not well specified; the DID spec says, "The details of how this process is accomplished are outside the scope of this specification..."

Lastly, this capability is my main interest regarding the DID spec. One option is to not consider the DID spec as a whole, but just take this portion.

bcorrie commented 9 months ago

This is also a big, complicated issue, and I don't think this will be resolved for v2.0. Can we remove the v2.0 tag @schristley @bussec?

schristley commented 9 months ago

It is unfortunate that we cannot resolve the identifier issue by v2.0, but I agree that at this point it won't happen without rushing. This is going to become an important issue for the AKC though, as provenance/validation and the ability for evidence chains to unambiguously reference records in source repositories will be needed. My suggestion is that all identifier-related issues be moved to v2.1.

bussec commented 9 months ago

I also agree that it is an important issue, especially when thinking about knowledge graphs and how to reliably link information together. I am still not convinced that DIDs are the best way to go as (in my perception) they mainly try to address identifiers for persons, not for objects. In addition, IMO the feature of the issuer controlling the resolving process has been addressed by indirect means (i.e., DNS resolvers and HTTP redirects), which seems to work ok for DOI (or Handles in general), although I agree that protocols with better transparency would be preferable.

I am ok with removing this from the v2.0 list. Whether this is a v2.1 or a v3.0 issue depends on whether this is "just" about identifiers in repositories or whether we would restructure the Schema in general to accommodate DID (or other ids).

airr-community / airr-standards

do we use the decentralized identifiers spec and/or parts of it for our needs, or something simpler? #563