airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

do we use the decentralized identifiers spec and/or parts of it for our needs, or something simpler? #563

Open schristley opened 3 years ago

schristley commented 3 years ago

@bussec found some discussion on Mozilla that criticizes the DID spec.

I agree with one point by the reviewer (and haven't really investigated the other ones) in that the DID still requires a centralized registry. Specifically, with an id like did:NAME:SOMETHING, the NAME part has to be put in the central registry to reserve it and avoid conflicts. In a way, this is somewhat unavoidable. Take ontologies, for example: there's no central registry, but there is still an (informal/formal?) mechanism so that new ontologies don't reuse an existing name like "DOID" for disease ontology, "GO" for gene ontology, etc.
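To make the did:NAME:SOMETHING shape concrete, here is a minimal sketch of how a client might split such an identifier into its registered NAME (the "method") and the issuer-specific remainder. The function name and the `vdjserver` method name in the usage note are hypothetical, chosen only for illustration; the pattern assumes the method name is lowercase letters and digits, as in the DID syntax.

```python
import re

# Assumed shape: did:<method-name>:<method-specific-id>
# The method name is the part that would need to be reserved in a registry.
DID_PATTERN = re.compile(r"^did:([a-z0-9]+):(.+)$")


def parse_did(identifier):
    """Return (method_name, method_specific_id), or None if this is not a DID."""
    match = DID_PATTERN.match(identifier)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For example, `parse_did("did:vdjserver:abc123")` would yield the method `vdjserver` and the id `abc123`, while an ontology CURIE such as `DOID:1234` is not recognized as a DID and returns `None`.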

However, let's step back and consider what we need. In my opinion, what we primarily need is a way for an "issuer" of identifiers to be able to "control" the resolving of the identifier. That is something that CURIE does not support, or at least I don't think it does according to how we implement it. Let me know if I'm missing something?

Let me pose a simple example, given:

Currently, with CURIE, the client program controls resolving the identifier. It does it with 4 simple steps:

This works well for simple services like ontologies, but isn't very flexible or scalable for complex servers. For example, VDJServer issues and controls identifiers for many different types of data, e.g. studies, repertoires, germline sets, etc. With CURIEs, VDJServer has these requirements:

This isn't very scalable or flexible. The solution, which is one thing I like about DID, is that it shifts control of resolving from the client to the server. It does this by introducing an additional step in the process. In particular, it operates like CURIE but the constructed URL is for the "DID Method", and issuing a request to that URL doesn't return the data but instead returns a "DID Document". That "DID Document" contains within it the URL to retrieve that data but can also contain additional information (actually this part is a little vague in the DID spec as I mention at the bottom).
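The difference between the two approaches can be sketched as follows. This is only an illustration under stated assumptions, not how AIRR or VDJServer implement anything: the prefix map, the `example.org` URLs, the `vdjserver` method name, and the `fake_controller` stand-in (which replaces a real HTTP request) are all hypothetical. The document fields `id`, `service`, and `serviceEndpoint` follow the property names used for DID Documents.

```python
# CURIE style: the client owns the prefix map and builds the data URL itself.
CURIE_PREFIXES = {"DOID": "https://example.org/obo/DOID_"}  # hypothetical map


def expand_curie(curie, prefixes):
    """Expand PREFIX:LOCAL_ID into a URL using a client-side prefix map."""
    prefix, _, local_id = curie.partition(":")
    if not local_id or prefix not in prefixes:
        raise ValueError(f"cannot expand {curie!r}")
    return prefixes[prefix] + local_id


# DID style: the client only knows where to ask; the issuer decides the answer.
DID_METHODS = {"vdjserver": "https://example.org/did/"}  # hypothetical endpoint


def fake_controller(url):
    """Stand-in for an HTTP GET to the issuer's service (no network used here)."""
    documents = {
        "https://example.org/did/repertoire-1": {
            "id": "did:vdjserver:repertoire-1",
            "service": [{
                "type": "Repertoire",
                "serviceEndpoint": "https://example.org/airr/v1/repertoire/repertoire-1",
            }],
        }
    }
    return documents[url]


def resolve_did(did, methods, fetch):
    """Fetch the document for a DID and return the data URL chosen by the issuer."""
    scheme, method, specific_id = did.split(":", 2)
    if scheme != "did" or method not in methods:
        raise ValueError(f"cannot resolve {did!r}")
    document = fetch(methods[method] + specific_id)  # the extra step
    return document["service"][0]["serviceEndpoint"]
```

In the CURIE case the client must know one URL pattern per kind of record; in the DID case it needs a single entry per issuer, and the issuer can change where a study, repertoire, or germline set lives without clients updating anything.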

Let's take the same example but now with a specific identifier.

The process is similar to CURIE for the client, except that it recognizes `did` in the identifier and performs a URL lookup to the "DID Controller".

Now VDJServer has to be running a service which can receive the incoming requests.

Finally, the client has a "DID Document" specifically constructed for that identifier.

Some observations:

After all this, I find some parts of the DID spec to be opaque and not easy to understand. In particular, the last step of retrieving the actual data is not well specified; the DID spec says, "The details of how this process is accomplished are outside the scope of this specification..."

Lastly, this capability is my main interest regarding the DID spec. One option is to not consider the DID spec as a whole, but just take this portion.

bcorrie commented 8 months ago

This is also a big, complicated issue, and I don't think this will be resolved for v2.0. Can we remove the v2.0 tag @schristley @bussec?

schristley commented 8 months ago

It is unfortunate that we cannot resolve the identifier issue by v2.0, but I agree that at this point it won't happen without rushing. This is going to become an important issue for the AKC though, as provenance/validation and the ability for evidence chains to unambiguously reference records in source repositories will be needed. My suggestion is that all identifier-related issues be moved to v2.1.

bussec commented 7 months ago

I also agree that it is an important issue, especially when thinking about knowledge graphs and how to reliably link information together. I am still not convinced that DIDs are the best way to go as (in my perception) they mainly try to address identifiers for persons, not for objects. In addition, IMO the feature of the issuer controlling the resolving process has been addressed by indirect means (i.e., DNS resolvers and HTTP redirects), which seems to work ok for DOIs (or Handles in general), although I agree that protocols with better transparency would be preferable.

I am ok with removing this from the v2.0 list. Whether this is a v2.1 or a v3.0 issue depends on whether this is "just" about identifiers in repositories or whether we would restructure the Schema in general to accommodate DIDs (or other ids).