airr-knowledge / issues

Issues and project management for the AKC

CURIE conundrum #32

Open schristley opened 4 months ago

schristley commented 4 months ago

Let's start with the assumption that we store the whole CURIE in the database. For illustration, let's use repertoire in the ADC. Say the repertoire_id for a repertoire in VDJServer's ADC repository is VDJSERVER:12345. There is an ADC API endpoint to query a repertoire based upon its ID. If you construct the URL yourself, it looks like this:

curl https://vdjserver.org/airr/v1/repertoire/VDJSERVER:12345

But what if you use CURIE resolution? Well, first we define the prefix (and/or something more sophisticated like what AIRR has) for VDJSERVER. It would be something like this:

VDJSERVER: https://vdjserver.org/airr/v1/repertoire/

but now the CURIE resolution doesn't give us the proper URL, instead it gives this:

curl https://vdjserver.org/airr/v1/repertoire/12345

Now you could do a little hack with the prefix, like so:

VDJSERVER: https://vdjserver.org/airr/v1/repertoire/VDJSERVER:

and that will get you the right URL. But this only works if you use a CURIE style with PREFIX:ID. If you want the ID to also have a path within it, so the PREFIX can be the whole server instead of a specific endpoint, like VDJSERVER:repertoire/VDJSERVER:12345 for repertoire and VDJSERVER:rearrangement/VDJSERVER:6789 for rearrangements, then this makes things even worse.

VDJSERVER: https://vdjserver.org/airr/v1/

Now that seems to work for CURIE resolution like so:

curl https://vdjserver.org/airr/v1/repertoire/VDJSERVER:12345

but in fact, that doesn't work because the repertoire_id stored in the database is VDJSERVER:repertoire/VDJSERVER:12345 now.
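The three cases above can be reproduced with a minimal sketch of prefix-based CURIE expansion. The `expand` helper and the prefix tables are hypothetical and only illustrate plain string concatenation; the URLs are the ones from the examples above.

```python
# Minimal sketch of CURIE expansion by string concatenation.
# `expand` and the prefix tables are illustrative, not part of any AIRR library.

def expand(curie: str, prefixes: dict[str, str]) -> str:
    """Split on the first ':' and prepend the IRI registered for the prefix."""
    prefix, _, reference = curie.partition(":")
    return prefixes[prefix] + reference

plain = {"VDJSERVER": "https://vdjserver.org/airr/v1/repertoire/"}
hacked = {"VDJSERVER": "https://vdjserver.org/airr/v1/repertoire/VDJSERVER:"}
whole_server = {"VDJSERVER": "https://vdjserver.org/airr/v1/"}

stored_id = "VDJSERVER:12345"  # the whole CURIE is stored as repertoire_id

print(expand(stored_id, plain))   # loses the "VDJSERVER:" part of the stored ID
print(expand(stored_id, hacked))  # matches the hand-built URL

# With a whole-server prefix the URL comes out right, but only if the stored
# repertoire_id itself becomes the longer, nested string.
nested_id = "VDJSERVER:repertoire/VDJSERVER:12345"
print(expand(nested_id, whole_server))
```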

How to solve this conundrum?

jamesaoverton commented 4 months ago

I might not fully understand. It seems to me that you have two URIs in this example:

  1. https://vdjserver.org/airr/v1/repertoire/VDJSERVER:12345
  2. https://vdjserver.org/airr/v1/rearrangement/VDJSERVER:6789

It would be natural to define two prefixes:

  1. repertoire https://vdjserver.org/airr/v1/repertoire/VDJSERVER:
  2. rearrangement https://vdjserver.org/airr/v1/rearrangement/VDJSERVER:

Then you have these two CURIEs:

  1. repertoire:12345
  2. rearrangement:6789

schristley commented 4 months ago

@bcorrie Any thoughts? Relevant to this

schristley commented 4 months ago

It would be natural to define two prefixes:

  1. repertoire https://vdjserver.org/airr/v1/repertoire/VDJSERVER:
  2. rearrangement https://vdjserver.org/airr/v1/rearrangement/VDJSERVER:

Then you have these two CURIEs:

  1. repertoire:12345
  2. rearrangement:6789

That probably makes more sense than to have a single prefix for all object types that can be resolved by the same server. I suppose I was thinking about the CURIE syntax which seems to indicate that you can have / in the ID part, and I was trying to contrive an example.

jamesaoverton commented 4 months ago

Yes, you're allowed to have / in the suffix part of a CURIE, but a word of warning: Turtle and SPARQL syntax use QNames, not CURIEs. QNames look the same for most cases but they are not as general, and do not allow / in the suffix. (QNames are the element names for XML with namespaces, where a / would not make sense.) For more see https://en.wikipedia.org/wiki/CURIE

So it's better to stick to the QName subset if possible.
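A rough way to flag identifiers that stray from this subset is sketched below. The helper name `is_qname_friendly` is made up for illustration, and it only checks the rule of thumb from this comment (no / or : in the suffix); it is not a validator for the XML QName, Turtle, or SPARQL grammars.

```python
# Rough check for the rule of thumb above: keep "/" and ":" out of the suffix.
# This is NOT a full QName/Turtle/SPARQL grammar check, only a quick filter.

def is_qname_friendly(curie: str) -> bool:
    """Return True if the part after the first ':' has no '/' or ':'."""
    _prefix, sep, suffix = curie.partition(":")
    if not sep:
        return False  # no prefix separator at all, so not a CURIE of this style
    return "/" not in suffix and ":" not in suffix

for candidate in [
    "repertoire:12345",
    "rearrangement:6789",
    "VDJSERVER:repertoire/VDJSERVER:12345",
]:
    print(candidate, is_qname_friendly(candidate))
```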

bcorrie commented 4 months ago

This is also related to the discussion here:

This is why I was suggesting that we use OGRDB_GERMLINESET as the prefix as with resolving:

OGRDB_GERMLINESET:G00003-1 with "OGRDB_GERMLINESET=https://ogrdb.airr-community.org/api/germline/set?"

If we wanted an allele CURIE we would use something like:

OGRDB_ALLELE:A000001 with something like "OGRDB_ALLELE=https://ogrdb.airr-community.org/api/germline/allele?"

So different CURIE endpoints for the two types of object, similar to the repertoire and rearrangement you have above.

bcorrie commented 4 months ago

In the AIRR Standard we have the CURIEMap concept: https://github.com/airr-community/airr-standards/blob/a8c853266bd6a26e91acd722af0ec5034b42fc7a/specs/airr-schema.yaml#L29

Documented here: https://docs.airr-community.org/en/stable/ontovoc/introduction_ontovoc.html#ontology-data-representation

This provides some details on mapping CURIEs and how to look up "entities" identified with a CURIE using "providers".

schristley commented 4 months ago

This is why I was suggesting that we use OGRDB_GERMLINESET as the prefix as with resolving:

OGRDB_GERMLINESET:G00003-1 with "OGRDB_GERMLINESET=https://ogrdb.airr-community.org/api/germline/set?"

If we wanted an allele CURIE we would use something like:

OGRDB_ALLELE:A000001 with something like "OGRDB_ALLELE=https://ogrdb.airr-community.org/api/germline/allele?"

So different CURIE endpoints for the two types of object, similar to the repertoire and rearrangement you have above.

Yes, I agree now. I was thinking about it the wrong way.

I guess we could debate if we want long prefixes like that or something shorter, but regardless it sounds like William doesn't really want to use a CURIE, which requires resolution, and would prefer to just stick in the IRI, which I think should be acceptable. However, I guess it's up to the standards group whether they want a specific policy.

schristley commented 4 months ago

In the AIRR Standard we have the CURIEMap concept

Yes, and if we move ADC to this, for each endpoint, then we would need to add prefixes like IPA1_REPERTOIRE, IPA1_REARRANGEMENT, VDJSERVER_REPERTOIRE, and so on for all of the ADC repositories and object types, and those become the repertoire_id, sequence_id, and so on for all of the ADC objects.
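To get a feel for how the prefix list grows, here is a small sketch that generates one prefix per repository and object type. The `build_prefix_map` helper is hypothetical, and the IPA1 base URL is a placeholder, since only the VDJServer URL appears in this thread.

```python
from itertools import product

# Base URLs per repository. Only the VDJServer one comes from this thread;
# the IPA1 URL is a placeholder for illustration.
REPOSITORIES = {
    "VDJSERVER": "https://vdjserver.org/airr/v1/",
    "IPA1": "https://ipa1.example.org/airr/v1/",  # hypothetical
}
OBJECT_TYPES = ["repertoire", "rearrangement"]

def build_prefix_map(repositories: dict[str, str], object_types: list[str]) -> dict[str, str]:
    """One prefix per (repository, object type) pair, e.g. VDJSERVER_REPERTOIRE."""
    return {
        f"{repo}_{obj.upper()}": f"{base}{obj}/"
        for (repo, base), obj in product(repositories.items(), object_types)
    }

prefix_map = build_prefix_map(REPOSITORIES, OBJECT_TYPES)
for name, iri in prefix_map.items():
    print(name, iri)
print(len(prefix_map), "prefixes for", len(REPOSITORIES), "repositories")
```

The count is simply repositories times object types, so every new repository or object type adds a full row or column of prefixes.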

bcorrie commented 4 months ago

I think it would be nice to have ADC_REARRANGEMENT and ADC_REPERTOIRE and have a service that looked these up across the multiple services in the ADC. Back in iR+ we talked about an ADC aggregator - but we didn't really have funding for that. Maybe that is something the AKC might do?

bcorrie commented 4 months ago

Currently, the AIRR CURIEMap can have multiple providers (e.g. vdjserver, ipa1, ...), but the intent of the CURIEMap is that all providers are supposed to resolve to the same thing - they are different ways of looking up a CURIE like X:Y where Y would be found on all providers. At least that is my understanding.

schristley commented 4 months ago

Currently, the AIRR CURIEMap can have multiple providers (e.g. vdjserver, ipa1, ...), but the intent of the CURIEMap is that all providers are supposed to resolve to the same thing - they are different ways of looking up a CURIE like X:Y where Y would be found on all providers. At least that is my understanding.

Yes, and that makes sense if they are providing the exact same thing, like two different services providing the exact same ontology term like OBI:0000181. But in the ADC case, each repository has their own set of unique repertoires, so you wouldn't expect to ask IPA1 for a VDJServer repertoire and vice versa.

schristley commented 4 months ago

I think it would be nice to have ADC_REARRANGEMENT and ADC_REPERTOIRE and have a service that looked these up across the multiple services in the ADC. Back in iR+ we talked about an ADC aggregator - but we didn't really have funding for that. Maybe that is something the AKC might do?

Possibly, we'd have to talk more about how that would work. I think there would still need to be some way to know the original source of the data record. For example, if the ID is ADC_REPERTOIRE:12345, then the aggregator would have to somehow know (presumably by keeping a mapping) that the repertoire came from the covid-1 repository and what repertoire_id to use, which I guess wouldn't necessarily be 12345.

Unless you are thinking something different? The AKC will definitely want to maintain links back to the source data repositories that are being integrated, but whether it then publishes new IDs for them is something to think about.

bcorrie commented 4 months ago

I think conceptually that is what we are looking for though, in both the ADC and the Germline cases. That is, there is some globally unique identifier for a Repertoire (e.g. 5ed6859e99011334ac05e847 ) and we want to be able to say ADC_REPERTOIRE:5ed6859e99011334ac05e847 as a CURIE to denote that it can be found in the ADC as a "Data Commons".

This is similar to me saying I am interested in DOI:10.1111/imr.12666 - which sends me to https://doi.org/10.1111/imr.12666, which is really https://onlinelibrary.wiley.com/doi/10.1111/imr.12666

What I am saying is that ADC_REPERTOIRE:5ed6859e99011334ac05e847 sends me to the aggregator adc.airr-community.org/airr/v1/repertoire/65e53410d19cbac6daa4d0d6 - which in turn sends me to https://covid19-1.ireceptor.org/airr/v1/repertoire/5ed6859e99011334ac05e847

If I put in a VDJServer repertoire, it takes me to VDJServer in the same way. The aggregator figures out where the unique ID is actually resolving to, just like doi.org does...

I think it could be similar for Germline. We want GERMLINESET:G00003-1 to take you to germline.airr-community.org/germline/G00003-1 which in turn would redirect you to https://ogrdb.airr-community.org/api/germline/set?G00003-1

But if the user tried to resolve GERMLINESET:I000030-1 it might end up at IMGT. The aggregator figures out where the entity is. This is assuming IMGT followed the protocols and query API.

bcorrie commented 4 months ago

But in the ADC case, each repository has their own set of unique repertoires, so you wouldn't expect to ask IPA1 for a VDJServer repertoire and vice versa.

But it is OK to ask both IPA1 and VDJServer if they have repertoire 5ed6859e99011334ac05e847, no? Only one of them will have it. And in the above scenario, if you asked adc.airr-community.org it would redirect you to where the repertoire is.

In both cases it is possible to discover/find/access ADC_REPERTOIRE:5ed6859e99011334ac05e847, but there is more work required if we don't have a central aggregator. Strictly speaking this isn't a CURIE as it can't be resolved by a single IRI, but conceptually it is describing what we want.

bcorrie commented 4 months ago

If I take liberties with the AIRR CURIEMap, it could look like the following. I added an array of providers, probably not the correct syntax, but you get the idea.

CURIEMap:
    ADC_REPERTOIRE:
        type: identifier
        default:
            map: ADC_REPERTOIRE
            provider: ADC_REPERTOIRE
        map:
            ADC_REPERTOIRE:
                iri_prefix: "https://adc.airr-community.org/airr/v1/repertoire/"

InformationProvider:
    provider:
        ADC_REPERTOIRE:
            - url: "https://covid19-1.ireceptor.org/airr/v1/repertoire/"
              response: application/json
            - url: "https://covid19-2.ireceptor.org/airr/v1/repertoire/"
              response: application/json
            # ... and so on for the other ADC repositories ...
            - url: "https://vdjserver.org/airr/v1/repertoire/"
              response: application/json

The intent is to suggest that the above providers can be searched for a CURIE that is prefixed with ADC_REPERTOIRE and that the map IRI can "resolve" the CURIE.

This is different from the current use in the AIRR spec, but it does capture what we are looking for.
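As a rough sketch of what "searching the providers" could look like on the client side, assuming no central aggregator exists: the `find_provider` function and the `exists` callback are hypothetical, and the callback stands in for an HTTP request so the example runs without a network.

```python
from typing import Callable, Optional

# Provider base URLs taken from the list above.
PROVIDERS = [
    "https://covid19-1.ireceptor.org/airr/v1/repertoire/",
    "https://covid19-2.ireceptor.org/airr/v1/repertoire/",
    "https://vdjserver.org/airr/v1/repertoire/",
]

def find_provider(
    curie: str,
    exists: Callable[[str], bool],
    providers: list[str] = PROVIDERS,
) -> Optional[str]:
    """Ask each provider in turn; return the first URL that has the record."""
    prefix, _, local_id = curie.partition(":")
    if prefix != "ADC_REPERTOIRE":
        raise ValueError(f"unsupported prefix: {prefix!r}")
    for base in providers:
        url = base + local_id
        if exists(url):  # in practice this would be an HTTP lookup
            return url
    return None

# Stand-in for the network: pretend only covid19-1 holds this repertoire.
known_urls = {
    "https://covid19-1.ireceptor.org/airr/v1/repertoire/5ed6859e99011334ac05e847",
}
print(find_provider("ADC_REPERTOIRE:5ed6859e99011334ac05e847", known_urls.__contains__))
```

Note that this needs up to one request per provider for every lookup, which is the extra work an aggregator would hide.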

schristley commented 4 months ago

I feel breaking the strict CURIE meaning will cause problems down the road, though it's a neat technical trick.

I am receptive to the idea of hiding the decentralized nature of the ADC with global IDs but that should be done with an aggregator service. Right now that's something for AIRR and CRWG to consider, and not in the AKC scope. AKC is going to have so many of these identifiers that it won't want to have to treat any of them specially, and will want to assume that a simple resolution procedure will give an IRI that returns that data.

One note on an aggregator service implementation. However the IDs are done, you would want the translation to the source IRI to be easy. If you had to maintain a mapping record for every single identifier value then that becomes somewhat crazy with billions of rearrangement IDs...

schristley commented 3 months ago

It would be natural to define two prefixes:

  1. repertoire https://vdjserver.org/airr/v1/repertoire/VDJSERVER:
  2. rearrangement https://vdjserver.org/airr/v1/rearrangement/VDJSERVER:

Then you have these two CURIEs:

  1. repertoire:12345
  2. rearrangement:6789

The unfortunate part about this is it creates an explosion of prefixes. Say with the AKC, we will provide many objects for external identification like investigations, participants, assays, etc. We need to create individual prefixes for all of them. That could possibly be in the hundreds.

@bcorrie This is reminding me now why I was looking at the decentralized identifiers. Because it allows for a single prefix, like OGRDB: instead of a prefix for each object type.

The issue is that requires a two-step resolution process. Instead of OGRDB_GERMLINESET:G00003-1 and OGRDB_ALLELE:A000001, you have OGRDB:GERMLINESET:G00003-1 and OGRDB:ALLELE:A000001.

The first step is the normal CURIE resolution with prefix OGRDB=https://ogrdb.airr-community.org/resolver/ giving

curl https://ogrdb.airr-community.org/resolver/GERMLINESET:G00003-1
curl https://ogrdb.airr-community.org/resolver/ALLELE:A000001

The resolver endpoint doesn't return the actual object. Instead, it does the second step of the resolution: it resolves the identifiers GERMLINESET:G00003-1 and ALLELE:A000001, then returns the actual URLs.

https://ogrdb.airr-community.org/api/germline/set/G00003-1
https://ogrdb.airr-community.org/api/germline/allele/A000001

Conceptually, this might not seem any different from the aggregator we were talking about above, because you do need that resolver service. However, the difference is the resolution program does the two-step process.

The advantage of the resolver though is that the mapping of the identifier to the actual URL, i.e. the second step in the process, is performed by the resolver and can be more sophisticated, if desired, than a simple string concatenation.
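Here is a minimal sketch of the two-step process, with the resolver simulated locally. The function names and the `RESOLVER_ROUTES` table are assumptions for illustration; a real resolver would be a web service, and its routing could be more elaborate than this lookup. The prefix is written with a trailing slash so that concatenation yields the resolver URLs shown above.

```python
# Step one: ordinary CURIE expansion with a single prefix per server.
PREFIXES = {"OGRDB": "https://ogrdb.airr-community.org/resolver/"}

# Step two: a routing table the resolver service might hold (hypothetical).
RESOLVER_ROUTES = {
    "GERMLINESET": "https://ogrdb.airr-community.org/api/germline/set/",
    "ALLELE": "https://ogrdb.airr-community.org/api/germline/allele/",
}

def step_one(identifier: str) -> str:
    """Expand the outer prefix only, giving the resolver URL."""
    prefix, _, rest = identifier.partition(":")
    return PREFIXES[prefix] + rest

def step_two(resolver_url: str) -> str:
    """What the resolver would do server-side: map TYPE:ID to the object URL."""
    nested = resolver_url.rsplit("/", 1)[1]
    object_type, _, local_id = nested.partition(":")
    return RESOLVER_ROUTES[object_type] + local_id

def resolve(identifier: str) -> str:
    return step_two(step_one(identifier))

for identifier in ["OGRDB:GERMLINESET:G00003-1", "OGRDB:ALLELE:A000001"]:
    print(identifier, "->", step_one(identifier), "->", resolve(identifier))
```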

jamesaoverton commented 3 months ago

I see the advantage of decentralized identifiers. I don't see the harm in having a few dozen prefixes.

The CURIE grammar is very permissive, so as far as I can tell OGRDB:GERMLINESET:G00003-1 is a valid CURIE. It is not a valid QName, so it will not work in Turtle or SPARQL. So if you use : or / characters then you will not be able to directly convert to Turtle or SPARQL. Maybe that's not a problem here.

My opinion is that this nested CURIE approach is "too clever". I expect bugs where the CURIE is only expanded once rather than twice, and I'm worried about ambiguity when compressing an IRI to a CURIE.

This is not my decision to make, but in my opinion these are the three options from best to worst:

  1. many prefixes: OGRDB_GERMLINESET:G00003-1, OGRDB_ALLELE:A000001
  2. fewer prefixes and "paths": OGRDB:germline/set/G00003-1, OGRDB:germline/allele/A000001
  3. fewer prefixes with nesting: OGRDB:GERMLINESET:G00003-1, OGRDB:ALLELE:A000001

If none of these are satisfactory, just store the full IRI.
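To illustrate the compression ambiguity mentioned above, here is a small sketch where two registered prefixes both match the same IRI. The prefix IRIs and helper names are assumptions for illustration, and the longest-match rule is one common convention, not something this thread has agreed on.

```python
# Two overlapping prefixes (hypothetical IRIs): options 1 and 2 side by side.
PREFIXES = {
    "OGRDB": "https://ogrdb.airr-community.org/api/",
    "OGRDB_GERMLINESET": "https://ogrdb.airr-community.org/api/germline/set/",
}

def candidates(iri: str) -> list[str]:
    """Every CURIE that expands back to this IRI under the registered prefixes."""
    return [
        f"{prefix}:{iri[len(base):]}"
        for prefix, base in PREFIXES.items()
        if iri.startswith(base)
    ]

def compress(iri: str) -> str:
    """Pick the longest matching prefix IRI; fall back to the full IRI."""
    matches = [(len(base), prefix) for prefix, base in PREFIXES.items() if iri.startswith(base)]
    if not matches:
        return iri
    _, prefix = max(matches)
    return f"{prefix}:{iri[len(PREFIXES[prefix]):]}"

iri = "https://ogrdb.airr-community.org/api/germline/set/G00003-1"
print(candidates(iri))  # more than one valid compression
print(compress(iri))    # deterministic only because of the longest-match rule
```

Without an agreed tie-breaking rule, two tools could store different CURIEs for the same object, which makes string comparison of identifiers unreliable.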

schristley commented 2 months ago

My opinion is that this nested CURIE approach is "too clever". I expect bugs where the CURIE is only expanded once rather than twice, and I'm worried about ambiguity when compressing an IRI to a CURIE.

Yeah, and we probably really shouldn't call it a CURIE as that adds confusion.

If none of these are satisfactory, just store the full IRI.

That's true, and if we are going to have lots and lots of prefixes just to use a CURIE, that sounds like more work/maintenance than just using an IRI.

schristley commented 2 months ago

When the data is integrated into the AKC, the AIRR identifiers should be converted.