airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Correct CURIE format for ontologies #630

Closed bcorrie closed 2 years ago

bcorrie commented 2 years ago

@bussec and @schristley, it is my understanding that in the correct CURIE format the CURIE prefix is defined in the AIRR Spec, and that this prefix is a "precise" definition that maps to other possible forms in various IRIs depending on the provider. For example, the AIRR CURIE prefix in the spec for the NCBI Taxonomy is NCBITAXON, all upper case. So a correct CURIE would be NCBITAXON:9606.

That sometimes maps to NCBITaxon in some providers, but an AIRR CURIE of NCBITaxon:9606 would be incorrect I think? Do I have that correct?
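A minimal sketch of the rule as described above, assuming a case-sensitive prefix check. The prefix set and the function names are hypothetical and contain only the one prefix discussed here; the authoritative list is the one in the AIRR spec.

```python
# Illustrative subset only: the real prefix list is defined in the AIRR spec.
AIRR_CURIE_PREFIXES = {"NCBITAXON"}


def split_curie(curie):
    """Split 'PREFIX:local_id' into (prefix, local_id), or return None if malformed."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        return None
    return prefix, local_id


def is_compliant(curie, prefixes=AIRR_CURIE_PREFIXES):
    """True only if the value is a CURIE whose prefix matches exactly (case-sensitive)."""
    parts = split_curie(curie)
    return parts is not None and parts[0] in prefixes


for value in ("NCBITAXON:9606", "NCBITaxon:9606", "NCBITaxon_9606"):
    print(value, is_compliant(value))
```

Under these assumptions only NCBITAXON:9606 passes: NCBITaxon:9606 has the wrong case, and NCBITaxon_9606 has no colon, so it is not a CURIE at all.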

On the upcoming v4.0 release of the Gateway (using AIRR v1.4, ADC API v1.2), we are now searching on Taxonomy IDs for accuracy (rather than the possibly ambiguous label) and currently this is an exact string match. So something with a CURIE prefix that doesn't match the exact CURIE prefix string will not be found. I think this is the correct behaviour but wanted to confirm.
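To illustrate the consequence of an exact string match, here is a small sketch. It is not the Gateway's code: the function name and the repertoire IDs are made up, and the records are reduced to the subject.species ontology field.

```python
def find_by_species_id(repertoires, species_id):
    """Return repertoires whose subject.species.id equals species_id exactly."""
    return [
        rep
        for rep in repertoires
        if rep.get("subject", {}).get("species", {}).get("id") == species_id
    ]


# Hypothetical records that differ only in how the species ID is written.
repertoires = [
    {"repertoire_id": "rep-1",
     "subject": {"species": {"id": "NCBITAXON:9606", "label": "Homo sapiens"}}},
    {"repertoire_id": "rep-2",
     "subject": {"species": {"id": "NCBITaxon:9606", "label": "Homo sapiens"}}},
    {"repertoire_id": "rep-3",
     "subject": {"species": {"id": "NCBITaxon_9606", "label": "Homo sapiens"}}},
]

hits = find_by_species_id(repertoires, "NCBITAXON:9606")
print([rep["repertoire_id"] for rep in hits])
```

Only the first record is returned, even though all three describe the same species and carry the same label.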

@schristley we are seeing CURIEs on VDJServer with both NCBITaxon:9606 and the older style NCBITaxon_9606. This will cause problems when searching non-compliant ontology fields (assuming these are non-compliant), as the query will not match. Not sure how widespread this is. Today (iReceptor v3.0) we search on the ontology label, and long ago we fixed the mismatched label fields in all of our repositories. But it looks like we didn't do this for the ontology IDs? Assuming I am correct above, I am hoping you can correct these?

@schristley we have some ontology checking code (https://github.com/sfu-ireceptor/sandbox/tree/master/ontology-check) that you can use to check your repository for cases like this. It did indeed find some non-compliant ontology CURIEs:

Lots of compliant lines deleted

Info: Processing repertoire 2138356813095506412-242ac114-0001-012
Info: Processing repertoire 2138528607492379116-242ac114-0001-012
Info: Processing repertoire 2130325219957019116-242ac114-0001-012
Info: Processing repertoire 2648490830777881066-242ac113-0001-012
ERROR: Curie prefix NCBITaxon not in IRI list
Info: Processing repertoire 2669106673798681066-242ac113-0001-012
ERROR: Curie prefix NCBITaxon not in IRI list
Info: Processing repertoire 2686329492655641066-242ac113-0001-012
ERROR: Curie prefix NCBITaxon not in IRI list
Info: Processing repertoire 2700631733751321066-242ac113-0001-012
ERROR: Curie prefix NCBITaxon not in IRI list
Info: Processing repertoire 2714891025174041066-242ac113-0001-012
ERROR: Curie prefix NCBITaxon not in IRI list

Killed the test after it found some issues...

It looks like there are probably a couple of studies that have the different CURIEs?

bussec commented 2 years ago

@bcorrie Your summary is correct, just note that the ALLCAPS prefix is an AIRR convention; it is not part of the W3C TR. Furthermore, the TR does not make any statements on case-insensitivity, so I assume that prefix matching is case-sensitive. As the key purpose of using CURIEs is to abstract the actual IRI, we should use only one capitalization scheme per prefix.
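A short sketch of that abstraction, assuming a case-sensitive lookup table from CURIE prefix to IRI prefix. The table and the function name are hypothetical, and the single entry uses the OBO PURL form for NCBI Taxonomy; the real mapping is whatever the AIRR spec defines.

```python
# Hypothetical one-entry map: the AIRR-style prefix on the left, one provider's IRI prefix on the right.
IRI_PREFIXES = {"NCBITAXON": "http://purl.obolibrary.org/obo/NCBITaxon_"}


def expand_curie(curie, iri_prefixes=IRI_PREFIXES):
    """Expand 'PREFIX:local_id' to a full IRI using an exact, case-sensitive prefix lookup."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in iri_prefixes:
        raise KeyError(f"CURIE prefix {prefix} not in IRI list")
    return iri_prefixes[prefix] + local_id


print(expand_curie("NCBITAXON:9606"))
```

The IRI itself may spell the name as NCBITaxon, but the CURIE prefix used as the lookup key stays in a single capitalization, so NCBITaxon:9606 raises a KeyError here.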

javh commented 2 years ago

What's the action item on this? Is there anything for v1.4 here?

bussec commented 2 years ago

@javh No, this is an issue of a repository in the ADC, IMO the expected standard is clear. If not, we need more documentation, but that's something for v2.0.

schristley commented 2 years ago

nothing to be done by airr-standards

schristley commented 2 years ago

> @javh No, this is an issue of a repository in the ADC, IMO the expected standard is clear. If not, we need more documentation, but that's something for v2.0.

Validating resolution of ontology terms within a DataFile would be a nice option/enhancement.

bcorrie commented 2 years ago

I agree, no action for v1.4. This issue predates the rule about no issues until 1.4 is released (I think 8-)

We can reuse our code in the AIRR python library if we want, and make it an option. Just to point out that it does actually query OLS and check the ontology terms, so it is a pretty substantial operation...

Also, it does strict checking in the sense that it checks the label from OLS for an exact string match, so it does not consider label aliases and does not accept case differences. So a label of "Human" or "homo sapiens" for "NCBITAXON:9606" will fail because the strict match is "Homo sapiens".
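A minimal sketch of what strict label checking means here. This is not the ontology-check code: the dictionary is a hypothetical stand-in for the label returned by an OLS lookup, so the example runs without any network access.

```python
# Stand-in for the label an OLS query would return for each CURIE (assumption for illustration).
REFERENCE_LABELS = {"NCBITAXON:9606": "Homo sapiens"}


def label_matches(curie, label, reference=REFERENCE_LABELS):
    """Strict check: the label must equal the reference label exactly (no aliases, no case folding)."""
    expected = reference.get(curie)
    return expected is not None and label == expected


for label in ("Homo sapiens", "homo sapiens", "Human"):
    print(repr(label), label_matches("NCBITAXON:9606", label))
```

Only "Homo sapiens" passes. Accepting aliases or case differences would be a deliberate relaxation of this check, not something the strict comparison does.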