airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

OntoVoc Sprint 04/21 #524

Closed bussec closed 2 years ago

bussec commented 3 years ago
bcorrie commented 3 years ago

@bussec wondering what the rationale is for having a map and a provider and how CURIEMap is intended to work with InformationProvider.

If I understand correctly, CURIEMap with the map field essentially replicates what we used to have. It allows you to resolve a CURIE with a resolved IRI to get a mapping...

Is the intent of the provider to denote services that can provide structured, machine-readable responses, as opposed to map, which provides a basic lookup and takes you to a web page?

bcorrie commented 3 years ago

If I am understanding things correctly, it looks like ROR in InformationProvider is incorrect. It is:

    ROR:
      request:
        url: "https://api.ror.org/organizations/{iri}"
        response: application/json

But I think it should be:

    ROR:
      request:
        url: "https://api.ror.org/organizations/{local_id}"
        response: application/json

Similar to the ORCID entry.

bcorrie commented 3 years ago

And I think ORCID returns XML, not JSON.

Holding off making any changes until I confirm I understand correctly 8-)

bussec commented 3 years ago

@bcorrie Trying to address all of your questions below:

  1. Intention of the two data structures: In general, your interpretation is correct; however, retrieving human-readable information via the IRI directly is not part of the specification:
    • CURIEMap is supposed to expand a CURIE to a full IRI, which can serve as a unique identifier of a concept/instance.
    • InformationProvider is supposed to specify how a machine-readable record of the concept/instance can be retrieved. Preference will be given to JSON format, but this might not always be available (indicated via request.response). Unless defined otherwise, it is assumed that the final URI will be used to perform an HTTP GET request without any additional information.
    • The fact that the IRIs are resolvable by DNS and will take you to a webpage is at this point considered to be beyond the scope of the schema. We could however introduce a flag that indicates that this is the case. I had such a flag in the previous commits, but I removed it to simplify the structure.
  2. ROR API Requests: The ROR API supports both IRI and local ID as query inputs. The request URI with an IRI at the end looks a bit like a copy-paste gone wrong, but it is according to the ROR specs. If you prefer a less confusing URI we could also use https://api.ror.org/organizations?query={iri} as request.url. I would prefer to use IRIs instead of local IDs whenever possible, as this is closer to the way they are used in linked data.
  3. ORCID API Responses: The ORCID API supports multiple response formats and defaults to XML. However, you can select JSON output by passing the respective MIME type via the Accept field in the HTTP header. I assume that this is the preferred format for us.
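
The split described in the list above (CURIEMap expands a CURIE to an IRI, InformationProvider says how to fetch a machine-readable record) can be sketched roughly as follows. This is only an illustration under stated assumptions: the two dictionaries are trimmed-down stand-ins rather than the real spec content, the `https://ror.org/` prefix and the ROR identifier `0abcdef12` are placeholders, and the helper names `expand_curie` and `build_request` are hypothetical. The request is built but never sent.

```python
import urllib.request

# Assumed, simplified stand-ins for the two structures (not the real spec files).
CURIE_MAP = {"ROR": {"iri_prefix": "https://ror.org/"}}
PROVIDERS = {
    "ROR": {
        "request": {
            "url": "https://api.ror.org/organizations/{iri}",
            "response": "application/json",
        }
    }
}


def expand_curie(curie, curie_map):
    """Expand 'PREFIX:local_id' to a full IRI using the map's iri_prefix."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in curie_map:
        raise ValueError(f"cannot expand CURIE: {curie!r}")
    return curie_map[prefix]["iri_prefix"] + local_id


def build_request(curie, curie_map, providers):
    """Fill the provider's URL template and ask for its response MIME type."""
    prefix, _, local_id = curie.partition(":")
    iri = expand_curie(curie, curie_map)
    spec = providers[prefix]["request"]
    # The template may use {iri} or {local_id}; unused keys are ignored.
    url = spec["url"].format(iri=iri, local_id=local_id)
    # Passing the MIME type via Accept is how JSON would be selected
    # for APIs that default to another format.
    return urllib.request.Request(
        url, headers={"Accept": spec["response"]}, method="GET"
    )


# Hypothetical ROR identifier, used purely for illustration.
req = build_request("ROR:0abcdef12", CURIE_MAP, PROVIDERS)
print(req.full_url)
```

With the `{iri}` template the printed URL ends in a full IRI, which is the "copy-paste gone wrong" look mentioned in point 2.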

bcorrie commented 3 years ago

@bussec do you want to have a look at my changes and provide some feedback? In particular, I am wondering about the terminology in the intro section where I talk about CURIEMap and InformationProvider.

bussec commented 3 years ago

@bcorrie Thanks, looks good. I realized that with the clear split between CURIE mapping and data retrieval, the term provider has become fuzzy, as it can either refer to an

Maybe Authority is the better term for the former one... will ponder upon this when writing the report.

schristley commented 3 years ago

@bussec Can I merge this PR, or still working on it?

bussec commented 3 years ago

@schristley No, I am still working on it... hope to complete it by the end of the week.

schristley commented 3 years ago

IEDB has released a beta query API for their database.

Of interest is they have a curie_map endpoint which returns a structure similar to our CURIEMap. I've noticed however that the IRI goes to the human-readable html page instead of the API which returns machine-readable JSON. For example, IEDB_EPITOPE: 7355 resolves to:

https://www.iedb.org/epitope/7355

versus

https://query-api.iedb.org/epitope_search?structure_id=eq.7355

The nice thing is that now we should be able to add a single field in Rearrangement if we want to link to an epitope #44

Though that opens the question on whether we should put entries into our CURIEMap for resolving IEDB_EPITOPE or if we should use IEDB's?

We also can consider how we might link with receptors. The IEDB API has tcr_search and bcr_search endpoints:

https://query-api.iedb.org/tcr_search?limit=5
https://query-api.iedb.org/bcr_search?limit=5

The receptor ids have IEDB_RECEPTOR as their CURIE though oddly it's missing from the above curie_map. However, we might want to consider how we can link our Receptor to IEDB's.

bcorrie commented 3 years ago

This is very nice... Should we take this up on #44 as to if/how to add this?

bcorrie commented 3 years ago

Cool, if I find a CDR3 of interest in an AIRR-seq data set, I can ask IEDB if it has any known antigen specificity...

curl https://query-api.iedb.org/tcr_search?receptor_chain2_cdr3_seq=eq.ASSPPGLSQSYGYT
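
The same lookup can be assembled in Python. This is a minimal sketch that only builds the query URL from the curl example above and does not contact the API; the helper name `tcr_search_url` and its `field` parameter are made up for illustration.

```python
from urllib.parse import urlencode

IEDB_API = "https://query-api.iedb.org"


def tcr_search_url(cdr3, field="receptor_chain2_cdr3_seq"):
    """Build a tcr_search URL that filters one field with the 'eq.' operator."""
    return f"{IEDB_API}/tcr_search?" + urlencode({field: f"eq.{cdr3}"})


print(tcr_search_url("ASSPPGLSQSYGYT"))
```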

bussec commented 3 years ago

@bcorrie @schristley

This is very nice... Should we take this up on #44 as to if/how to add this?

#44 is about reactivity/epitopes so we can discuss the questions related to these points there. For the issues revolving around IDs for Receptor objects I now created #540.

bussec commented 2 years ago

Note to self: Make sure that #465 is included in here (or at least not in conflict).

schristley commented 2 years ago

Recent discussion at the JSON Schema org about ontologies, including a reference to us. The Human Cell Atlas example is interesting as it supports multiple ontologies and even specifies the relation,

bcorrie commented 2 years ago

@bussec are we comfortable with the CURIEMap and InformationProvider objects in the Spec? We are working on an ADC ontology checker that will use the above to validate an NCBITAXON:9606 style of CURIE in a repository. I don't want to develop code against something that is going to change dramatically. I will probably restrict the ontology checker to use the OLS provider (at least for now).

bussec commented 2 years ago

@bcorrie IMO yes... I assume you could cope with a field still changing its name as long as the overall structure is not affected, correct?

bcorrie commented 2 years ago

@bcorrie IMO yes... I assume you could cope with a field still changing its name as long as the overall structure is not affected, correct?

Minor changes good, major changes bad... 8-)

bcorrie commented 2 years ago

One question - the CurieMap and InformationProvider objects are not "Objects" defined in the same way that the other objects are - that is with a

CURIEMap:
  discriminator: AIRR
  type: object
  properties:
...

If we have the above for both of these objects, then when you use the AIRR python library, you automatically can access these objects using the AIRR Schema class. That is if you include the AIRR library you can do this:

# Import AIRR Schema class
from airr.schema import Schema

# Get the schema object for CURIEMap
curiemap_schema = Schema('CURIEMap')

# Process the object as you see fit - in this case collect the OBO IRI
# prefix for every ontology/taxonomy CURIE prefix
ontology_iri_dict = {}
for curie_prefix, values in curiemap_schema.properties.items():
    if values['type'] in ('ontology', 'taxonomy'):
        ontology_iri_dict[curie_prefix] = values['map']['OBO']['iri_prefix']

No handling of AIRR Spec files and processing them - the AIRR library does it for you.

The problem is that the AIRR library expects this basic form for all AIRR Spec objects.

This is simple to add (I already have in my local copy) and I can push if you agree...

bcorrie commented 2 years ago

I just wrote some very basic code to check Ontology labels, based on how the AIRR Spec defines ontologies. It uses the AIRR python library and the modified (as above) CURIEMap and InformationProvider to build an OLS/OBO query, and then checks the results: https://github.com/sfu-ireceptor/sandbox/tree/master/ontology-check

$ python3 airr-onotlogy.py NCBITAXON:9606 "homo sapiens"
ERROR: Invalid CURIE/label: NCBITAXON:9606, homo sapiens, correct label = Homo sapiens
$ python3 airr-onotlogy.py NCBITAXON:9606 "Homo sapiens"
Valid CURIE and label: NCBITAXON:9606, Homo sapiens
$ python3 airr-onotlogy.py DOID:0080600 COVID-19
Valid CURIE and label: DOID:0080600, COVID-19

First steps toward an ADC Ontology checker... 8-)
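
The comparison step behind output like the above might look roughly like this. It is a sketch and not the linked script: the provider query is replaced by a hard-coded lookup table (an assumption, filled only with the two labels shown in the output above), and the names `KNOWN_LABELS` and `check_label` are hypothetical.

```python
# Stand-in for the OLS/OBO query: labels taken from the example output above.
KNOWN_LABELS = {
    "NCBITAXON:9606": "Homo sapiens",
    "DOID:0080600": "COVID-19",
}


def check_label(curie, label, lookup=KNOWN_LABELS.get):
    """Compare a user-supplied label with the provider's label (case-sensitive)."""
    correct = lookup(curie)
    if correct is None:
        return f"ERROR: Unknown CURIE: {curie}"
    if label != correct:
        return (
            f"ERROR: Invalid CURIE/label: {curie}, {label}, "
            f"correct label = {correct}"
        )
    return f"Valid CURIE and label: {curie}, {label}"


print(check_label("NCBITAXON:9606", "homo sapiens"))
print(check_label("NCBITAXON:9606", "Homo sapiens"))
```

Passing the lookup as a parameter keeps the comparison testable without network access; a real checker would plug in a function that queries the chosen provider.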

schristley commented 2 years ago

One question - the CurieMap and InformationProvider objects are not "Objects" defined in the same way that the other objects are - that is with a

@bcorrie That's correct. The other AIRR objects are schema definitions, kind of like a Class in OO-programming, while CURIEMap and InformationProvider are instances, that is, they contain actual data instead of defining the structure. We can have a schema definition for CURIEMap; it would look something like this. It looks a bit odd because the keys are not pre-defined (thus the additionalProperties). But remember, you'd still need another object that actually holds the data.

CURIEMap:
  discriminator: AIRR
  type: object
  additionalProperties:
    type: object
    properties:
      type:
        type: string
      default:
        type: object
        properties:
          map:
            type: string
          provider:
            type: string
      map:
        type: object
        additionalProperties:
          type: object
          properties:
            iri_prefix:
              type: string

No handling of AIRR Spec files and processing them - the AIRR library does it for you.

The objects are similar to Info, so they do need to be handled explicitly by the AIRR library, regardless of whether a schema definition is defined or not.

It would be nice to add a simple resolve-like function to the AIRR library so users don't have to write their own, at least if they use R or python. That would also help insulate users from changes.
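
To make the definition/instance distinction concrete, here is a small hand-written structural check that mirrors the schema definition sketched above and runs against an instance that holds actual data. It is plain Python and not the AIRR library; the function name `curiemap_errors` is hypothetical, the example entry (including the OBO IRI prefix) is an assumed illustration, and treating `type` and `iri_prefix` as required is an assumption the schema sketch does not state.

```python
def curiemap_errors(curie_map):
    """Return a list of structural problems in a CURIEMap-like dict."""
    errors = []
    for prefix, entry in curie_map.items():
        if not isinstance(entry, dict):
            errors.append(f"{prefix}: entry is not an object")
            continue
        # Assumption: 'type' is required and must be a string.
        if not isinstance(entry.get("type"), str):
            errors.append(f"{prefix}: 'type' must be a string")
        # 'default' names the preferred map and provider, both strings.
        for key, value in entry.get("default", {}).items():
            if key in ("map", "provider") and not isinstance(value, str):
                errors.append(f"{prefix}: default '{key}' must be a string")
        # Each named map must carry a string 'iri_prefix' (assumed required).
        for name, mapping in entry.get("map", {}).items():
            if not isinstance(mapping, dict) or not isinstance(
                mapping.get("iri_prefix"), str
            ):
                errors.append(f"{prefix}: map '{name}' needs a string 'iri_prefix'")
    return errors


# Assumed example instance, i.e. the object "that actually holds the data".
instance = {
    "NCBITAXON": {
        "type": "taxonomy",
        "default": {"map": "OBO", "provider": "OLS"},
        "map": {"OBO": {"iri_prefix": "http://purl.obolibrary.org/obo/NCBITaxon_"}},
    }
}
print(curiemap_errors(instance))
```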

schristley commented 2 years ago

There's actually an issue to validate ontology fields #503 which would imply that the AIRR library knows how to resolve the CURIE to check.

bcorrie commented 2 years ago

There's actually an issue to validate ontology fields #503 which would imply that the AIRR library knows how to resolve the CURIE to check.

I think #503 is to check the validity of a CURIE in terms of its format - that is, it is colon separated and the CURIE prefix exists in the schema, but not whether the CURIE itself is a valid CURIE for that Ontology, correct?

The code I have can do that, and we can probably reuse some of it in the AIRR Library, but my goal with the code above is to create a CURIE checker for the content of an ADC repository.
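
The format-only check described above (colon-separated, prefix known to the schema) is cheap and needs no network access. A minimal sketch, with a hypothetical function name and an assumed set of known prefixes, could be:

```python
def curie_format_problem(curie, known_prefixes):
    """Return a description of the format problem, or None if the CURIE looks fine.

    This checks syntax and prefix only; it does not verify that the local ID
    exists in the ontology, which requires querying a provider.
    """
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return "missing ':' separator"
    if not prefix or not local_id:
        return "empty prefix or local ID"
    if prefix not in known_prefixes:
        return f"unknown prefix '{prefix}'"
    return None


prefixes = {"NCBITAXON", "DOID"}  # assumed subset, for illustration only
print(curie_format_problem("NCBITAXON:9606", prefixes))
print(curie_format_problem("NCBITAXON9606", prefixes))
```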

bcorrie commented 2 years ago

@bcorrie That's correct. The other AIRR objects are schema definitions, kind of like a Class in OO-programming while CURIEMap and InformationProvider are instances, that is they contain actual data instead of defining the structure.

Right, I remember having this discussion earlier, thanks for the reminder - perhaps we should document that in the Spec itself so it is clear to people like me who forget 8-) I wonder - should we have object definitions for these object instances - then we could use the AIRR schema to confirm the validity of those objects?

bcorrie commented 2 years ago

@bussec did you do some sort of weird merge/push (force-push) above? When I do a pull, my local copies of the ontovoc-4-21 branch complain that I have some minor conflicts in a doc file. When I fix the one conflicting file, it says I have 17 local changes, even though I haven't actually changed anything.

This is on a Linux box, but I had the same situation when I tried to access this on my desktop through Sourcetree...

bussec commented 2 years ago

@bcorrie Yes, I rebased the ontovoc-04-21 branch onto the current tip of master to reduce potential conflicts as this branch hasn't seen any work for 8 months. This should however not create any conflicts on your side, unless you had changes that were not pushed yet. Please PM me the details.