bmir-radx / radx-project

This repo serves as a primary location for tracking issues that don't quite fit into our other dedicated repositories

Support looking up terms by OBO Id #46

Open matthewhorridge opened 7 months ago

matthewhorridge commented 7 months ago

The RADx Data Dictionary Specification allows ontology terms to be provided for Data Elements and Enumerations. These terms can be provided in a short form as OBO Ids. For example, NCIT:C16670.

When I search for terms in BioPortal I get no results, e.g.

[Screenshot: BioPortal search returning no results]

This should return the exact match of the term with the id in first position.

alexskr commented 7 months ago

BioPortal doesn't have the OBO version of NCIT at the moment, and the OWL version uses the Thesaurus:C16670 form of ID; however, search still fails for it.

matthewhorridge commented 7 months ago

Please could we use NCIT:CXXXXX as a synonym for Thesaurus:CXXXXX?

In general, the pattern is {OntologyAcronym}:{TermCode}. The ontology with {OntologyAcronym} as its acronym should always be the top hit, ignoring all other metrics.

The more I work with these kinds of terms and searches for them the more critical I think this is for RADx. It is also useful for BioPortal in general and anyone working with OBO ontologies.

matthewhorridge commented 7 months ago

More details (in place here)...

We are looking for terms that have this IRI pattern: http://purl.obolibrary.org/obo/{OntologyAcronym}_{NumericId}. These should be indexed against the string {OntologyAcronym}:{NumericId}
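A minimal sketch of that indexing rule, assuming term IRIs are available as plain strings. The names `OBO_PURL` and `build_obo_index` are hypothetical and not part of BioPortal, and restricting the acronym to ASCII letters is an assumption made for illustration:

```python
import re

# Assumed shape of an OBO PURL: .../obo/{OntologyAcronym}_{NumericId}
# (letters-only acronym is an illustrative assumption)
OBO_PURL = re.compile(r"^http://purl\.obolibrary\.org/obo/([A-Za-z]+)_([0-9]+)$")


def build_obo_index(term_iris):
    """Map '{OntologyAcronym}:{NumericId}' keys to the IRIs they stand for."""
    index = {}
    for iri in term_iris:
        match = OBO_PURL.match(iri)
        if match is None:
            continue  # not an OBO-style PURL, so no short key is generated
        key = f"{match.group(1)}:{match.group(2)}"
        index.setdefault(key, []).append(iri)
    return index
```

For example, `build_obo_index(["http://purl.obolibrary.org/obo/GO_0050892"])` would index that IRI under the key `GO:0050892`.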

marcosmro commented 7 months ago

@mdorf Could you please break down this issue into a list of low-level tasks, and provide an estimate of the time needed for completion?

mdorf commented 7 months ago

> More details (in place here)...
>
> We are looking for terms that have this IRI pattern: http://purl.obolibrary.org/obo/{OntologyAcronym}_{NumericId}. These should be indexed against the string {OntologyAcronym}:{NumericId}

Is there a generic algorithm that would apply to any ontology ID? Say, {OntologyAcronym}:{Last Fragment of ID}?

mdorf commented 7 months ago

> @mdorf Could you please break down this issue into a list of low-level tasks, and provide an estimate of the time needed for completion?

Probably something like this:

  1. Fix the existing search on short IDs (with no colons) - 3 days
  2. Enable search on the short IDs with colons for ontology terms (need some time to investigate this, as we had purposefully avoided this case for some reason) ~ 5 days
  3. Implement missing support for ontology IRIs in BioPortal ~ 2 days
  4. Enable search on the {OntologyAcronym}:{TermCode} ~ 4 days

matthewhorridge commented 7 months ago

> More details (in place here)... We are looking for terms that have this IRI pattern: http://purl.obolibrary.org/obo/{OntologyAcronym}_{NumericId}. These should be indexed against the string {OntologyAcronym}:{NumericId}
>
> Is there a generic algorithm that would apply to any ontology ID? Say, {OntologyAcronym}:{Last Fragment of ID}?

I think this could be true.

matthewhorridge commented 7 months ago

Just a note: because multiple ontologies can reuse terms, something like

http://purl.obolibrary.org/obo/{CL}_{0000001} could appear in multiple ontologies. However, the very top hit should be the one from the CL ontology.
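One way to picture that ordering is sketched below. It assumes each search hit is a dict with hypothetical `ontology` and `score` fields; this is not BioPortal's actual result format or ranking code:

```python
def rank_hits(query, hits):
    """Put hits from the ontology named by the query prefix first.

    `query` is a short ID such as 'CL:0000001'. `hits` is a list of dicts
    with hypothetical 'ontology' (acronym) and 'score' keys. Within each
    group, hits keep their usual descending-score order.
    """
    prefix = query.split(":", 1)[0].upper()
    return sorted(
        hits,
        # False sorts before True, so the matching ontology comes first.
        key=lambda hit: (hit["ontology"].upper() != prefix, -hit["score"]),
    )
```

With this ordering, a low-scoring hit from CL still outranks higher-scoring hits for the same term reused in other ontologies.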

mdorf commented 4 months ago

@matthewhorridge, below are some example terms from different ontologies that we went over in the meeting. Would it be possible for you to fill in the results for each example that show what the short ID would look like? Also, if you can document the general rules for extracting these short IDs, it would be great. Ideally, this solution should handle ALL variations of ontologies in our system to be relatively generic.

Acronym: NCIT ID: http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C20047 prefixIRI: Thesaurus:C20047 Result: NCIT:C20047

Acronym: LOINC ID: http://purl.bioontology.org/ontology/LNC/MTHU000231 notation: MTHU000231 Result: ---

Acronym: DRON ID: http://purl.obolibrary.org/obo/CHEBI_46195 notation: CHEBI:46195 Result: CHEBI:46195

Acronym: RXNORM ID: http://purl.bioontology.org/ontology/RXNORM/202433 notation: 202433 Result: ---

Acronym: GO ID: http://purl.obolibrary.org/obo/GO_0050892 notation: GO:0050892 Result: GO:0050892

Acronym: ONS ID: http://purl.obolibrary.org/obo/GO_0003872 prefixIRI: GO:0003872 Result: GO:0003872

Acronym: BAO ID: http://www.bioassayontology.org/bao#BAO_0003114 prefixIRI: bao:BAO_0003114 Result: BAO:0003114

Acronym: GFO ID: http://www.onto-med.de/ontologies/gfo.owl#Relational_role prefixIRI: gfo:Relational_role Result: ---

Acronym: UNITSONT ID: http://mimi.case.edu/ontologies/2009/1/UnitsOntology#base_unit prefixIRI: unit:base_unit Result: ---

Acronym: ICF ID: http://who.int/icf#b126 prefixIRI: b126 Result: ---

Acronym: EDAM ID: http://edamontology.org/data_1598 prefixIRI: data_1598 Result: data:1598

Acronym: PMA ID: http://www.bioontology.org/pma.owl#PMA_357 prefixIRI: PMA_357 Result: PMA:357

Acronym: NDF-RT ID: http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl#N0000011165 prefixIRI: N0000011165 Result: ---

matthewhorridge commented 4 months ago

@mdorf I have updated your post with results. The basic algorithm is:

1) Find this regex in the ID: ([A-Za-z]+)_([0-9]+)$

2) If found, then the "OboId" is formed from the regex result as $1:$2

This regex might be a little bit conservative but I'd prefer to stick to this for now.

I think the name of the field should be OboId (i.e. where we currently have result).

If the $1 group match is equal to the ontology acronym, then this result should be boosted to be the top search result.

Note that NCIT is a special case here because we don't have the OBO version of it in BioPortal.
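The two steps can be sketched in Python as follows. The function name `extract_obo_id` is hypothetical; the regex is the one given above, and the sample IDs in the usage note come from the example list earlier in this thread:

```python
import re

# Regex from the comment above: letters, underscore, digits at the end of the ID.
OBO_ID_PATTERN = re.compile(r"([A-Za-z]+)_([0-9]+)$")


def extract_obo_id(term_id):
    """Return '$1:$2' if the term ID ends with {letters}_{digits}, else None."""
    match = OBO_ID_PATTERN.search(term_id)
    if match is None:
        return None
    return f"{match.group(1)}:{match.group(2)}"
```

Applied to the examples, `http://www.bioassayontology.org/bao#BAO_0003114` gives `BAO:0003114` and `http://edamontology.org/data_1598` gives `data:1598`, while the NCIT, GFO, and ICF IDs do not match and return `None`.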

mdorf commented 4 months ago

Thank you, @matthewhorridge for documenting these. It seems NCIT is a special case for both rules 1. and 2., correct? It does not match the regex AND the prefix of the OBO ID is formed using the ontology acronym instead of the $1 match.

matthewhorridge commented 4 months ago

Yes, that's right. One possibility is a fallback for when the term ID does not match the regex:

If the term ID starts with the ontology IRI, (1) remove the part that matches the ontology IRI, (2) remove the first character of the remaining part (# or / would be expected), and then (3) take the ontology acronym, append a colon, and append the remaining term ID characters. For example,

Given, http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C20047 with an ontology IRI of http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl and an ontology acronym of NCIT,

(1) Remove http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl from the term ID, to give #C20047.

(2) Remove the next character, #, to give C20047.

(3) Concatenate the ontology acronym (NCIT) with a colon and the remainder of the term ID, i.e. NCIT:C20047.
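A rough sketch of this fallback, assuming the ontology IRI and acronym are already known to the caller. The function name `fallback_short_id` is hypothetical, and returning `None` when nothing is left after the separator is an added assumption, not part of the rule as stated:

```python
def fallback_short_id(term_id, ontology_iri, acronym):
    """Build '{acronym}:{local part}' when the term ID starts with the ontology IRI."""
    if not term_id.startswith(ontology_iri):
        return None
    # (1) Remove the part that matches the ontology IRI.
    remainder = term_id[len(ontology_iri):]
    # (2) Remove the next character (expected to be '#' or '/').
    local_part = remainder[1:]
    if not local_part:
        return None  # assumption: an empty local part yields no short ID
    # (3) Ontology acronym, a colon, then the remaining characters.
    return f"{acronym}:{local_part}"
```

Called with the NCIT term ID, ontology IRI, and acronym from the example, this yields `NCIT:C20047`.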

mdorf commented 4 months ago

> If the term ID starts with the ontology IRI

The only issue with that is that we don't expose the ontology IRI via our API or store it as metadata at the moment.

mdorf commented 4 months ago

These rules require implementing additional functionality in BioPortal that would allow retrieving the ontology IRIs. Timeline adjusted accordingly...

matthewhorridge commented 4 months ago

As a first step would it be possible to implement the functionality that does not require looking at the ontology IRI? (So do this as a two step implementation)

mdorf commented 4 months ago

> As a first step would it be possible to implement the functionality that does not require looking at the ontology IRI? (So do this as a two step implementation)

@matthewhorridge, we actually have an existing pull request from AgroPortal that implements the ability to retrieve the ontology IRI. The only issue is this PR is two years old, and some code has diverged from its original implementation, which requires a bit of manual work during the merge. I don't expect it to be a huge undertaking, so my recommendation is to roll it in as part of this development. It's also a very important and useful metadata attribute to be exposed via the BioPortal API.

alexskr commented 3 months ago

The short ID search enhancement has been deployed in BioPortal.