Open matthewhorridge opened 7 months ago
BioPortal doesn't have an OBO version of NCIT at the moment, and the OWL version uses the Thesaurus:C16670 form of ID; however, search still fails for it.
Please could we use NCIT:CXXXXX as a synonym for Thesaurus:CXXXXX?
In general the pattern is {OntologyAcronym}:{TermCode}. The ontology with {OntologyAcronym} as its acronym should always be the top hit, ignoring all other metrics.
The more I work with these kinds of terms and searches for them the more critical I think this is for RADx. It is also useful for BioPortal in general and anyone working with OBO ontologies.
More details (in place here)...
We are looking for terms that have this IRI pattern: http://purl.obolibrary.org/obo/{OntologyAcronym}_{NumericId}. These should be indexed against the string {OntologyAcronym}:{NumericId}.
@mdorf Could you please break down this issue into a list of low-level tasks, and provide an estimate of the time needed for completion?
Is there a generic algorithm that would apply to any ontology ID? Say, {OntologyAcronym}:{Last Fragment of ID}?
@mdorf Could you please break down this issue into a list of low-level tasks, and provide an estimate of the time needed for completion?
Probably something like this:
Is there a generic algorithm that would apply to any ontology ID? Say, {OntologyAcronym}:{Last Fragment of ID}?
I think this could be true.
Just a note... because multiple ontologies can reuse terms, something like http://purl.obolibrary.org/obo/{CL}_{0000001} could appear in multiple ontologies. However, the very top hit should be the CL ontology.
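The ranking rule described above could be sketched roughly as follows. This is only an illustration, not BioPortal's search code: the `Hit` type, its `score` field, the `rank_hits` function, and the acronyms `ONTO_A` and `ONTO_B` are all hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    ontology: str   # acronym of the ontology this hit came from
    term_iri: str   # IRI of the matched term
    score: float    # hypothetical relevance score from the search engine


def rank_hits(query: str, hits: list[Hit]) -> list[Hit]:
    """Order hits by score, then move the hit from the ontology whose
    acronym equals the query prefix to the front, ignoring other metrics."""
    prefix, sep, _ = query.partition(":")
    by_score = sorted(hits, key=lambda h: h.score, reverse=True)
    if not sep:
        # Not a {OntologyAcronym}:{TermCode} query; keep plain score order.
        return by_score
    # sorted() is stable: False sorts before True, so matching hits move
    # to the front while the rest keep their score order.
    return sorted(by_score, key=lambda h: h.ontology != prefix)


iri = "http://purl.obolibrary.org/obo/CL_0000001"
hits = [
    Hit("ONTO_A", iri, 0.9),  # hypothetical ontology reusing the CL term
    Hit("CL", iri, 0.4),
    Hit("ONTO_B", iri, 0.7),  # hypothetical ontology reusing the CL term
]
ranked = rank_hits("CL:0000001", hits)
```

Here the CL hit ends up first even though its assumed score is the lowest, which is the behaviour the note asks for.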
@matthewhorridge, below are some example terms from different ontologies that we went over in the meeting. Would it be possible for you to fill in the results for each example that show what the short ID would look like? Also, if you can document the general rules for extracting these short IDs, it would be great. Ideally, this solution should handle ALL variations of ontologies in our system to be relatively generic.
Acronym: NCIT
ID: http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C20047
prefixIRI: Thesaurus:C20047
Result: NCIT:C20047
Acronym: LOINC
ID: http://purl.bioontology.org/ontology/LNC/MTHU000231
notation: MTHU000231
Result: ---
Acronym: DRON
ID: http://purl.obolibrary.org/obo/CHEBI_46195
notation: CHEBI:46195
Result: CHEBI:46195
Acronym: RXNORM
ID: http://purl.bioontology.org/ontology/RXNORM/202433
notation: 202433
Result: ---
Acronym: GO
ID: http://purl.obolibrary.org/obo/GO_0050892
notation: GO:0050892
Result: GO:0050892
Acronym: ONS
ID: http://purl.obolibrary.org/obo/GO_0003872
prefixIRI: GO:0003872
Result: GO:0003872
Acronym: BAO
ID: http://www.bioassayontology.org/bao#BAO_0003114
prefixIRI: bao:BAO_0003114
Result: BAO:0003114
Acronym: GFO
ID: http://www.onto-med.de/ontologies/gfo.owl#Relational_role
prefixIRI: gfo:Relational_role
Result: ---
Acronym: UNITSONT
ID: http://mimi.case.edu/ontologies/2009/1/UnitsOntology#base_unit
prefixIRI: unit:base_unit
Result: ---
Acronym: ICF
ID: http://who.int/icf#b126
prefixIRI: b126
Result: ---
Acronym: EDAM
ID: http://edamontology.org/data_1598
prefixIRI: data_1598
Result: data:1598
Acronym: PMA
ID: http://www.bioontology.org/pma.owl#PMA_357
prefixIRI: PMA_357
Result: PMA:357
Acronym: NDF-RT
ID: http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl#N0000011165
prefixIRI: N0000011165
Result: ---
@mdorf I have updated your post with results. The basic algorithm is:
1) Find this regex in the ID: ([A-Za-z]+)_([0-9]+)$
2) If found, then the "OboId" is formed from the regex result as $1:$2
This regex might be a little bit conservative but I'd prefer to stick to this for now.
I think the name of the field should be OboId (i.e. where we currently have result).
If the $1 group match is equal to the ontology acronym, then this result should be boosted to be the top search result.
Note that NCIT is a special case here because we don't have the OBO version of it in BioPortal
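A minimal sketch of the two regex steps and the boost check, for illustration only. The function names `obo_id` and `should_boost` are assumptions and not part of BioPortal; the regex is the one given above, and the sample IRIs are taken from the example list earlier in the thread.

```python
import re
from typing import Optional

# The regex from step 1: letters, an underscore, then digits at the end of the ID.
OBO_ID_PATTERN = re.compile(r"([A-Za-z]+)_([0-9]+)$")


def obo_id(term_iri: str) -> Optional[str]:
    """Return the OboId as $1:$2, or None when the ID does not match."""
    match = OBO_ID_PATTERN.search(term_iri)
    if match is None:
        return None
    return f"{match.group(1)}:{match.group(2)}"


def should_boost(term_iri: str, ontology_acronym: str) -> bool:
    """True when the $1 group equals the acronym of the ontology being searched.

    Exact, case-sensitive comparison is an assumption of this sketch.
    """
    match = OBO_ID_PATTERN.search(term_iri)
    return match is not None and match.group(1) == ontology_acronym


go_id = obo_id("http://purl.obolibrary.org/obo/GO_0050892")
bao_id = obo_id("http://www.bioassayontology.org/bao#BAO_0003114")
ncit_id = obo_id("http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C20047")
```

With these inputs `go_id` is `GO:0050892`, `bao_id` is `BAO:0003114`, and `ncit_id` is `None`, since the NCIT ID contains no underscore-separated numeric suffix.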
Thank you, @matthewhorridge for documenting these. It seems NCIT is a special case for both rules 1. and 2., correct? It does not match the regex AND the prefix of the OBO ID is formed using the ontology acronym instead of the $1 match.
Yes, that's right. One possibility is that if the above rules fail to match the regex, then:
If the term ID starts with the ontology IRI, (1) remove the part matching the ontology IRI, (2) remove the first character of the remaining part (# or / would be expected), and then (3) take the ontology acronym, append a colon, and append the remaining term ID characters. For example,
Given http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C20047 with an ontology IRI of http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl and an ontology acronym of NCIT,
(1) Remove http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl from the term ID, to give #C20047.
(2) Remove the next character, #, to give C20047.
(3) Concatenate the ontology acronym (NCIT) with a colon and the remainder of the term ID, i.e. NCIT:C20047.
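The fallback steps could look something like the sketch below. The function name `fallback_short_id` and its parameters are hypothetical, and it assumes the ontology IRI is available to the caller.

```python
from typing import Optional


def fallback_short_id(term_iri: str, ontology_iri: str, acronym: str) -> Optional[str]:
    """Build {acronym}:{local part} when the term ID starts with the ontology IRI."""
    if not term_iri.startswith(ontology_iri):
        return None
    # (1) Remove the part that matches the ontology IRI.
    remainder = term_iri[len(ontology_iri):]
    # (2) Remove the first remaining character (# or / would be expected).
    local_part = remainder[1:]
    if not local_part:
        # Nothing left to use as an ID; treating this as "no result" is an
        # assumption of this sketch.
        return None
    # (3) Ontology acronym, a colon, then the remaining term ID characters.
    return f"{acronym}:{local_part}"


short_id = fallback_short_id(
    "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C20047",
    "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl",
    "NCIT",
)
```

For the NCIT example this yields `NCIT:C20047`. The sketch follows step (2) literally and drops the first character without checking that it is `#` or `/`; a stricter version might verify that.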
The only issue with that is that we don't expose the ontology IRI via our API or store it as metadata at the moment.
These rules require implementing additional functionality in BioPortal that would allow retrieving the ontology IRIs. Timeline adjusted accordingly...
As a first step would it be possible to implement the functionality that does not require looking at the ontology IRI? (So do this as a two step implementation)
@matthewhorridge, we actually have an existing pull request from AgroPortal that implements the ability to retrieve the ontology IRI. The only issue is this PR is two years old, and some code has diverged from its original implementation, which requires a bit of manual work during the merge. I don't expect it to be a huge undertaking, so my recommendation is to roll it in as part of this development. It's also a very important and useful metadata attribute to be exposed via the BioPortal API.
Short ID search enhancement has been deployed in BioPortal
The RADx Data Dictionary Specification allows ontology terms to be provided for Data Elements and Enumerations. These terms can be provided in a short form as OBO Ids, for example NCIT:C16670. When I search for such terms in BioPortal I get no results.
This should return the exact match of the term with that ID in first position.