EBISPOT / efo

Github repo for the Experimental Factor Ontology (EFO)
https://www.ebi.ac.uk/efo/

Cross-references using non-standard prefixes #878

Closed dhimmel closed 3 years ago

dhimmel commented 4 years ago

Ideally, cross-references (oboInOwl:hasDbXref relationships) would use the most standard prefix to identify an external terminology. https://registry.identifiers.org/ is the authority I tend to follow when selecting resource prefixes.

Certain cross-references don't use the standard identifiers.org prefix. For example, MeSH terms are split between the prefix MeSH (standard, ignoring capitalization) and MSH (non-standard).

I imagine many of these cross-references are imported from upstream resources. Nonetheless, it might make sense for EFO to process the xrefs it imports to standardize their prefixes (and do other cleanup, such as stripping whitespace as per https://github.com/EBISPOT/efo/issues/872).

Will follow up with more examples.

dhimmel commented 4 years ago

All cross-reference prefixes

I lowercased the xref prefixes and counted the number of occurrences of each in EFO v3.23.0. This lets us see every prefix in use and identify cases where multiple prefixes refer to the same resource.

Here's the table head:

| xref_prefix | count |
|:------------|------:|
| umls | 11879 |
| icd10 | 9570 |
| omim | 8509 |
| ncit | 8395 |
| mondo | 8368 |
| sctid | 6062 |
| doid | 5204 |
| mesh | 5175 |
| msh | 5019 |

Full xref prefix count table:

| xref_prefix | count |
|:-----------------------------|--------------:|
| umls | 11879 |
| icd10 | 9570 |
| omim | 8509 |
| ncit | 8395 |
| mondo | 8368 |
| sctid | 6062 |
| doid | 5204 |
| mesh | 5175 |
| msh | 5019 |
| gard | 3915 |
| icd9 | 3904 |
| snomedct | 3370 |
| pmid | 1978 |
| http | 1627 |
| meddra | 1483 |
| orphanet | 1478 |
| citexplore | 1406 |
| snomedct_us | 1011 |
| cohd | 975 |
| fma | 751 |
| efo | 737 |
| bto | 702 |
| emapa | 597 |
| wikipedia | 526 |
| icdo | 516 |
| zfa | 511 |
| ma | 501 |
| omimps | 460 |
| hp | 432 |
| oncotree | 422 |
| reactome | 410 |
| caloha | 374 |
| tao | 369 |
| kegg compound | 356 |
| vhog | 343 |
| chemidplus | 336 |
| mat | 325 |
| gaid | 312 |
| ehdaa | 266 |
| ehdaa2 | 263 |
| chembl | 261 |
| fbbt | 260 |
| xao | 256 |
| aao | 252 |
| opencyc | 246 |
| ev | 244 |
| reaxys | 218 |
| mo | 213 |
| beilstein | 204 |
| patent | 204 |
| miaa | 185 |
| galen | 178 |
| bams | 177 |
| kegg drug | 166 |
| fmaid | 148 |
| drugbank | 143 |
| dc | 138 |
| nifstd | 134 |
| nist chemistry webbook | 123 |
| birnlex | 102 |
| sael | 93 |
| gmelin | 88 |
| pdbechem | 86 |
| dhba | 83 |
| kegg | 82 |
| ordo | 74 |
| umls_cui | 72 |
| wbbt | 62 |
| hba | 58 |
| mba | 55 |
| po | 53 |
| zfs | 52 |
| dmba | 49 |
| bm | 48 |
| https | 44 |
| hmdb | 42 |
| cas | 41 |
| gtr | 41 |
| mfo | 41 |
| atcc | 40 |
| fyler | 40 |
| bila | 35 |
| metacyc | 35 |
| tads | 35 |
| mp | 34 |
| tgma | 34 |
| aeo | 33 |
| csp | 33 |
| wbls | 31 |
| hgnc | 29 |
| fbdv | 28 |
| pba | 27 |
| snomedct_2010_1_31 | 26 |
| vsao | 25 |
| hao | 23 |
| retired_ehdaa2 | 22 |
| dsstox_generic_sid | 20 |
| vfb | 20 |
| nif_subcellular | 19 |
| casrn | 19 |
| go | 18 |
| cmo | 17 |
| knapsack | 17 |
| sctid_2010_1_31 | 16 |
| nlxanat | 16 |
| mcc | 16 |
| rgd | 14 |
| atc_code | 14 |
| goc | 14 |
| zea | 13 |
| lipid_maps_instance | 12 |
| pato | 12 |
| oges | 12 |
| icd-10 | 12 |
| bsa | 11 |
| icd9cm | 10 |
| caro | 10 |
| ncim | 10 |
| ero | 10 |
| jax | 10 |
| evm | 9 |
| cl | 9 |
| medgen | 9 |
| ncithesaurus | 8 |
| envo | 8 |
| mmusdv | 7 |
| um-bbd | 7 |
| kupo | 7 |
| obi | 6 |
| icd10cm | 6 |
| drug_central | 6 |
| ehdaa2_retired | 5 |
| idomal | 5 |
| webelements | 5 |
| isbn | 5 |
| birn_anat | 5 |
| person | 5 |
| dsstox_cid | 4 |
| chebi | 4 |
| emapa_retired | 4 |
| bils | 4 |
| nlx | 4 |
| bilado | 4 |
| hsapdv | 3 |
| bao | 3 |
| gc_id | 3 |
| resid | 3 |
| submitter | 3 |
| oae | 3 |
| come | 3 |
| wikipediacategory | 3 |
| clo | 3 |
| uberon | 3 |
| mfmo | 3 |
| mfomd | 3 |
| fbtc | 3 |
| aniseed | 3 |
| lincs | 3 |
| molbase | 3 |
| bfo | 2 |
| ido | 2 |
| chemspider | 2 |
| germplasm | 2 |
| omit | 2 |
| symp | 2 |
| nifstd_retired | 2 |
| wikidata | 2 |
| lipid maps | 2 |
| medlineplus | 2 |
| ogms | 2 |
| structure_chemicalname_iupac | 2 |
| epcc | 2 |
| structure_formula | 2 |
| nci metathesaurus | 2 |
| icd11 | 2 |
| pdumdv | 1 |
| po_git | 1 |
| snomedct_us_2018_03_01 | 1 |
| umls cui | 1 |
| xtrodo | 1 |
| pro | 1 |
| to | 1 |
| spd | 1 |
| scdo | 1 |
| fao/who_standards | 1 |
| isbn-10 | 1 |
| medra | 1 |
| map | 1 |
| tao_retired | 1 |
| loinc | 1 |
| obo | 1 |
| uniprot | 1 |
| orcid | 1 |
| nci_thesaurus | 1 |
| ogem | 1 |
| dermo | 1 |
| zfa_retired | 1 |
| te | 1 |
| snomed | 1 |
| bpdb | 1 |
| nci | 1 |
| url | 1 |
| apweb | 1 |
| ppdb | 1 |
| modelled on http | 1 |
| atcc number | 1 |
| fbbt_root | 1 |
| ehda | 1 |
| aba | 1 |
| drerdo | 1 |
| um-bbd_compid | 1 |
| nif_cell | 1 |
| ndfrt | 1 |
| similar to cl | 1 |
| mth | 1 |
| isbn-13 | 1 |
| pdb | 1 |
| fao | 1 |

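
For anyone wanting to reproduce a count like the one above outside of SPARQL, here is a minimal Python sketch. It assumes the xrefs are already available as plain strings and that the prefix is whatever precedes the first colon, lowercased (which would explain entries such as `http` in the table). The function names and the sample values are hypothetical and not taken from EFO tooling.

```python
from collections import Counter


def xref_prefix(xref: str) -> str:
    """Return the lowercased text before the first colon, or "" if there is no colon."""
    prefix, sep, _ = xref.partition(":")
    return prefix.lower() if sep else ""


def count_prefixes(xrefs):
    """Count occurrences of each lowercased prefix, skipping xrefs without a colon."""
    counts = Counter(xref_prefix(x) for x in xrefs)
    counts.pop("", None)  # drop the bucket for strings that had no prefix
    return counts


# Illustrative xref strings only; not an extract from EFO.
sample_xrefs = ["MeSH:D003920", "MSH:D003924", "mesh:D003922", "UMLS:C0011849"]
for prefix, n in count_prefixes(sample_xrefs).most_common():
    print(prefix, n)
```

Sorting the result by count, as `most_common()` does, makes near-duplicate prefixes such as `mesh` and `msh` easy to spot next to each other.
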
dhimmel commented 4 years ago

Currently I'm including the following in my SPARQL query to help standardize xref prefixes:

  BIND( LCASE(STRBEFORE( ?xref, ":" )) AS ?xref_prefix_dirty )
  # Standardize prefixes. https://github.com/EBISPOT/efo/issues/878
  BIND(
    COALESCE(
      # https://blog.semaku.com/post/140876753748/using-coalesce-and-if-in-sparql-for-nested
      IF(?xref_prefix_dirty = "msh", "mesh", ?error),
      IF(?xref_prefix_dirty = "icd-10", "icd10", ?error),
      IF(?xref_prefix_dirty = "umls_cui", "umls", ?error),
      # Looked at several of these SNOMEDCT_US terms (US Edition of SNOMED CT) and they existed in the International Edition
      IF(?xref_prefix_dirty = "snomedct_us", "snomedct", ?error),
      IF(?xref_prefix_dirty = "snomedct_2010_1_31", "snomedct", ?error),
      IF(?xref_prefix_dirty = "snomedct_us_2018_03_01", "snomedct", ?error),
      ?xref_prefix_dirty
    ) AS ?xref_prefix
  )

Would it make sense to bring something like this upstream so all EFO users can benefit from cleaner and more standard xrefs?
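
For readers who post-process xrefs outside of SPARQL, the same mapping can be sketched in Python. This is only an illustration under a few assumptions: the dictionary copies the six rewrites from the query above, the names `PREFIX_FIXES` and `standardize_prefix` are hypothetical, and the whitespace stripping is an extra step (in the spirit of issue 872) that the query itself does not perform.

```python
# Rewrites copied from the SPARQL COALESCE/IF chain above; keys are lowercased prefixes.
PREFIX_FIXES = {
    "msh": "mesh",
    "icd-10": "icd10",
    "umls_cui": "umls",
    "snomedct_us": "snomedct",
    "snomedct_2010_1_31": "snomedct",
    "snomedct_us_2018_03_01": "snomedct",
}


def standardize_prefix(xref: str) -> str:
    """Return the standardized, lowercased prefix of an xref.

    Mirrors LCASE(STRBEFORE(?xref, ":")) followed by the rewrites:
    an xref without a colon yields the empty string.
    """
    prefix, sep, _ = xref.strip().partition(":")  # strip() is an added assumption
    if not sep:
        return ""
    dirty = prefix.lower()
    # Fall back to the unmodified lowercased prefix when no rewrite applies.
    return PREFIX_FIXES.get(dirty, dirty)


# Illustrative inputs only; not an extract from EFO.
for xref in ["MSH:D003924", "SNOMEDCT_US:73211009", "DOID:9351"]:
    print(xref, "->", standardize_prefix(xref))
```

A lookup table keeps the rewrites in one place, so adding another non-standard prefix is a one-line change rather than another nested `IF`.
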

paolaroncaglia commented 4 years ago

Quick note: this is related to https://github.com/EBISPOT/efo/issues/141.

zoependlington commented 3 years ago

Thank you for pointing these out, @dhimmel. We have fixed NCIt, MeSH and SNOMEDCT in EFO. If there are any other broken namespace prefixes, please let us know in a new ticket. I'll now move this to done.