EBISPOT / efo

GitHub repo for the Experimental Factor Ontology (EFO)
https://www.ebi.ac.uk/efo/

Some easy-to-detect invalid cross-references #872

Closed by dhimmel 3 years ago

dhimmel commented 4 years ago

We're using oboInOwl:hasDbXref relationships to translate information from other terminologies into EFO terms.

There are a few easy-to-detect xref values that are likely invalid. I imagine xrefs are often imported from elsewhere? So some of these are likely upstream problems. But others might originate with EFO.

whitespace

Some xrefs have leading or trailing whitespace. Not all of the whitespace renders below; I flagged an xref whenever its trimmed version did not match the original value:

| efo_uri | efo_id | efo_label | xref | xref_prefix | xref_accession |
| --- | --- | --- | --- | --- | --- |
| http://purl.obolibrary.org/obo/CHEBI_73177 | CHEBI:73177 | brassinazole | CiteXplore:10806228\n | citexplore | 10806228\n |
| http://purl.obolibrary.org/obo/CHEBI_73177 | CHEBI:73177 | brassinazole | CiteXplore:11144262\n | citexplore | 11144262\n |
| http://purl.obolibrary.org/obo/CHEBI_73177 | CHEBI:73177 | brassinazole | CiteXplore:16038953\n\n | citexplore | 16038953\n\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | CALOHA:TS-2035\n | caloha | TS-2035\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | CALOHA:TS-2035\n | caloha | TS-2035\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | FMA:68646\n | fma | 68646\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | FMA:68646\n | fma | 68646\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | GO:0005623\n | go | 0005623\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | GO:0005623\n | go | 0005623\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | KUPO:0000002\n\n | kupo | 0000002\n\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | KUPO:0000002\n\n | kupo | 0000002\n\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | VHOG:0001533\n\n | vhog | 0001533\n\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | VHOG:0001533\n\n | vhog | 0001533\n\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | WBbt:0004017\n | wbbt | 0004017\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | WBbt:0004017\n | wbbt | 0004017\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | XAO:0003012\n\n | xao | 0003012\n\n |
| http://purl.obolibrary.org/obo/CL_0000000 | CL:0000000 | cell | XAO:0003012\n\n | xao | 0003012\n\n |
| http://www.ebi.ac.uk/efo/EFO_0006318 | EFO:0006318 | breast ductal adenocarcinoma | UMLS:C1527349 | umls | C1527349 |
| http://www.ebi.ac.uk/efo/EFO_0006740 | EFO:0006740 | pulmonary mucoepidermoid carcinoma | PMID:24303221 | pmid | 24303221 |
| http://www.ebi.ac.uk/efo/EFO_0007725 | EFO:0007725 | embryo stage | BilaDO:0000002 | bilado | 0000002 |
| http://www.ebi.ac.uk/efo/EFO_0007725 | EFO:0007725 | embryo stage | EV:0300001 | ev | 0300001 |
| http://www.ebi.ac.uk/efo/EFO_0007725 | EFO:0007725 | embryo stage | FMA:72652 | fma | 72652 |
| http://www.ebi.ac.uk/efo/EFO_0007725 | EFO:0007725 | embryo stage | HsapDv:0000002 | hsapdv | 0000002 |
| http://www.ebi.ac.uk/efo/EFO_0007725 | EFO:0007725 | embryo stage | MmusDv:0000002 | mmusdv | 0000002 |
| http://www.ebi.ac.uk/efo/EFO_0007725 | EFO:0007725 | embryo stage | OGES:000022 | oges | 000022 |
| http://www.ebi.ac.uk/efo/EFO_0007725 | EFO:0007725 | embryo stage | WBls:0000092 | wbls | 0000092 |
| http://www.ebi.ac.uk/efo/EFO_0007725 | EFO:0007725 | embryo stage | WBls:0000102 | wbls | 0000102 |
| http://www.ebi.ac.uk/efo/EFO_0007725 | EFO:0007725 | embryo stage | XAO:1000012 | xao | 1000012 |
| http://www.ebi.ac.uk/efo/EFO_0007836 | EFO:0007836 | coenzyme Q10 measurement | PMID:27149984 | pmid | 27149984 |
| http://www.ebi.ac.uk/efo/EFO_0010695 | EFO:0010695 | elevated lactate dehydrogenase | PMID:25167691 | pmid | 25167691 |
| http://www.ebi.ac.uk/efo/EFO_0010720 | EFO:0010720 | invasive mechanical ventilation | ICD9:96.7 | icd9 | 96.7 |
| http://www.ebi.ac.uk/efo/EFO_0010721 | EFO:0010721 | lung transplantation | SNOMEDCT:88039007 | snomedct | 88039007 |
| http://purl.obolibrary.org/obo/HP_0006695 | HP:0006695 | Atrioventricular canal defect | DOID:50651 | doid | 50651 |
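
For anyone who wants to reproduce the whitespace check, here is a minimal sketch in Python. It is not the code I actually ran. The helper names `has_outer_whitespace` and `split_xref` are made up for illustration, it assumes the `\n` shown in the table stands for a real newline character, and `EXAMPLE:0000001` is a hypothetical clean xref added for contrast.

```python
from typing import Tuple


def has_outer_whitespace(xref: str) -> bool:
    """Return True when trimming the xref changes its value."""
    return xref != xref.strip()


def split_xref(xref: str) -> Tuple[str, str]:
    """Split an xref at the first colon into (lowercased prefix, accession).

    The accession is left untrimmed, so stray whitespace stays visible.
    """
    prefix, _, accession = xref.partition(":")
    return prefix.lower(), accession


# Two values from the table above plus one hypothetical clean value.
xrefs = ["CiteXplore:10806228\n", "KUPO:0000002\n\n", "EXAMPLE:0000001"]

flagged = [x for x in xrefs if has_outer_whitespace(x)]
for x in flagged:
    # repr() makes the otherwise invisible whitespace explicit.
    print(repr(x), split_xref(x))
```

Printing with `repr()` avoids the rendering problem mentioned above, since trailing newlines and spaces show up as escape sequences or inside the quotes.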

mesh identifiers

I flagged MeSH xrefs that do not match the following regex for valid MeSH IDs: ^[CD][0-9]{6}([0-9]{3}|)$

| efo_id | efo_label | mesh_id |
| --- | --- | --- |
| EFO:0004122 | obsolete_neurofibromatosis type II | DO16518 |
| EFO:0004137 | obsolete_epidermolytic hyperkeratosis | D0017488 |
| EFO:0004352 | mortality | Q000401 |
| EFO:0007046 | executive function | C0935584 |
| EFO:0007233 | diaphragm disease | NoID |
| EFO:0007422 | parotid disease | NoID |
| EFO:0007441 | placenta disease | NoID |
| EFO:0010581 | organophosphate poisoning | 68062025 |
| EFO:1001216 | tooth disease | DO14076 |

I didn't include xrefs to MeSH tree locations, which were almost entirely from UBERON and reported upstream in https://github.com/obophenotype/uberon/issues/698.
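
The MeSH check can be sketched the same way. This is an illustrative Python version, not the exact code I ran: `is_valid_mesh_id` is a hypothetical helper name, and `D000001` and `C000000001` are included only as strings that fit the pattern, not as claims about the correct IDs for any EFO term.

```python
import re

# The regex quoted above: C or D followed by exactly 6 or 9 digits.
MESH_ID_PATTERN = re.compile(r"^[CD][0-9]{6}([0-9]{3}|)$")


def is_valid_mesh_id(mesh_id: str) -> bool:
    """Return True when the whole string matches the MeSH ID pattern.

    fullmatch is used because, in Python, `$` with match/search also
    matches just before a trailing newline, which would let values like
    "D000001\n" slip through.
    """
    return MESH_ID_PATTERN.fullmatch(mesh_id) is not None


# Values reported in the table above, plus two format-valid placeholders.
candidates = [
    "DO16518", "D0017488", "Q000401", "C0935584", "NoID", "68062025",
    "DO14076", "D000001", "C000000001",
]

invalid = [m for m in candidates if not is_valid_mesh_id(m)]
print(invalid)
```

Several of the flagged values look like typos: a letter O in place of a zero (`DO16518`, `DO14076`) or an extra digit (`D0017488`, `C0935584`).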

zoependlington commented 4 years ago

Thanks @dhimmel. This has been a gradual cleanup in EFO, but I will definitely look into this a little more and see if there are any EFO-originating invalid cross-references that need cleaning up.

zoependlington commented 3 years ago

These have now been looked at and fixed during our other cleanups, so I'll move this to done.