EBISPOT / efo

Github repo for the Experimental Factor Ontology (EFO)
https://www.ebi.ac.uk/efo/

Duplicated EFO labels and synonyms #925

Open AsierGonzalez opened 3 years ago

AsierGonzalez commented 3 years ago

While investigating issues with disease mappings to EFO in Open Targets we have realised that labels and synonyms are not unique as we would expect for the majority of the terms. We have done a systematic analysis of EFO versions 3.24.0 and 3.25.0 using the JSON files available to download to measure how widespread this issue is. The numbers presented here belong to version 3.25.0.

The full description of the problem and the results of the analysis can be found in this slide deck.

Every EFO id has a label and zero or more synonyms (note: no distinction has been made between different types of synonyms, e.g. exact and related synonyms). Given that, there are three problematic situations, plus one (case 2) that does not hamper the mapping but is odd:

  1. The same disease label is shared by multiple ids: This should never happen but it does, although the numbers are very low and some repeats may be due to issues in the JSON rather than problems in the ontology. The clearest example is 3-phosphoglyceric acid, which is the label of both EFO_0010450 and CHEBI_17050. There are 14 ids with repeated labels when the original spelling is used. When all the labels are converted to lower-case, the number goes up to 56. This second approach is useful to identify cases like HP_0100825 - "Cheilitis" vs MONDO_0002102 - "cheilitis" or EFO_1000616 - "Uveal Melanoma" vs HP_0007716 - "Uveal melanoma". The list of repeated labels and their ids can be found in this spreadsheet (tab efo_3_25_0_repeated_labels for the first one and efo_3_25_0_repeated_labels_lowercase for the second one).

  2. The label is also a synonym: This case does not really cause any problems for the mapping, but the duplication is unnecessary. There are 5,007 ids in which this happens (full table in tab efo_3_25_0_equal_label_and_synonym) and 5,497 more if the labels and synonyms are in lower-case (tab efo_3_25_0_equal_label_and_synonym_lowercase).

  3. The label of a disease is the synonym of another: There are 1,236 ids affected (tab efo_3_25_0_intersect_labels_synonyms_lowercase), 2,209 in lower-case (tab efo_3_25_0_intersect_labels_synonyms). In the spreadsheet, the column repeated_name contains the repeated label, id_1 is the id of the EFO term whose label is repeated and id_2 is the id of the term whose synonym is repeated.

  4. The synonym appears in multiple ids: This affects the mapping but it's possible that it mostly happens with non-disease terms like cell lines, which may have very similar names and, hence, repeated synonyms. There are 3,196 ids with this problem (tab efo_3_25_0_repeated_synonym), 3,887 in lower-case (tab efo_3_25_0_repeated_synonym_lowercase).

In total, considering that some terms may have more than one of the three problematic categories mentioned above (1, 3 and 4), there are 3,619 EFO terms affected (full list in tab problematic_ids). The total goes up to 4,548 when the analysis is done in lower-case (tab problematic_lowercase_ids).
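The four checks above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the analysis: the `TERMS` layout, the `find_duplicates` helper and the `TOY_*` records are assumptions made for the example, and counting both ids of a case 3 pair as problematic is also an assumption.

```python
from collections import defaultdict

# Toy records: id -> (label, synonyms). The first six ids and their names are
# quoted in this thread; the TOY_* entries are invented to exercise cases 2 and 4.
TERMS = {
    "EFO_0010450": ("3-phosphoglyceric acid", []),
    "CHEBI_17050": ("3-phosphoglyceric acid", []),
    "HP_0100825": ("Cheilitis", []),
    "MONDO_0002102": ("cheilitis", []),
    "EFO_0000221": ("acute monocytic leukemia", ["monocytic leukemia"]),
    "MONDO_0004600": ("monocytic leukemia", []),
    "TOY_0000001": ("toy disease", ["toy disease", "shared name"]),
    "TOY_0000002": ("other toy disease", ["shared name"]),
}


def find_duplicates(terms, lowercase=False):
    """Report the four kinds of duplication, optionally ignoring case."""
    norm = str.lower if lowercase else str
    label_ids = defaultdict(set)    # normalised label -> ids using it as label
    synonym_ids = defaultdict(set)  # normalised synonym -> ids using it as synonym
    label_is_synonym = set()        # case 2: a term's label is also its own synonym

    for term_id, (label, synonyms) in terms.items():
        label_ids[norm(label)].add(term_id)
        names = {norm(s) for s in synonyms}
        if norm(label) in names:
            label_is_synonym.add(term_id)
        for name in names:
            synonym_ids[name].add(term_id)

    # Case 1: the same label on more than one id.
    repeated_labels = {n: ids for n, ids in label_ids.items() if len(ids) > 1}
    # Case 3: (repeated_name, id_1, id_2), where id_1's label is id_2's synonym.
    label_vs_synonym = sorted(
        (name, id_1, id_2)
        for name, ids in label_ids.items()
        for id_1 in ids
        for id_2 in synonym_ids.get(name, set()) - {id_1}
    )
    # Case 4: the same synonym on more than one id.
    repeated_synonyms = {n: ids for n, ids in synonym_ids.items() if len(ids) > 1}

    # Union of cases 1, 3 and 4 (assumption: both ids of a case 3 pair count).
    problematic = set().union(*repeated_labels.values(), *repeated_synonyms.values())
    problematic.update(i for _, id_1, id_2 in label_vs_synonym for i in (id_1, id_2))

    return {
        "repeated_labels": repeated_labels,
        "label_is_synonym": label_is_synonym,
        "label_vs_synonym": label_vs_synonym,
        "repeated_synonyms": repeated_synonyms,
        "problematic_ids": problematic,
    }


exact = find_duplicates(TERMS)
lower = find_duplicates(TERMS, lowercase=True)
print(sorted(exact["repeated_labels"]))  # labels repeated with original spelling
print(sorted(lower["repeated_labels"]))  # labels repeated after lower-casing
```

On this toy data the "Cheilitis" / "cheilitis" pair is only reported in the lower-case run, which mirrors why the lower-case totals above are higher.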

paolaroncaglia commented 3 years ago

A quick note, the following tickets are also related to the issues above: https://github.com/EBISPOT/efo/issues/276 https://github.com/EBISPOT/efo/issues/750

zoependlington commented 3 years ago

Note to self: I've managed to identify 422 cases here for exact synonyms occurring in other terms and 4,867 where a term has the same label and synonym.

Of these 303+2987 are hopefully directly editable synonyms in EFO. I will focus on getting these fixed for the next release.

zoependlington commented 3 years ago

Update: It looks like many of the cases I have looked into come from Mondo mappings, e.g. MONDO_0007896 (acute monocytic leukemia) and EFO_0000221 (acute monocytic leukemia) are mapped, with the Mondo term bringing in the synonym "monocytic leukemia".

We also import from Mondo - MONDO_0004600 (monocytic leukemia), therefore this issue will need to be discussed with the Mondo team.

zoependlington commented 3 years ago

Many of these may have been fixed by the most recent cancer branch update. Will look into this further.

ireneisdoomed commented 2 years ago

Hi!

We have received an email via the help desk about incoherent evidence for acute kidney failure in our pipeline that processes literature from EPMC. The problem is ultimately related to the issue described here.

To provide some context, this pipeline captures all co-occurrences between targets and diseases described in literature. These hits are represented in free text, which we later post-process to map the labels to our entities: for targets we use our index based on Ensembl, and for diseases our index based on EFO. We call this process "grounding".

The user is reporting potentially duplicated evidence that associates SDC1 with Acute kidney injury (HP_0001919) and acute kidney failure (MONDO_0002492). These are not duplicates as such, because in principle one refers to the phenotype and the other to the disease.

The issue that I want to raise here is that acute kidney failure, the label of the disease in MONDO, is also a synonym of the phenotype in HP. Our grounding algorithm will not be able to discern from the label whether it is referring to a phenotype, or to a disease, hence mapping it to both terms.
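To make the ambiguity concrete, here is a minimal sketch of a name-based lookup. It is not our actual grounding code: `TERMS`, `build_index` and `ground` are hypothetical names, and the index only holds the two terms mentioned above.

```python
from collections import defaultdict

# The two terms discussed above, with the label and synonym quoted in this comment.
TERMS = {
    "HP_0001919": {"label": "Acute kidney injury", "synonyms": ["acute kidney failure"]},
    "MONDO_0002492": {"label": "acute kidney failure", "synonyms": []},
}


def build_index(terms):
    """Map every lower-cased label or synonym to the ids that carry it."""
    index = defaultdict(set)
    for term_id, term in terms.items():
        for name in [term["label"], *term["synonyms"]]:
            index[name.lower()].add(term_id)
    return index


def ground(mention, index):
    """Return every id whose label or synonym matches the mention."""
    return sorted(index.get(mention.lower(), set()))


INDEX = build_index(TERMS)
print(ground("acute kidney failure", INDEX))  # ['HP_0001919', 'MONDO_0002492']
print(ground("Acute kidney injury", INDEX))   # ['HP_0001919']
```

With only the name to go on, the first lookup cannot tell the phenotype from the disease, so both ids come back.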

I want to raise the question of whether EFO should do an extra step where these cases are checked when it comes to importing ontologies. Does it make sense that two terms that share the same label coexist in different branches of the ontology?

dhimmel commented 7 months ago

Noting the following duplicate labels in EFO OTAR Slim v3.63.0

```py
from nxontology import NXOntology

url = "https://github.com/related-sciences/nxontology-data/raw/a06970368ee9ee3b3109592cd58ad08918673d14/efo_otar_slim.json.gz"
nxo = NXOntology.read_node_link_json(url)
nxo.graph.graph
nxo._get_name_to_node_info()
```