Open AsierGonzalez opened 3 years ago
A quick note: the following tickets are also related to the issues above. https://github.com/EBISPOT/efo/issues/276 https://github.com/EBISPOT/efo/issues/750
Note to self: I've managed to identify 422 cases here for exact synonyms occurring in other terms and 4,867 where a term has the same label and synonym.
Of these 303+2987 are hopefully directly editable synonyms in EFO. I will focus on getting these fixed for the next release.
Update: It looks like many of these that I have looked into come from Mondo mappings e.g. MONDO_0007896 (acute monocytic leukemia) & EFO_0000221 (acute monocytic leukemia) are mapped, with the Mondo term bringing in the synonym of monocytic leukemia:
We also import from Mondo - MONDO_0004600 (monocytic leukemia), therefore this issue will need to be discussed with the Mondo team.
Many of these may have been fixed by the most recent cancer branch update. Will look into this further.
Hi!
We have received an email via the help desk about incoherent evidence for acute kidney failure in our pipeline that processes literature from EPMC that is ultimately related to the issue that is described here.
To provide some context, this pipeline captures all co-occurrences between targets and diseases described in the literature. These hits are represented as free text, which we later post-process to map the labels to our entities: for targets we use our index based on Ensembl, and for diseases our index based on EFO. We call this process "grounding".
The user is reporting potentially duplicated evidence that associates SDC1 with Acute kidney injury (HP_0001919) and acute kidney failure (MONDO_0002492). These are not duplicates as such because, in principle, one refers to the phenotype and the other to the disease.
The issue that I want to raise here is that `acute kidney failure`, the label of the disease in MONDO, is also a synonym of the phenotype in HP. Our grounding algorithm cannot discern from the label alone whether it refers to a phenotype or to a disease, and hence maps it to both terms.
I want to raise the question of whether EFO should do an extra step where these cases are checked when it comes to importing ontologies. Does it make sense that two terms that share the same label coexist in different branches of the ontology?
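To make the ambiguity concrete, here is a minimal sketch of a name-based lookup. It is not the actual Open Targets grounding code: the `terms` dictionary layout and the `build_name_index` and `ground` helpers are assumptions made for illustration, and only the two terms mentioned above are included.

```python
from collections import defaultdict

# Toy input (assumed layout): term id -> label plus synonyms.
# Only the two terms discussed in this comment are included.
terms = {
    "HP_0001919": {
        "label": "Acute kidney injury",
        "synonyms": ["acute kidney failure"],
    },
    "MONDO_0002492": {
        "label": "acute kidney failure",
        "synonyms": [],
    },
}


def build_name_index(terms):
    """Map every lower-cased label and synonym to the ids that carry it."""
    index = defaultdict(set)
    for term_id, term in terms.items():
        for name in [term["label"], *term.get("synonyms", [])]:
            index[name.lower()].add(term_id)
    return index


def ground(mention, index):
    """Return every id whose label or synonym matches the mention."""
    return sorted(index.get(mention.lower(), set()))


index = build_name_index(terms)
# The same string resolves to both the phenotype and the disease.
print(ground("acute kidney failure", index))
```

With a purely string-based index like this, nothing in the lookup can tell the phenotype from the disease, so both ids come back for the same mention.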
Noting the following duplicate labels in EFO OTAR Slim v3.63.0
While investigating issues with disease mappings to EFO in Open Targets we have realised that labels and synonyms are not unique as we would expect for the majority of the terms. We have done a systematic analysis of EFO versions 3.24.0 and 3.25.0 using the JSON files available to download to measure how widespread this issue is. The numbers presented here belong to version 3.25.0.
The full description of the problem and the results of the analysis can be found in this slide deck.
Considering that every EFO id has a label and zero or more synonyms (note: no distinction has been made between different types of synonyms, e.g. `exact` and `related` synonyms), there are three problematic situations and an extra one that, although it doesn't hamper the mapping, is odd:

1. The same disease label is shared by multiple ids: This should never happen but it does, although the numbers are very low and some repeats may be due to issues in the JSON rather than problems in the ontology. The clearest example is `3-phosphoglyceric acid`, which is the label of both EFO_0010450 and CHEBI_17050. There are 14 ids with repeated labels when the original spelling is used. When all the labels are converted into lower-case the number goes up to 56. This second approach is useful to identify cases like HP_0100825 - "Cheilitis" vs MONDO_0002102 - "cheilitis" or EFO_1000616 - "Uveal Melanoma" vs HP_0007716 - "Uveal melanoma". The list of repeated labels and their ids can be found in this spreadsheet (tab `efo_3_25_0_repeated_labels` for the first one and `efo_3_25_0_repeated_labels_lowercase` for the second one).
2. The label is also a synonym: This case does not really cause any problems for the mapping, but the duplication is unnecessary. There are 5,007 ids in which this happens (full table in tab `efo_3_25_0_equal_label_and_synonym`) and 5,497 more if the labels and synonyms are in lower-case (tab `efo_3_25_0_equal_label_and_synonym_lowercase`).
3. The label of a disease is the synonym of another: There are 1,236 ids affected (tab `efo_3_25_0_intersect_labels_synonyms_lowercase`), 2,209 in lower-case (tab `efo_3_25_0_intersect_labels_synonyms`). In the spreadsheet, the column `repeated_name` contains the repeated label, `id_1` is the id of the EFO term whose label is repeated and `id_2` is the id of the term whose synonym is repeated.
4. The synonym appears in multiple ids: This affects the mapping but it's possible that it mostly happens with non-disease terms like cell lines, which may have very similar names and, hence, repeated synonyms. There are 3,196 ids with this problem (tab `efo_3_25_0_repeated_synonym`), 3,887 in lower-case (tab `efo_3_25_0_repeated_synonym_lowercase`).

In total, considering that some terms may have more than one of the three problematic categories mentioned above (1, 3 and 4), there are 3,619 EFO terms affected (full list in tab `problematic_ids`). The total goes up to 4,548 when the analysis is done in lower-case (tab `problematic_lowercase_ids`).
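For anyone wanting to reproduce this kind of check, the four situations can be sketched roughly as follows. This is not the script used for the analysis above: the input layout (id mapped to a label and a list of synonyms) and the `find_name_collisions` helper are assumptions, and the toy data only reuses ids and names already mentioned in this thread.

```python
from collections import defaultdict

# Toy input (assumed layout, not the real EFO JSON structure).
terms = {
    "EFO_0010450": {"label": "3-phosphoglyceric acid", "synonyms": []},
    "CHEBI_17050": {"label": "3-phosphoglyceric acid", "synonyms": []},
    "HP_0100825": {"label": "Cheilitis", "synonyms": []},
    "MONDO_0002102": {"label": "cheilitis", "synonyms": []},
    "HP_0001919": {"label": "Acute kidney injury",
                   "synonyms": ["acute kidney failure"]},
    "MONDO_0002492": {"label": "acute kidney failure", "synonyms": []},
}


def find_name_collisions(terms, lowercase=False):
    """Collect the four kinds of label/synonym collisions described above."""
    norm = str.lower if lowercase else (lambda s: s)
    label_ids = defaultdict(set)    # label -> ids using it as label
    synonym_ids = defaultdict(set)  # synonym -> ids using it as synonym
    label_equals_synonym = set()    # ids whose label is also their own synonym

    for term_id, term in terms.items():
        label = norm(term["label"])
        synonyms = {norm(s) for s in term.get("synonyms", [])}
        label_ids[label].add(term_id)
        for synonym in synonyms:
            synonym_ids[synonym].add(term_id)
        if label in synonyms:
            label_equals_synonym.add(term_id)

    # A label (or synonym) is "repeated" when more than one id carries it.
    repeated_labels = {n: ids for n, ids in label_ids.items() if len(ids) > 1}
    repeated_synonyms = {n: ids for n, ids in synonym_ids.items() if len(ids) > 1}

    # (repeated_name, id_1) -> set of id_2 whose synonym equals id_1's label.
    label_is_other_synonym = {}
    for name, ids in label_ids.items():
        for id_1 in ids:
            others = synonym_ids.get(name, set()) - {id_1}
            if others:
                label_is_other_synonym[(name, id_1)] = others

    return {
        "repeated_labels": repeated_labels,
        "label_equals_synonym": label_equals_synonym,
        "label_is_other_synonym": label_is_other_synonym,
        "repeated_synonyms": repeated_synonyms,
    }


exact = find_name_collisions(terms)
lower = find_name_collisions(terms, lowercase=True)
print(sorted(exact["repeated_labels"]))
print(sorted(lower["repeated_labels"]))
```

Running the check twice, once with the original spelling and once in lower-case, mirrors the two sets of numbers reported above; a case-insensitive pass is what surfaces pairs such as "Cheilitis" and "cheilitis".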