EBISPOT / efo

Github repo for the Experimental Factor Ontology (EFO)
https://www.ebi.ac.uk/efo/

Large number of duplicated terms #1645

Closed d0choa closed 2 years ago

d0choa commented 2 years ago

At least in 3.42 and 3.43, there are a large number of duplicated terms in EFO, mostly affecting rare diseases.

Just by lower-casing the names and looking for exact matches, there are 3036 duplicated terms (v3.42). Some of them are explained by the disease vs. phenotype conundrum, but the vast majority correspond to a MONDO vs. Orphanet duplication.

Some examples:

Hemophilia Orphanet:448 - hemophilia MONDO:0018660
Fragile X syndrome Orphanet:908 - fragile X syndrome MONDO:0010383
Apert syndrome Orphanet:87 - apert syndrome MONDO:0007041

...
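
The check described above (lower-casing the names and looking for exact matches) could be sketched roughly as follows. This is only an illustration, not the script that produced the counts in this thread: the function name `find_duplicate_labels` and the `(id, label)` input shape are assumptions, and the sample rows are just term names and IDs quoted in this issue.

```python
from collections import defaultdict

# Sample (term ID, label) pairs taken from the examples in this thread.
# A real run would load every term label from an EFO release instead.
SAMPLE_TERMS = [
    ("Orphanet:448", "Hemophilia"),
    ("MONDO:0018660", "hemophilia"),
    ("Orphanet:908", "Fragile X syndrome"),
    ("MONDO:0010383", "fragile X syndrome"),
    ("Orphanet:87", "Apert syndrome"),
    ("MONDO:0007041", "apert syndrome"),
    # Appears only once in this sample, so it must not be reported.
    ("MONDO:0020642", "Polycystic Kidney Disease"),
]


def find_duplicate_labels(terms):
    """Group term IDs by lower-cased label and keep labels shared by 2+ IDs."""
    groups = defaultdict(set)
    for term_id, label in terms:
        groups[label.strip().lower()].add(term_id)
    return {label: sorted(ids) for label, ids in groups.items() if len(ids) > 1}


DUPLICATES = find_duplicate_labels(SAMPLE_TERMS)

if __name__ == "__main__":
    for label, ids in sorted(DUPLICATES.items()):
        print(label, ids)
```

Grouping by ID rather than counting labels keeps the colliding identifiers together, which makes it easy to see whether a collision is an Orphanet/MONDO pair or something else.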

zoependlington commented 2 years ago

Hi @d0choa, I believe this is due to the gradual replacement of Orphanet terms with Mondo. Each of these duplicates will eventually become an obsoleted Orphanet term with a "replaced by" link to the Mondo term. I will try to prioritise the removal of some of these in time for the July (18th) release.

zoependlington commented 2 years ago

The Orphanet terms have now been obsoleted and replaced with Mondo terms, which should fix this duplication as of the July release. Please let me know if it persists.

ireneisdoomed commented 2 years ago

I have checked the latest release (3.44.0) and we no longer have Orphanet/MONDO duplication. Thanks @zoependlington!

However, there are still 49 examples with an identical name after converting them to lowercase. Some of them, like arterial occlusion, might be coming from the disease vs. phenotype conundrum.

[image]

d0choa commented 2 years ago

3 of these have already been fixed in #1698

Many others remain genuine duplications (e.g. polycystic kidney disease)

zoependlington commented 2 years ago

I will add mappings for the following:

http://purl.obolibrary.org/obo/MONDO_0021184    http://www.ebi.ac.uk/efo/EFO_1001303    deltaretrovirus infections    deltaretrovirus infections
http://purl.obolibrary.org/obo/MONDO_0011014    http://www.ebi.ac.uk/efo/EFO_0009052    Pleuropulmonary blastoma    Pleuropulmonary blastoma
http://purl.obolibrary.org/obo/MONDO_0700092    http://www.ebi.ac.uk/efo/EFO_0010642    neurodevelopmental disorder    neurodevelopmental disorder
http://purl.obolibrary.org/obo/MONDO_0012368    http://www.ebi.ac.uk/efo/EFO_1001981    aminoacylase 1 deficiency    aminoacylase 1 deficiency
http://purl.obolibrary.org/obo/MONDO_0020642    http://www.ebi.ac.uk/efo/EFO_0008620    Polycystic Kidney Disease    Polycystic Kidney Disease
http://purl.obolibrary.org/obo/MONDO_0019165    http://www.ebi.ac.uk/efo/EFO_0009029    Central precocious puberty    Central precocious puberty
http://purl.obolibrary.org/obo/MONDO_0014776    http://www.ebi.ac.uk/efo/EFO_0009059    Spinocerebellar ataxia type 42    Spinocerebellar ataxia type 42

The rest have either been taken care of (the measurement terms pointed out above) or are phenotype vs disease.
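
For anyone re-checking the pairs listed above, here is a rough sketch of reading such rows and confirming that the two names on each row really are identical once lower-cased. It assumes the rows are tab-separated with the columns in the order shown (MONDO IRI, EFO IRI, then the two labels). The function name `parse_mappings` is hypothetical, and only two of the rows are included to keep it short.

```python
# Two rows copied from the list in this thread, written with explicit tabs.
# Assumption: columns are MONDO IRI, EFO IRI, label, label.
ROWS = (
    "http://purl.obolibrary.org/obo/MONDO_0021184\t"
    "http://www.ebi.ac.uk/efo/EFO_1001303\t"
    "deltaretrovirus infections\tdeltaretrovirus infections\n"
    "http://purl.obolibrary.org/obo/MONDO_0020642\t"
    "http://www.ebi.ac.uk/efo/EFO_0008620\t"
    "Polycystic Kidney Disease\tPolycystic Kidney Disease\n"
)


def parse_mappings(text):
    """Return {efo_iri: mondo_iri}, checking that both labels match ignoring case."""
    mappings = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        mondo_iri, efo_iri, label_a, label_b = line.split("\t")
        if label_a.strip().lower() != label_b.strip().lower():
            raise ValueError(
                f"labels differ for {efo_iri}: {label_a!r} vs {label_b!r}"
            )
        mappings[efo_iri] = mondo_iri
    return mappings


MAPPINGS = parse_mappings(ROWS)

if __name__ == "__main__":
    for efo_iri, mondo_iri in MAPPINGS.items():
        print(efo_iri, "->", mondo_iri)
```

A lookup like this could then be used to skip already-mapped pairs when re-running a duplicate-name check, leaving only the unexplained collisions to review.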

zoependlington commented 2 years ago

These mappings have now been added, so the only remaining duplicates should be between disease/phenotype terms. Please let me know if that isn't the case.