EBISPOT / efo

Github repo for the Experimental Factor Ontology (EFO)
https://www.ebi.ac.uk/efo/

Duplicate labels in EFO 3 #261

Closed paolaroncaglia closed 5 years ago

paolaroncaglia commented 5 years ago

This stems from my chance discovery that EFO 3 contains two ‘dementia’ terms, one from HP and one from MONDO, with significantly different lineages; see https://github.com/EBISPOT/efo/issues/15#issuecomment-425030574.

Duplicate labels may or may not cause errors; @zoependlington notes:

“I think there are two ways around this. a) We check for duplicate labels and that causes an error - as is the case in EFO 2 (although I found a duplicate that apparently has gone unnoticed for a while the other day), or b) In this case, we have two terms with the same label BUT one is a phenotype and one is a disease term - we briefly touched on this the other day when we had an impromptu meeting with Chris. I guess this is a good thing to bring up in the next meeting since it may be a case that we have phenotype terms that are actually disease terms that need fixing or we need to make a decision on how to handle this”.

Either way, it looks like the identification of duplicate labels might not be foolproof at this stage, so it is worth checking more closely in case there are more. @zoependlington kindly volunteered to generate a list of type b) above with a SPARQL query and copy it here.

Note: in the case of 'dementia', the duplicate labels are identical; but there may be cases where labels differ in capitalisation or presence/absence of hyphens.
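A minimal Python sketch of the kind of check described in the note above. It is not the SPARQL query used for the spreadsheet. `normalise_label` and `find_duplicate_labels` are hypothetical helpers, the normalisation rule (fold case, treat hyphens as spaces) is an assumption, and the `ex:` identifiers are placeholders rather than real term IRIs.

```python
from collections import defaultdict


def normalise_label(label: str) -> str:
    """Fold case and treat hyphens as spaces (assumed rule, for illustration)."""
    return " ".join(label.casefold().replace("-", " ").split())


def find_duplicate_labels(terms):
    """Group (iri, label) pairs by normalised label; keep groups with >1 IRI."""
    groups = defaultdict(set)
    for iri, label in terms:
        groups[normalise_label(label)].add(iri)
    return {key: sorted(iris) for key, iris in groups.items() if len(iris) > 1}


# Placeholder identifiers, not real ontology IRIs.
terms = [
    ("ex:HP_A", "dementia"),
    ("ex:MONDO_A", "dementia"),
    ("ex:EFO_B", "Nuc-seq"),
    ("ex:EFO_C", "nuc-seq"),
    ("ex:EFO_D", "some other label"),
]
duplicates = find_duplicate_labels(terms)
```

Groups found this way are only candidates for manual review: labels that differ in capitalisation or hyphenation may still denote different things.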

zoependlington commented 5 years ago

Here are all duplicate labels in EFO 3:

https://docs.google.com/spreadsheets/d/1gj4BAt-3GmnLy0QS0VS8dFV0YCP6pj2wIme85PC1ZbE/edit?usp=sharing

paolaroncaglia commented 5 years ago

I’ve looked at all entries in the spreadsheet of duplicate labels. There are 51 pairs of duplicate labels. Of these, 13 pairs are cases where one term is a disease and the other is a phenotype (these require a general discussion on whether to go for disease overall); 35 pairs are cases where both terms are diseases (these don’t require discussion, just merging and deciding whether to keep the EFO class as primary); and 3 pairs are other cases (these require case-by-case analysis).

paolaroncaglia commented 5 years ago

In the second tab of the spreadsheet above, @zoependlington repeated the SPARQL query, this time case-insensitively. The search confirms the pairs described above and returns 39 extra pairs of duplicate labels. I have gone through the tab and highlighted in orange the entries with differences in capitalisation; these are the pairs that do NOT overlap with the ones in the first tab of the spreadsheet. Will look at them tomorrow.
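The comparison between the two tabs can be sketched as a set difference. This is an illustration only: `pair_key` and `extra_pairs` are hypothetical helpers, each result row is assumed to be a pair of term IRIs, and the `ex:` identifiers are placeholders.

```python
def pair_key(iri_a: str, iri_b: str) -> frozenset:
    """Order-independent key for a pair of term IRIs."""
    return frozenset((iri_a, iri_b))


def extra_pairs(exact_rows, insensitive_rows):
    """Pairs present in the case-insensitive results but not in the exact ones."""
    exact = {pair_key(a, b) for a, b in exact_rows}
    insensitive = {pair_key(a, b) for a, b in insensitive_rows}
    return sorted(tuple(sorted(pair)) for pair in insensitive - exact)


# Placeholder rows standing in for the first and second spreadsheet tabs.
exact_rows = [("ex:HP_A", "ex:MONDO_A")]
insensitive_rows = [
    ("ex:MONDO_A", "ex:HP_A"),   # same pair as in the exact results, reversed
    ("ex:EFO_B", "ex:MONDO_B"),  # labels differ only in capitalisation
]
new_pairs = extra_pairs(exact_rows, insensitive_rows)
```

Using an order-independent key avoids counting a pair twice when the two queries return the same terms in a different order.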

zoependlington commented 5 years ago

I ran the new query over EFO 2. The third tab on the sheet shows the two duplicate pairs that are in EFO 2 right now.

paolaroncaglia commented 5 years ago

@zoependlington FYI both are also in EFO 3.

zoependlington commented 5 years ago

@daniwelter has pointed out that nuc-seq and Nuc-seq are two different things, so this is not a duplication.

paolaroncaglia commented 5 years ago

Note for self

daniwelter commented 5 years ago

I think I can fix both the nuc-seq/Nuc-seq problem and the sequencing assay RAD one by updating the labels where appropriate and putting the duplicate label as a synonym

paolaroncaglia commented 5 years ago

I’ve looked at all entries in the spreadsheet of duplicate labels that differ by capitalisation. There are 39 pairs of duplicate labels. Of these, 17 pairs are cases where one term is a disease and the other is a phenotype (these require a general discussion on whether to go for disease overall); 11 pairs are cases where both terms are diseases (these don’t require discussion, just merging and deciding whether to keep the EFO class as primary); and 11 pairs are other cases (these require case-by-case analysis), but 2 have already been fixed by Dani, so 9 remain.

paolaroncaglia commented 5 years ago

Summing up this ticket so far:

Regardless of identical vs. differing capitalization,

zoependlington commented 5 years ago

@paolaroncaglia Regarding your second & third points: Any EFO vs. MONDO disease terms should be looked at regarding mapping. They may be missing from the mapping file and, consequently, the MONDO terms are being imported and not replaced with the EFO IDs. The same should be true with ORDO vs MONDO. I.e. Any disease term (EFO or ORDO) vs. MONDO should be mapped in the mapping file. I'll take a look at these terms and see what's going on there.
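A rough Python sketch of that check: given duplicate pairs of the form (MONDO term, EFO or ORDO term), list those whose MONDO term has no row in the mapping file. The column layout (tab-separated, MONDO IRI in the first column, no header row) is an assumption, `load_mapped_mondo_iris` and `unmapped_duplicates` are hypothetical helpers, and the `ex:` identifiers are placeholders.

```python
import csv
import io


def load_mapped_mondo_iris(tsv_text: str) -> set:
    """Collect the first column (assumed: MONDO IRI) of each mapping row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0] for row in reader if row}


def unmapped_duplicates(duplicate_pairs, mapped_mondo):
    """Return (mondo_iri, other_iri) pairs whose MONDO IRI has no mapping row."""
    return [(m, o) for m, o in duplicate_pairs if m not in mapped_mondo]


# Placeholder mapping content; the real file would be read from disk.
mapping_text = "ex:MONDO_A\tex:EFO_A\tlabel a\tLabel a\n"
duplicate_pairs = [
    ("ex:MONDO_A", "ex:EFO_A"),   # already mapped
    ("ex:MONDO_B", "ex:ORDO_B"),  # missing from the mapping file
]
missing = unmapped_duplicates(duplicate_pairs, load_mapped_mondo_iris(mapping_text))
```

Any pair reported here would be a candidate for a new row in the mapping file, so that the MONDO term is replaced on import instead of appearing alongside the EFO or ORDO term.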

paolaroncaglia commented 5 years ago

Thanks @zoependlington !

zoependlington commented 5 years ago

Mapping issue for these disease terms should be fixed. I'm re-running the EFO 3 duplicates query and will put the results on the spreadsheet.

paolaroncaglia commented 5 years ago

@zoependlington I've looked at your last tab in the spreadsheet https://docs.google.com/spreadsheets/d/1gj4BAt-3GmnLy0QS0VS8dFV0YCP6pj2wIme85PC1ZbE/edit#gid=1432080342 ("EFO 3 duplicates after remapping"), and actually all cases are disease vs. phenotype (in the only one that you marked otherwise, the EFO term is a child of an HP term). All other cases I reported previously seem to be gone after your remapping. So, to finish addressing this ticket, we need to decide how Open Targets would like us to represent such cases. There seemed to be a preference for disease. Future discussion with OTAR is the subject of https://github.com/EBISPOT/efo/issues/270

paolaroncaglia commented 5 years ago

I started a further tab in the spreadsheet https://docs.google.com/spreadsheets/d/1gj4BAt-3GmnLy0QS0VS8dFV0YCP6pj2wIme85PC1ZbE/edit#gid=353031662 called "Children and comments", and am going through it. To be completed.

FYI, during the process, I created a ticket for MONDO https://github.com/monarch-initiative/mondo/issues/438.

Minor edits:

zoependlington commented 5 years ago

To Do:

paolaroncaglia commented 5 years ago

Note for OntoTools scrum 1/11/2018: this is now in @zoependlington's hands, as agreed with her. Thanks Zoe!

zoependlington commented 5 years ago

I have added the following entry to the mapping file:

http://purl.obolibrary.org/obo/MONDO_0018841 http://www.ebi.ac.uk/efo/EFO_0009039 congenital bile acid synthesis defect Congenital bile acid synthesis defect

I have also reclassified aplastic anemia.

Outstanding issues:

The part of the makefile that needs to be edited is:

# unmerge imports from edit and add import files

imports/mondo_efo_import.owl: imports/mondo_efo_mappings.tsv imports/mondo_remove.owl
    java -jar ../../bin/mondo-id-switch.jar imports/mondo_efo_mappings.tsv imports/mondo_remove.owl imports/mondo_efo_import.owl && $(ROBOT) -v annotate -i imports/mondo_efo_import.owl --ontology-iri http://www.ebi.ac.uk/efo/imports/mondo_efo_import.owl -o $@

imports/mondo_remove.owl: imports/mondo_import.owl
    $(ROBOT) remove -i $< -T imports/duplicates.txt -o $@

zoependlington commented 5 years ago

Implemented with ROBOT 1.2.0 (I updated the version of ROBOT in the bin directory); however, I am still getting the "duplicated in the mapping file" messages for everything that's in the mapping file. I won't push this edit to GitHub just yet.

zoependlington commented 5 years ago

Added to EFO3. Duplication error remains. Will create new ticket to address this.