Closed paolaroncaglia closed 5 years ago
Here are all duplicate labels in EFO 3:
https://docs.google.com/spreadsheets/d/1gj4BAt-3GmnLy0QS0VS8dFV0YCP6pj2wIme85PC1ZbE/edit?usp=sharing
I’ve looked at all entries in the spreadsheet of duplicate labels. There are 51 pairs of duplicate labels. Of these, 13 pairs are cases where one term is a disease and the other is a phenotype (these require a general discussion on whether or not to go for disease overall); 35 pairs are cases where both terms are diseases (these don’t require discussion, just merging, and deciding whether to keep the EFO class as primary); 3 pairs are other cases (these require case-by-case analysis).
In the second tab of the spreadsheet above, @zoependlington repeated the SPARQL query, this time case-insensitively. The search confirms the pairs described above and returns 39 extra pairs of duplicate labels. (I have gone through the tab and highlighted in orange the entries that differ in capitalisation; these are the pairs that do NOT overlap with the ones in the first tab of the spreadsheet.) I will look at them tomorrow.
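The query itself is not reproduced in this thread. As a rough illustration of the distinction between the two tabs, here is a minimal Python sketch (not the SPARQL that was actually run) that separates pairs with identical labels from pairs that only match when case is ignored. The function name `split_duplicates` and the `ex:` identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def split_duplicates(terms):
    """Split duplicate-label pairs into exact and case-only matches.

    terms: iterable of (iri, label) tuples.
    Returns (exact, case_only), each a list of (iri_a, iri_b) pairs.
    """
    by_folded = defaultdict(list)
    for iri, label in terms:
        # Group terms by their case-folded label.
        by_folded[label.casefold()].append((iri, label))

    exact, case_only = [], []
    for group in by_folded.values():
        for (iri_a, label_a), (iri_b, label_b) in combinations(group, 2):
            if label_a == label_b:
                exact.append((iri_a, iri_b))
            else:
                case_only.append((iri_a, iri_b))
    return exact, case_only


# Hypothetical identifiers; the labels are examples mentioned in this thread.
terms = [
    ("ex:A1", "dementia"),
    ("ex:B1", "dementia"),
    ("ex:A2", "nuc-seq"),
    ("ex:B2", "Nuc-seq"),
    ("ex:A3", "asthma"),
]
exact, case_only = split_duplicates(terms)
```

A case-only match is just a candidate for review; it does not by itself prove that the two terms mean the same thing.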
I ran the new query over EFO 2. The third tab on the sheet shows the two duplicate pairs that are in EFO 2 right now.
@zoependlington FYI both are also in EFO 3.
@daniwelter has pointed out that nuc-seq and Nuc-seq are two different things, so this is not a duplication.
Note for self
I think I can fix both the nuc-seq/Nuc-seq problem and the sequencing assay RAD one by updating the labels where appropriate and adding the duplicate label as a synonym.
I’ve looked at all entries in the spreadsheet of duplicate labels that differ by capitalisation. There are 39 pairs of duplicate labels. Of these, 17 pairs are cases where one term is a disease and the other is a phenotype (these require a general discussion on whether or not to go for disease overall); 11 pairs are cases where both terms are diseases (these don’t require discussion, just merging, and deciding whether to keep the EFO class as primary); 11 pairs are other cases (these require case-by-case analysis), but 2 have already been fixed by Dani, so 9 remain.
Summing up this ticket so far:
Regardless of identical vs. differing capitalization,
[ ] We need to address 30 cases of duplicate labels where one term is a phenotype and the other a disease. (Most of these are HP vs. MONDO, but a few are EFO vs. MONDO where the EFO term has an HP parent or ancestor.) This requires a general discussion on going for disease vs. phenotype overall, or not, partly depending on Open Targets needs; we'll address this general discussion, with a few neuro-related examples, at the Integration Day Neuro Workshop on Oct. 11th. UPDATE 3/12/2018: related ticket in MONDO tracker: https://github.com/monarch-initiative/mondo/issues/559
[x] We need to address 46 cases of duplicate labels where both terms are diseases. (These are all EFO vs. MONDO.) Do we merge pairs of terms, and if so, which class do we keep as primary: EFO or MONDO? Or do we keep one term and deprecate the other?
[x] We need to address 12 cases of duplicate labels that are none of the above. (Most are MONDO vs. ORDO where the MONDO term is a descendant of 'disease or disorder' while the ORDO term is a descendant of 'group of disorders' which is a sibling of 'disease'.) Shall we keep MONDO classes or ORDO classes?
[x] Not a must-have but a nice-to-have, and something I think we already mentioned in the past: it would be really nice (at least for a former GO editor...) to make all EFO labels NOT capitalised, other than disease names based on their discoverer or similar, of course. That way we would be more consistent and would align better with MONDO (I recall that @cmungall pushed for that format in MONDO). We'd still have to deal with different capitalization in labels of terms imported from other ontologies. I can move this to a new ticket if we want to have a separate one for this in the backlog or icebox.
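For reference, the totals in the first three items follow from the two counts reported earlier in this thread (identical labels, then labels differing only in capitalisation). A short check of the arithmetic, with variable names chosen here for illustration:

```python
# Each sum is (identical labels) + (labels differing only in capitalisation).
disease_vs_phenotype = 13 + 17
both_diseases = 35 + 11
# 2 of the 11 capitalisation-only "other" cases had already been fixed.
other = 3 + (11 - 2)

total_remaining = disease_vs_phenotype + both_diseases + other
print(disease_vs_phenotype, both_diseases, other, total_remaining)
```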
@paolaroncaglia Regarding your second and third points: any EFO vs. MONDO disease terms should be looked at with regard to mapping. They may be missing from the mapping file, in which case the MONDO terms are imported as they are rather than being replaced with the EFO IDs. The same should be true of ORDO vs. MONDO. I.e., any disease term (EFO or ORDO) that duplicates a MONDO term should be mapped in the mapping file. I'll take a look at these terms and see what's going on there.
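The check described here could be sketched roughly as follows: given the duplicate pairs and the contents of the mapping file, list the pairs that have no corresponding mapping entry. This is only an illustration under assumptions; the function name `unmapped_pairs`, the in-memory dictionary, and the `_example` identifiers are hypothetical, and the real mapping file is processed by the build, not by this code.

```python
OBO = "http://purl.obolibrary.org/obo/"
EFO = "http://www.ebi.ac.uk/efo/"


def unmapped_pairs(duplicate_pairs, mapping):
    """Return the (mondo_iri, other_iri) pairs that the mapping does not cover.

    duplicate_pairs: list of (mondo_iri, other_iri) tuples.
    mapping: dict from MONDO IRI to the EFO/ORDO IRI that should replace it.
    """
    return [
        (mondo, other)
        for mondo, other in duplicate_pairs
        if mapping.get(mondo) != other
    ]


# One mapping taken from a row quoted in this thread.
mapping = {OBO + "MONDO_0018841": EFO + "EFO_0009039"}

duplicate_pairs = [
    (OBO + "MONDO_0018841", EFO + "EFO_0009039"),
    (OBO + "MONDO_example", EFO + "EFO_example"),  # hypothetical placeholder
]

missing = unmapped_pairs(duplicate_pairs, mapping)
```

Any pair returned in `missing` would be a candidate for a new row in the mapping file.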
Thanks @zoependlington !
Mapping issue for these disease terms should be fixed. I'm re-running the EFO 3 duplicates query and will put the results on the spreadsheet.
@zoependlington I've looked at your last tab in the spreadsheet https://docs.google.com/spreadsheets/d/1gj4BAt-3GmnLy0QS0VS8dFV0YCP6pj2wIme85PC1ZbE/edit#gid=1432080342 ("EFO 3 duplicates after remapping"), and all cases are actually disease vs. phenotype (in the only one that you marked otherwise, the EFO term is a child of an HP term). All other cases I reported previously seem to be gone after your remapping. So, to finish addressing this ticket, we need to resolve how Open Targets would like us to represent such cases. There seemed to be a preference for disease. Future discussion with OTAR is the subject of https://github.com/EBISPOT/efo/issues/270
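The rule used above (an EFO term counts as a phenotype if it sits under an HP term) can be illustrated with a small ancestor walk. This is a sketch under assumptions: the parent map is a plain dictionary built for the example, `has_hp_ancestor` is a hypothetical helper, and the `_example` IRIs are placeholders, not real terms.

```python
HP_PREFIX = "http://purl.obolibrary.org/obo/HP_"
EFO_PREFIX = "http://www.ebi.ac.uk/efo/EFO_"


def has_hp_ancestor(iri, parents):
    """Return True if the term itself or any of its ancestors is an HP term.

    parents: dict mapping a term IRI to a list of its direct parent IRIs.
    """
    seen = set()
    stack = [iri]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and repeated visits
        seen.add(current)
        if current.startswith(HP_PREFIX):
            return True
        stack.extend(parents.get(current, []))
    return False


# Hypothetical hierarchy: an EFO term placed under an HP term,
# and another EFO term with no HP ancestor.
parents = {
    EFO_PREFIX + "example_child": [EFO_PREFIX + "example_mid"],
    EFO_PREFIX + "example_mid": [HP_PREFIX + "example"],
    EFO_PREFIX + "example_other": [EFO_PREFIX + "example_root"],
}

under_hp = has_hp_ancestor(EFO_PREFIX + "example_child", parents)
not_under_hp = has_hp_ancestor(EFO_PREFIX + "example_other", parents)
```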
I started a further tab in the spreadsheet https://docs.google.com/spreadsheets/d/1gj4BAt-3GmnLy0QS0VS8dFV0YCP6pj2wIme85PC1ZbE/edit#gid=353031662 called "Children and comments", and am going through it. To be completed.
FYI, during the process, I created a ticket for MONDO https://github.com/monarch-initiative/mondo/issues/438.
Minor edits:
To Do:
Add http://purl.obolibrary.org/obo/MONDO_0018841 http://www.ebi.ac.uk/efo/EFO_0009039 congenital bile acid synthesis defect Congenital bile acid synthesis defect
to the mapping file.
Note for OntoTools scrum 1/11/2018: this is now in @zoependlington's hands, as agreed with her. Thanks Zoe!
I have added http://purl.obolibrary.org/obo/MONDO_0018841 http://www.ebi.ac.uk/efo/EFO_0009039 congenital bile acid synthesis defect Congenital bile acid synthesis defect
to the mapping file.
I have also reclassified aplastic anemia.
Outstanding issues:
The part of the makefile that needs to be edited is:
# unmerge imports from edit and add import files
imports/mondo_efo_import.owl: imports/mondo_efo_mappings.tsv imports/mondo_remove.owl
	java -jar ../../bin/mondo-id-switch.jar imports/mondo_efo_mappings.tsv imports/mondo_remove.owl imports/mondo_efo_import.owl && $(ROBOT) -v annotate -i imports/mondo_efo_import.owl --ontology-iri http://www.ebi.ac.uk/efo/imports/mondo_efo_import.owl -o $@

imports/mondo_remove.owl: imports/mondo_import.owl
	$(ROBOT) remove -i $< -T imports/duplicates.txt -o $@
Implemented with ROBOT 1.2.0 (I updated the version of ROBOT in the bin directory). However, I am still getting the "duplicated in the mapping file" messages for everything that's in the mapping file, so I won't push this edit to GitHub just yet.
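The cause of these messages is not established in this thread. One thing that could be ruled out independently is whether any MONDO IRI really does appear more than once in the mapping file. A minimal sketch, assuming the file is tab-separated with the MONDO IRI in the first column (as the row quoted in this thread suggests); `repeated_keys` is a hypothetical helper and the sample rows with `_example` IRIs are invented:

```python
from collections import Counter


def repeated_keys(tsv_text):
    """Return the first-column values that occur on more than one row."""
    keys = [
        line.split("\t")[0]
        for line in tsv_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    return [key for key, count in Counter(keys).items() if count > 1]


# Invented sample content; the real file is imports/mondo_efo_mappings.tsv.
sample = (
    "http://purl.obolibrary.org/obo/MONDO_0018841\thttp://www.ebi.ac.uk/efo/EFO_0009039\n"
    "http://purl.obolibrary.org/obo/MONDO_example\thttp://www.ebi.ac.uk/efo/EFO_example\n"
    "http://purl.obolibrary.org/obo/MONDO_example\thttp://www.ebi.ac.uk/efo/EFO_example\n"
)

repeats = repeated_keys(sample)
```

If this returns an empty list for the real file, the message is probably produced for some other reason.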
Added to EFO3. Duplication error remains. Will create new ticket to address this.
This stems from my chance realization that there are 2 ‘dementia’ terms in EFO 3, one from HP and one from MONDO, with two significantly different lineages; see https://github.com/EBISPOT/efo/issues/15#issuecomment-425030574.
Duplicate labels may or may not cause errors; @zoependlington notes:
“I think there are two ways around this. a) We check for duplicate labels and that causes an error - as is the case in EFO 2 (although I found a duplicate that apparently has gone unnoticed for a while the other day), or b) In this case, we have two terms with the same label BUT one is a phenotype and one is a disease term - we briefly touched on this the other day when we had an impromptu meeting with Chris. I guess this is a good thing to bring up in the next meeting since it may be a case that we have phenotype terms that are actually disease terms that need fixing or we need to make a decision on how to handle this”.
Either way, it looks like the identification of duplicate labels might not be foolproof at this stage, and it would be worth looking more closely in case there are more. @zoependlington kindly volunteered to generate a list of type b) above with a SPARQL query and copy it here.
Note: in the case of 'dementia', the duplicate labels are identical; but there may be cases where labels differ in capitalisation or presence/absence of hyphens.
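One way to catch such near-duplicates is to compare labels on a normalised key instead of the raw string. The sketch below is an assumption about how that could be done, not something agreed in this thread; `label_key` is a hypothetical helper, and removing hyphens and whitespace entirely is a deliberately aggressive choice that may over-match.

```python
import re


def label_key(label):
    """Normalise a label for comparison: ignore case, hyphens and whitespace."""
    folded = label.casefold()
    # Drop ASCII hyphens and any whitespace so that "X-linked",
    # "X linked" and "Xlinked" all produce the same key.
    return re.sub(r"[-\s]+", "", folded)


same_key = label_key("Nuc-seq") == label_key("nuc seq")
different_key = label_key("dementia") == label_key("vascular dementia")
```

Labels that share a key are only candidates for review: as noted for nuc-seq and Nuc-seq, two labels can look alike and still refer to different things.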