Closed nataled closed 2 years ago
I'm adding the following to this list because the difference between each pair is a matter of different casing for a single letter in the name:
R-HSA-8948441 GO:0005654 ENSG00000184557 1 -1 SOCS3 gene [nucleoplasm]
R-HSA-8848151 GO:0005654 ENSG00000184557 1 -1 SOCS3 Gene [nucleoplasm]
R-HSA-4568743 GO:0005654 P68431 2 136 +15=MOD:00064 AcK15-HIST1H3A [nucleoplasm]
R-HSA-4549217 GO:0005654 P68431 2 136 +15=MOD:00064 Ack15-HIST1H3A [nucleoplasm]
R-HSA-4568755 GO:0005654 Q71DI3 2 136 +15=MOD:00064 AcK15-HIST2H3A [nucleoplasm]
R-HSA-4549224 GO:0005654 Q71DI3 2 136 +15=MOD:00064 Ack15-HIST2H3A [nucleoplasm]
R-HSA-6797760 GO:0005654 ENSG00000089685 1 -1 BIRC5 Gene [nucleoplasm]
R-HSA-8948424 GO:0005654 ENSG00000089685 1 -1 BIRC5 gene [nucleoplasm]
Also should add the following cases that appear to be identical except for the presence/absence of a '-' in the name:
R-HSA-427744 GO:0005654 Q71DI3 2 136 +10=MOD:00083 Me3K-10-HIST2H3A [nucleoplasm]
R-HSA-4754191 GO:0005654 Q71DI3 2 136 +10=MOD:00083 Me3K10-HIST2H3A [nucleoplasm]
R-HSA-4754188 GO:0005654 P68431 2 136 +10=MOD:00083 Me3K10-HIST1H3A [nucleoplasm]
R-HSA-427734 GO:0005654 P68431 2 136 +10=MOD:00083 Me3K-10-HIST1H3A [nucleoplasm]
R-HSA-4724280 GO:0005654 Q71DI3 2 136 +10=MOD:00084 Me2K10-HIST2H3A [nucleoplasm]
R-HSA-427407 GO:0005654 Q71DI3 2 136 +10=MOD:00084 Me2K-10-HIST2H3A [nucleoplasm]
R-HSA-212253 GO:0005654 Q71DI3 2 136 +28=MOD:00083 Me3K-28-HIST2H3A [nucleoplasm]
R-HSA-4754178 GO:0005654 Q71DI3 2 136 +28=MOD:00083 Me3K28-HIST2H3A [nucleoplasm]
R-HSA-4754169 GO:0005654 P68431 2 136 +28=MOD:00083 Me3K28-HIST1H3A [nucleoplasm]
R-HSA-212220 GO:0005654 P68431 2 136 +28=MOD:00083 Me3K-28-HIST1H3A [nucleoplasm]
Sorry, one more of this type (identical, or identical except for a trivial difference). This time it's an extra space in one vs the other.
R-HSA-60306 GO:0005829 P30419 1 496 NMT 1 [cytosol]
R-HSA-2649002 GO:0005829 P30419 1 496 NMT1 [cytosol]
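For what it's worth, the three kinds of trivial difference flagged so far (letter case, a '-', an extra space) could probably be caught in one pass by normalizing the name before comparing. A minimal sketch in Python, assuming each record is laid out as in the lines above (identifier, GO term, accession, start, end, optional `+pos=MOD:...` fields, name, `[compartment]`). The function names are made up for illustration, and removing every hyphen may well be too aggressive for some names, so the output is only a list of candidates to review:

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Lowercase the name and drop hyphens and whitespace."""
    return re.sub(r"[-\s]+", "", name).lower()


def parse_line(line):
    """Return (identifier, comparison key) for one record.

    Assumed layout (inferred from the examples in this thread):
    ID GO ACCESSION START END [+pos=MOD ...] NAME [compartment]
    """
    body, _, _compartment = line.rpartition(" [")
    fields = body.split()
    ewas_id, go, accession, start, end = fields[:5]
    rest = fields[5:]
    # Assumption: modification fields are the ones starting with '+'.
    mods = tuple(f for f in rest if f.startswith("+"))
    name = " ".join(f for f in rest if not f.startswith("+"))
    return ewas_id, (go, accession, start, end, mods, normalize_name(name))


def near_duplicates(lines):
    """Group identifiers whose records match after name normalization."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:
            ewas_id, key = parse_line(line)
            groups[key].append(ewas_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Calling `near_duplicates` on the lines of the attached file (read as plain text) would return one list of identifiers per suspicious group.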
Progress so far: all pairs (and the one trio) of EWASs flagged above as duplicates (or triplicates) have been assembled into a Google spreadsheet (david_and_peter_notes > PRO-Reactome > GitHub_155_reactome_EWAS_identical). All flagged items truly are duplicates. In all but two cases the duplicate instances have been merged into single ones, as indicated by the colors on the spreadsheet entries: the green-shaded instance has been kept, the pink one has been deleted, and all other instances (reactions, sets, complexes) that pointed to the deleted one have been redirected to the one that was kept.
The two cases not yet fixed (grayish shading on the spreadsheet) were created because a curator wanted to annotate the function of a protein that is activated by a conformational change, and Reactome lacks tools to distinguish alternate conformations of the identical polypeptide. Specifically, one of the subunits of a protein heterotrimer has kinase activity when phosphorylated and not otherwise. However, while the phosphorylation induces conformational changes in all three subunits, not only the one with kinase activity, we don't need to capture that information to distinguish the two forms of the heterotrimer and to associate the phospho-form with the activity. This allows the last two pairs of duplicates to be cleaned up.
Notes for a future re-annotation of AMPK heterotrimer complexes to provide a single consistent Reactome view: The problem reaction (now patched) was R-HSA-200423 and the complex that catalyzes it, AMPK heterotrimer (active) [cytosol]. Problem-free reaction and complex are R-HSA-9619515 and p-AMPK heterotrimer [cytosol] (from Marija Milacik) and the older pair from Steve Jupe on which she modeled hers: reaction R-HSA-200421 and complex R-HSA-380934. Relevant literature includes Hurley et al. 2005, Thornton et al. 2011 and Mairet-Coello et al. 2013. Questions for the future:
The issue of conformational change comes up again in #156, and regardless of how the issue of duplicates gets resolved at Reactome we'll still need to discuss how to handle such things in PRO.
Though I'm not yet set up for it, the detection of duplicate complexes and sets that arise due to this cleanup will be an automatic outcome from my processing, so if you are okay with waiting I can provide a list of suspicious cases at the relevant time in the future (as opposed to you detecting them *shudder* by hand).
> so if you are okay with waiting
Still working on the backlog (and there's a Rhea backlog) so I'm fine with waiting.
> conformational change
If the total number of cases is small then, as in the AMPK case here, I'd like to find a re-annotation that does not require either Reactome or PRO to keep two instances of something that look identical according to our data models, even if there are safe ways to do the keeping.
> detection of duplicate complexes
NOT by hand, for sure. We also have QA scripts that should do this, but they have been misled so far because set 1 was composed of the first member of each duplicate pair while set 2 was composed of the second member, so the sets looked different. Between our scripts, now that we will be feeding them more accurate information, and yours, we should be able to avoid manual searching.
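To illustrate why merging the EWASs should make such duplicate sets visible, here is a rough sketch in Python (not the actual QA scripts). It assumes a mapping from each deleted instance to the one that was kept. The name `merged_into`, the set identifiers and the choice of which instance survives are all hypothetical:

```python
from collections import defaultdict


def canonical_members(members, merged_into):
    """Replace each deleted duplicate with the instance that was kept."""
    return frozenset(merged_into.get(m, m) for m in members)


def duplicate_sets(sets, merged_into):
    """Return groups of set identifiers whose canonical members coincide."""
    by_members = defaultdict(list)
    for set_id, members in sets.items():
        by_members[canonical_members(members, merged_into)].append(set_id)
    return [ids for ids in by_members.values() if len(ids) > 1]


# Hypothetical example: which instance was kept is NOT taken from the spreadsheet.
merged_into = {
    "R-HSA-8848151": "R-HSA-8948441",  # SOCS3 Gene -> SOCS3 gene
    "R-HSA-6797760": "R-HSA-8948424",  # BIRC5 Gene -> BIRC5 gene
}
sets = {
    "SET-A": ["R-HSA-8948441", "R-HSA-8948424"],  # first member of each pair
    "SET-B": ["R-HSA-8848151", "R-HSA-6797760"],  # second member of each pair
}
print(duplicate_sets(sets, merged_into))
```

With an empty `merged_into` the two sets look different, which is the situation described above. Once the members are mapped to the kept instances, the two sets compare equal.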
Agreed. We can discuss exactly how it could be done at a later date, but one idea I have (so I don't forget) is to come up with (or create) ontological terms that indicate states such as 'activated' or 'unfolded' etc.
Clean-up of all duplicate EWASs on this ticket is DONE!
Woohoo!
There are a number of EWAS (protein or otherwise) that seem to be fully identical except for Reactome identifier. A few examples:
There are about 100 such sets. Other than the Reactome identifier, these (at least in the few cases I checked) seem to differ only with respect to the reactions they are annotated to. I know not all of these are in scope for PRO, but they should probably be looked at regardless, as some might be truly redundant while others might need to be revised. See the attached tab-separated file, reactome_EWAS_identical.txt.