geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Reactome: identical EWAS #155

Closed nataled closed 2 years ago

nataled commented 2 years ago

There are a number of EWAS (protein or otherwise) that seem to be fully identical except for Reactome identifier. A few examples:

R-HSA-6801535   GO:0005576      P25774  115     331                     CTSS [extracellular region]
R-HSA-2228666   GO:0005576      P25774  115     331                     CTSS [extracellular region]

R-HSA-6782524   GO:0005654      P0CG48  1       76      +48=MOD:01148           Ub-48-UBC(1-76) [nucleoplasm]
R-HSA-5683033   GO:0005654      P0CG48  1       76      +48=MOD:01148           Ub-48-UBC(1-76) [nucleoplasm]

R-HSA-8938235   GO:0005654      ENSG00000164400 1       -1                      CSF2 gene [nucleoplasm]
R-HSA-6785023   GO:0005654      ENSG00000164400 1       -1                      CSF2 gene [nucleoplasm]

R-HSA-5687589   GO:0005886      O95436  1       690     +76=MOD:01637           SLC34A2 Q76* [plasma membrane]
R-HSA-5651668   GO:0005886      O95436  1       690     +76=MOD:01637           SLC34A2 Q76* [plasma membrane]

There are about 100 such sets. Other than the Reactome identifier, these seem (at least in the few cases I checked) to differ only in the reactions they are annotated to. I know not all of these are in scope for PRO, but they should likely be looked at regardless, as some might be truly redundant while others might need to be revised. See the attached tab-separated file: reactome_EWAS_identical.txt
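
For illustration, a minimal sketch of how such sets could be found, assuming each row is tab-separated with the Reactome identifier in the first column and every other column taking part in the comparison. The function name `find_identical_ewas` and the exact column layout of the sample rows are assumptions, not taken from the attached file.

```python
from collections import defaultdict


def find_identical_ewas(lines):
    """Group rows that match in every column except the first (the Reactome ID)."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank separator lines
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        reactome_id, rest = fields[0], tuple(fields[1:])
        groups[rest].append(reactome_id)
    # keep only keys shared by two or more identifiers
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Sample rows; the tab positions here are an assumed layout.
SAMPLE = "\n".join([
    "R-HSA-6801535\tGO:0005576\tP25774\t115\t331\t\tCTSS [extracellular region]",
    "R-HSA-2228666\tGO:0005576\tP25774\t115\t331\t\tCTSS [extracellular region]",
    "R-HSA-60306\tGO:0005829\tP30419\t1\t496\t\tNMT 1 [cytosol]",
])

for key, ids in find_identical_ewas(SAMPLE.splitlines()).items():
    print(ids, key[-1])
```

Grouping on the whole remaining row means any difference, including a trivial one in the name, keeps two rows apart.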

nataled commented 2 years ago

I'm adding the following to this list because the difference between each pair is a matter of different casing for a single letter in the name:

R-HSA-8948441   GO:0005654      ENSG00000184557 1       -1                      SOCS3 gene [nucleoplasm]
R-HSA-8848151   GO:0005654      ENSG00000184557 1       -1                      SOCS3 Gene [nucleoplasm]

R-HSA-4568743   GO:0005654      P68431  2       136     +15=MOD:00064           AcK15-HIST1H3A [nucleoplasm]
R-HSA-4549217   GO:0005654      P68431  2       136     +15=MOD:00064           Ack15-HIST1H3A [nucleoplasm]

R-HSA-4568755   GO:0005654      Q71DI3  2       136     +15=MOD:00064           AcK15-HIST2H3A [nucleoplasm]
R-HSA-4549224   GO:0005654      Q71DI3  2       136     +15=MOD:00064           Ack15-HIST2H3A [nucleoplasm]

R-HSA-6797760   GO:0005654      ENSG00000089685 1       -1                      BIRC5 Gene [nucleoplasm]
R-HSA-8948424   GO:0005654      ENSG00000089685 1       -1                      BIRC5 gene [nucleoplasm]

nataled commented 2 years ago

Also should add the following cases that appear to be identical except for the presence/absence of a '-' in the name:

R-HSA-427744    GO:0005654      Q71DI3  2       136     +10=MOD:00083           Me3K-10-HIST2H3A [nucleoplasm]
R-HSA-4754191   GO:0005654      Q71DI3  2       136     +10=MOD:00083           Me3K10-HIST2H3A [nucleoplasm]

R-HSA-4754188   GO:0005654      P68431  2       136     +10=MOD:00083           Me3K10-HIST1H3A [nucleoplasm]
R-HSA-427734    GO:0005654      P68431  2       136     +10=MOD:00083           Me3K-10-HIST1H3A [nucleoplasm]

R-HSA-4724280   GO:0005654      Q71DI3  2       136     +10=MOD:00084           Me2K10-HIST2H3A [nucleoplasm]
R-HSA-427407    GO:0005654      Q71DI3  2       136     +10=MOD:00084           Me2K-10-HIST2H3A [nucleoplasm]

R-HSA-212253    GO:0005654      Q71DI3  2       136     +28=MOD:00083           Me3K-28-HIST2H3A [nucleoplasm]
R-HSA-4754178   GO:0005654      Q71DI3  2       136     +28=MOD:00083           Me3K28-HIST2H3A [nucleoplasm]

R-HSA-4754169   GO:0005654      P68431  2       136     +28=MOD:00083           Me3K28-HIST1H3A [nucleoplasm]
R-HSA-212220    GO:0005654      P68431  2       136     +28=MOD:00083           Me3K-28-HIST1H3A [nucleoplasm]

nataled commented 2 years ago

Sorry, one more of this type (identical, or identical except for a trivial difference). This time it's an extra space in one vs the other.

R-HSA-60306     GO:0005829      P30419  1       496                     NMT 1 [cytosol]
R-HSA-2649002   GO:0005829      P30419  1       496                     NMT1 [cytosol]
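
The three kinds of trivial difference listed so far (letter case, a '-' in the name, an extra space) could be caught with one normalization step. This is a rough sketch, assuming rows have already been split into fields with the identifier first and the name last; `normalize_name` and `find_near_identical` are hypothetical helper names. The normalization is deliberately aggressive, so its output is a list of candidates to review, not confirmed duplicates.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Fold case and drop hyphens and whitespace so trivial variants compare equal."""
    return re.sub(r"[-\s]", "", name).casefold()


def find_near_identical(rows):
    """rows: tuples of (reactome_id, ...other fields..., name)."""
    groups = defaultdict(list)
    for reactome_id, *middle, name in rows:
        groups[(*middle, normalize_name(name))].append((reactome_id, name))
    # report only groups whose raw names differ; exact duplicates are a separate case
    return {
        key: members
        for key, members in groups.items()
        if len(members) > 1 and len({n for _, n in members}) > 1
    }


ROWS = [
    ("R-HSA-8948441", "GO:0005654", "ENSG00000184557", "1", "-1", "", "SOCS3 gene [nucleoplasm]"),
    ("R-HSA-8848151", "GO:0005654", "ENSG00000184557", "1", "-1", "", "SOCS3 Gene [nucleoplasm]"),
    ("R-HSA-427744", "GO:0005654", "Q71DI3", "2", "136", "+10=MOD:00083", "Me3K-10-HIST2H3A [nucleoplasm]"),
    ("R-HSA-4754191", "GO:0005654", "Q71DI3", "2", "136", "+10=MOD:00083", "Me3K10-HIST2H3A [nucleoplasm]"),
    ("R-HSA-60306", "GO:0005829", "P30419", "1", "496", "", "NMT 1 [cytosol]"),
    ("R-HSA-2649002", "GO:0005829", "P30419", "1", "496", "", "NMT1 [cytosol]"),
]

for members in find_near_identical(ROWS).values():
    print([rid for rid, _ in members])
```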

deustp01 commented 2 years ago

Progress so far: all pairs (and the one trio) of EWASs flagged above as duplicates (triplicates) have been assembled into a Google spreadsheet (david_and_peter_notes > PRO-Reactome > GitHub_155_reactome_EWAS_identical). All flagged items truly are duplicates, and in all but two cases the duplicate instances have been merged into single ones, as indicated by the colors on the spreadsheet entries: the green-shaded instance has been kept, while the pink one has been deleted and all other instances (reactions, sets, complexes) that pointed to it have been redirected to the one that was kept.

The two cases not yet fixed (grayish shading on the spreadsheet) were created because a curator wanted to annotate the function of a protein that is activated by a conformational change, and Reactome lacks tools to distinguish alternate conformations of the identical polypeptide. Specifically, one of the subunits of a protein heterotrimer has kinase activity when phosphorylated and not otherwise. However, while the phosphorylation induces conformational changes in all three subunits, not only the one with kinase activity, we don't need to capture that information to distinguish the two forms of the heterotrimer and to associate the phospho-form with the activity. This allows the last two pairs of duplicates to be cleaned up.

Notes for a future re-annotation of AMPK heterotrimer complexes to provide a single consistent Reactome view: The problem reaction (now patched) was R-HSA-200423 and the complex that catalyzes it, AMPK heterotrimer (active) [cytosol]. Problem-free reaction and complex are R-HSA-9619515 and p-AMPK heterotrimer [cytosol] (from Marija Milacik) and the older pair from Steve Jupe on which she modeled hers: reaction R-HSA-200421 and complex R-HSA-380934. Relevant literature includes Hurley et al. 2005, Thornton et al. 2011 and Mairet-Coello et al. 2013. Questions for the future:

  1. Does a single AMPK heterotrimer whose components are sets of all possible alpha, beta, and gamma subunits (and an activated version with phospho-alpha subunits) capture all the relevant information, or do we need versions of the complex with individual alpha, beta, and gamma EWASs as subunits to annotate different AMPK-mediated events in different pathways?
  2. Should the p-AMPK heterotrimer have component "p-AMPK alpha" (set or EWAS) as its active unit?

nataled commented 2 years ago

The issue of conformational change comes up again in #156, and regardless of how the issue of duplicates gets resolved at Reactome we'll still need to discuss how to handle such things in PRO.

Though I'm not yet set up for it, the detection of duplicate complexes and sets that arise due to this cleanup will be an automatic outcome from my processing, so if you are okay with waiting I can provide a list of suspicious cases at the relevant time in the future (as opposed to you detecting them *shudder* by hand).

deustp01 commented 2 years ago

> so if you are okay with waiting

Still working on the backlog (and there's a Rhea backlog) so I'm fine with waiting.

> conformational change

If the total number of cases is small then, as in the AMPK case here, I'd like to find a re-annotation that does not require either Reactome or PRO to keep two instances of something that look identical according to our data models, even if there are safe ways to do the keeping.

> detection of duplicate complexes

NOT by hand, for sure. We also have QA scripts that should do this, but they have been misled so far: set 1 was composed of the first member of each duplicate pair while set 2 was composed of the second member, so the sets looked different. Between our scripts (now that we will be feeding them more accurate information) and yours, we should be able to avoid manual searching.
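
To make the failure mode concrete, here is a small sketch (not the actual Reactome QA scripts, whose internals are not shown in this thread) of comparing sets after mapping every deleted member to the instance it was merged into. All identifiers and the names `canonical_members` and `duplicate_sets` are hypothetical.

```python
from collections import defaultdict


def canonical_members(members, merged_into):
    """Replace each merged-away ID with the ID that was kept, following chains."""
    def resolve(member):
        seen = set()
        while member in merged_into and member not in seen:
            seen.add(member)  # guard against accidental cycles in the map
            member = merged_into[member]
        return member
    return frozenset(resolve(m) for m in members)


def duplicate_sets(sets_by_id, merged_into):
    """Return groups of set IDs whose members are identical after remapping."""
    groups = defaultdict(list)
    for set_id, members in sets_by_id.items():
        groups[canonical_members(members, merged_into)].append(set_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical data: A2 and B2 are duplicates merged into A1 and B1.
SETS = {
    "SET-1": {"A1", "B1"},
    "SET-2": {"A2", "B2"},
    "SET-3": {"A1", "C1"},
}
MERGED_INTO = {"A2": "A1", "B2": "B1"}

print(duplicate_sets(SETS, {}))           # no merge information: nothing is flagged
print(duplicate_sets(SETS, MERGED_INTO))  # with the merge map, SET-1 and SET-2 match
```

Without the merge map the two sets share no member IDs, which is why a comparison on raw identifiers misses them.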

nataled commented 2 years ago

Agreed. We can discuss exactly how it could be done at a later date, but one idea I have (so I don't forget) is to come up with (or create) ontological terms that indicate states such as 'activated' or 'unfolded' etc.

deustp01 commented 2 years ago

Clean-up of all duplicate EWASs on this ticket is DONE!

nataled commented 2 years ago

Woohoo!