geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Reactome: EWASes to be deleted #145

Closed nataled closed 2 years ago

nataled commented 2 years ago

I've generated a list of Reactome entries that, based on conversation in #144, could be deleted from Reactome. Each of these represents an 'alternative view' to another entry. For example, in R-HSA-2997644, the 'target' BRCA1 is modified at position 109 by the 'modifier' SUMO1, the latter of which is attached at its C-terminal end, position 97. This is how we typically think of a SUMO-modified protein. However, because SUMO1 is also a protein, there is another entry, R-HSA-3730735, that shows the same entity from the complementary/reciprocal SUMO1 viewpoint; that is, SUMO1 is modified at position 97 by BRCA1, which is attached at its position 109. As these show precisely the same entity, the latter is redundant.

In the attached file reactome_find_ubl_complements.txt there are cases where the reciprocal relationship is a FULL_MATCH, cases where it is a PARTIAL_MATCH, and cases where it is an IMPRECISE_MATCH. To describe these, I'll use the following terminology: (1) the 'primary' EWAS is the one described above as 'typical'--it is the one in which SUMO (and other modifier proteins) is attached to some primary target protein; and (2) the 'secondary' EWAS is the one that shows the perspective from the modifier point of view (SUMO and other modifier proteins). With the above in mind:

  1. A full match is one where the primary and secondary fully agree, and thus describe the exact same entity. In all such cases the primary has only a single modification, and thus only a single secondary is needed to describe the alternative viewpoint.

  2. A partial match is one where the primary has multiple secondaries; this arises when the primary has more than one modified position. In such cases, each modified position is described by its own secondary EWAS. (Note that in all such cases, the secondary EWAS is not properly described, since it is presented as if the primary molecule was unmodified elsewhere.)

  3. An imprecise match is one in which the primary EWAS uses an isoform identifier (e.g., '-1') as the modified target, but the reciprocal secondary EWAS refers to its version of that primary protein as if it was not an isoform.
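The three categories above can be sketched as a small decision function. This is only an illustration of the definitions, not the script that produced the file: the function name, the `NO_MATCH` label, the isoform accession `P38398-1`, and the choice to test for an isoform mismatch before counting sites are all assumptions.

```python
def classify(primary_accession, primary_site_count, secondary_target_accession):
    """Label a primary/secondary pair using the three definitions above.

    primary_accession: accession of the primary (target) EWAS, possibly with
        an isoform suffix such as '-1'.
    primary_site_count: number of modified positions on the primary.
    secondary_target_accession: the accession the secondary EWAS uses when it
        refers back to the primary protein.
    """
    if primary_accession != secondary_target_accession:
        # The secondary dropped the isoform suffix used by the primary.
        if primary_accession.split("-")[0] == secondary_target_accession:
            return "IMPRECISE_MATCH"
        return "NO_MATCH"  # hypothetical label for pairs needing manual review
    if primary_site_count > 1:
        # One secondary per modified position on the primary.
        return "PARTIAL_MATCH"
    return "FULL_MATCH"


print(classify("P38398", 1, "P38398"))    # FULL_MATCH
print(classify("Q00653", 4, "Q00653"))    # PARTIAL_MATCH
print(classify("P38398-1", 1, "P38398"))  # IMPRECISE_MATCH
```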

There are a few other cases that could not be automatically mapped and were therefore manually checked. These are at the bottom of the file.

In all cases above, no attempt was made to verify that the compartments matched. This is because, with only two exceptions, there was only one entry for each viewpoint. The two exceptions can be found by searching for a '/'.

Below I describe the syntax used in the file. It's somewhat complex--and can be ignored if desired--but since there's useful information there I included it. The complicated part is presented after the '==' on each line. That part shows the modifications and the protein being modified. For example:

== P63165 +97=MOD:01149+97=CHEBI:30777+97=UniProt:P38398[109] vs P38398 +109=MOD:01149+109=CHEBI:24411+109=UniProt:P63165[97]

The part before 'vs' refers to the point of view of the modifier protein; that is, the 'secondary' case. It begins with the accession of the modifier protein (after the '=='); in this case, it refers to SUMO1. The number after the '+' is the position of modification and the identifier after the '=' is the modification. The position is repeated for every modification indicated at that position, so if it takes three 'modifications' to fully describe what is, in effect, a single modification, then there will be three modifications shown for the one position. Thus, when looking at SUMO1 as in the example above, the C-terminal glycine has three sub-modifications: (1) MOD:01149 (sumoylated lysine); (2) CHEBI:30777 (lysyl group); and (3) UniProt:P38398[109] (which shows that position 109 of BRCA1 is attached to the SUMO1 protein). Note that the first two modifications shown are very obviously incorrect since the modified position is a glycine from this point of view; this seems to be a common problem for these entries and further illustrates why these should be deleted.

After the 'vs' you'll see the EWAS from the point of view of the primary protein. Note that the same point of attachment (+109) is shown as was given in brackets when viewed from the secondary protein perspective. The first sub-modification is MOD:01149 (sumoylated lysine) which, in this case, is correct (that is, position 109 of BRCA1 is a lysine that is being sumoylated). The second is CHEBI:24411 (glycyl group) which, again, is correct (because the position of SUMO1 that is specifically attached is a glycine). Finally, the UniProtKB accession for SUMO1 is given, and it shows that the modifier protein is attached at its position 97 (glycine). Unlike the secondary EWAS, this is described correctly.
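For anyone who wants to read the file programmatically, here is a minimal parsing sketch for the part after the '=='. It assumes every line follows the layout of the example above; the function names and the regular expression are my own and are not taken from the script that generated the file.

```python
import re
from collections import defaultdict

# One '+position=modification' token; modifications never contain '+'
# in the example line (an assumption about the rest of the file).
TOKEN = re.compile(r"\+(\d+)=([^+]+)")


def parse_ewas(text):
    """Split 'ACCESSION +pos=mod+pos=mod...' into (accession, {pos: [mods]})."""
    accession, _, mods = text.strip().partition(" ")
    by_position = defaultdict(list)
    for position, modification in TOKEN.findall(mods):
        by_position[int(position)].append(modification)
    return accession, dict(by_position)


def parse_line(line):
    """Return (secondary, primary) parsed from a '== A ... vs B ...' line."""
    body = line.split("==", 1)[1]
    secondary, primary = body.split(" vs ")
    return parse_ewas(secondary), parse_ewas(primary)


line = ("== P63165 +97=MOD:01149+97=CHEBI:30777+97=UniProt:P38398[109] "
        "vs P38398 +109=MOD:01149+109=CHEBI:24411+109=UniProt:P63165[97]")
secondary, primary = parse_line(line)
print(secondary)  # SUMO1 view: three sub-modifications at position 97
print(primary)    # BRCA1 view: three sub-modifications at position 109
```

Grouping by position keeps the three sub-modifications of one attachment together, which makes it easy to spot the mismatched MOD/CHEBI terms on the secondary side.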

For all cases of PARTIAL_MATCH, there will be lines that duplicate the primary. In the last set of partials in the file, R-HSA-4755500 is shown four times. This is because that primary has four SUMO1 modifications. Note that these have all been separated out so that the correct complements can be determined and assessed. For completeness, below is given the actual string to describe R-HSA-4755500:

Q00653 +90=MOD:01149+90=CHEBI:24411+90=UniProt:P63165[97]+298=MOD:01149+298=CHEBI:24411+298=UniProt:P63165[97]+689=MOD:01149+689=CHEBI:24411+689=UniProt:P63165[97]+863=MOD:01149+863=CHEBI:24411+863=UniProt:P63165[97]

This reveals that there are four SUMO1 modifications, at positions 90, 298, 689, and 863. Accordingly, there are four lines that map each of these to the corresponding secondary.
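As a rough sketch of how the four lines arise, the string can be split per modified position and the attached modifier read off each `UniProt:ACC[pos]` token. The helper names below are hypothetical, and the string is rebuilt from the four positions rather than pasted in full.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\+(\d+)=([^+]+)")
PARTNER = re.compile(r"UniProt:([A-Z0-9-]+)\[(\d+)\]")


def expected_secondaries(ewas):
    """Return (accession, [(primary_pos, modifier_acc, modifier_pos), ...])."""
    accession, _, mods = ewas.strip().partition(" ")
    sites = defaultdict(list)
    for pos, mod in TOKEN.findall(mods):
        sites[int(pos)].append(mod)
    pairs = []
    for pos in sorted(sites):
        for mod in sites[pos]:
            match = PARTNER.fullmatch(mod)
            if match:  # only the attached-protein token names the modifier
                pairs.append((pos, match.group(1), int(match.group(2))))
    return accession, pairs


# Rebuilds the R-HSA-4755500 string shown above from its four positions.
ewas = "Q00653 " + "".join(
    f"+{pos}=MOD:01149+{pos}=CHEBI:24411+{pos}=UniProt:P63165[97]"
    for pos in (90, 298, 689, 863)
)
accession, pairs = expected_secondaries(ewas)
for primary_pos, modifier, modifier_pos in pairs:
    print(f"{accession}[{primary_pos}] <-> {modifier}[{modifier_pos}]")
```

Each printed pair corresponds to one secondary EWAS that would have to exist to give the complementary view of that single site.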

This is a lot to unpack, I'm afraid, so if a call is desired we can arrange one.

deustp01 commented 2 years ago

@nataled An item for later this week.

deustp01 commented 2 years ago

I've converted Darren's file reactome_find_ubl_complements.txt to a Google doc and put it in Gene Ontology > GO-CAM and Noctua > Reactome and Pathway Mapping > Reactome2GO > David_and_Peter_notes > PRO_Reactome. I'll work on that version (which we all should have edit access to).

deustp01 commented 2 years ago

I’ve worked through the whole list, reactome_find_ubl_complements.txt, and marked up the Google Doc version of the list to show the results.

As far as I can tell, all of these instances were created at a point when we did not have the tools in Reactome to annotate a modified residue of a target protein where the modifying group was itself a small protein like SUMO. Instead we created one or more SUMO instances with modified residues (choosing something plausible from the psiMod ontology) and an instance of the target protein similarly modified, then assembled all of these instances into a complex to approximate the attachment of one or more SUMOs to the target protein. The cases you’ve labeled full matches are ones where one SUMO is associated with a target protein; the ones labeled partial matches are ones where two or more SUMOs are associated with a single target protein.

In all or almost all cases (where “should be deleted” is highlighted in green), the annotation has now been redone: the complex has been replaced with a target protein instance that has one or more side chains modified by the attachment of a SUMO group. As a result, the EWAS instances you have listed, and the complex instances that they are parts of, are no longer used in any Reactome reactions. They do not need PRO IDs and in fact should be removed from our central database once two final sanity checks are done. First, Bruce May, the curator who did the original annotation and the re-annotation, should confirm that I’ve understood the history and current state of annotation correctly. Second, there are two full match cases, and one group of partial match cases, where it looks like the old-style EWASs and complexes are still in use (“should be deleted” is highlighted in yellow). Bruce, could you look at these? Are they items missed in the clean-up that can now be cleaned up like all the others, or is something else going on here?

Likewise in the complex cases at the end of the list, in three cases it looks like old-style EWASs and complexes can safely be deleted if Bruce agrees (green shading) and in one case, Bruce’s advice on the current status of the instances is needed (yellow shading).

Bruce, if you can check for correctness, I can easily make the deletions in gk_central.

nataled commented 2 years ago

From Bruce May, via email:

We have finished re-annotating SUMO-modified proteins. The new annotations of sumoylated lysines use GroupModifiedResidues containing SUMO Polymers. Previously, complexes containing interchain crosslinked target proteins and SUMOs were used. The complexes have been deleted and the modified SUMO components of the complexes have been deleted. A total of 313 modified SUMOs were deleted. I have compared your list of problematic entities with the updated version of our database. ModifiedSUMO-DeletionsVsDarrensList-220129-1.txt All but 3 entities on your list have been deleted. (See the attached spreadsheet.) The 3 remaining entities are SUMO proteins that are conjugated to E1 (UBA2) or E2 (UBE2I) proteins via thioester bonds between the terminal glycine of the SUMO and a cysteine of the E1 or E2.

nataled commented 2 years ago

My response, via email:

This is very good, thanks! I do have a question regarding the three that were not deleted, taking one as an example. Does R-HSA-3730616 ("UBA2-G97-SUMO1 [nucleoplasm]"), whose modification is described as "Inter-chain Crosslink via S-(glycyl)-L-cysteine (Cys-Gly) at 97 and 173", differ from R-HSA-3730628 ("SUMO1-C173-UBA2 [nucleoplasm]"), whose modification is described as "Inter-chain Crosslink via S-(glycyl)-L-cysteine (Cys-Gly) at 173 and 97"? So far as I can tell, they are the same overall entity, just described from different viewpoints. The question is more than academic because we are endeavoring to represent each distinct entity only once in PRO, and I'll need to know if the examples I mention are one entity or two. This is not an argument in favor of deletion by the way--even if they are the same entity--because Reactome uses different criteria than PRO for EWASes and these might be perfectly valid as separate entries for Reactome.
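One way to frame the "one entity or two" question is to key each crosslink on the unordered pair of (accession, position) endpoints, so that both viewpoints reduce to the same key. This is only a sketch of that idea; `bond_key` is a hypothetical helper, not something PRO or Reactome provides. The accessions are those given elsewhere in this thread (P63165 for SUMO1, Q9UBT2 for UBA2).

```python
def bond_key(acc_a, pos_a, acc_b, pos_b):
    """Order-independent key for a crosslink between two protein positions."""
    return frozenset({(acc_a, pos_a), (acc_b, pos_b)})


# "Cys-Gly at 97 and 173" versus "Cys-Gly at 173 and 97"
sumo1_view = bond_key("P63165", 97, "Q9UBT2", 173)
uba2_view = bond_key("Q9UBT2", 173, "P63165", 97)
print(sumo1_view == uba2_view)  # True
```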

deustp01 commented 2 years ago

Good luck!

nataled commented 2 years ago

From Bruce May, via email:

Hi Darren Indeed, R-HSA-3730616 and R-HSA-3730628 refer to the same bond between UBA2 and SUMO1, a thioester linkage between an internal cysteine of UBA2 and the C-terminal glycine of SUMO1. R-HSA-3730616 is the modification on the SUMO1 and R-HSA-3730628 is the modification on the UBA2. We have represented the UBA2-SUMO conjugates and the UBE2I-SUMO conjugates as complexes. Each complex contains a UBA2 or UBE2I entity and a SUMO entity. Here is a list of the complexes, their component entities, and the interchain crosslinks:

SUMO1:UBA2 [nucleoplasm] 2993786 contains
  SUMO1-C173-UBA2 3730628 with interchain crosslink 2993752
  UBA2-G97-SUMO1 3730616 with interchain crosslink 3465551

SUMO2:UBA2 [nucleoplasm] 2993776 contains
  SUMO2-C173-UBA2 3730612 with interchain crosslink 2993795
  UBA2-G93-SUMO2 3730618 with interchain crosslink 3465547

SUMO3:UBA2 [nucleoplasm] 2993765 contains
  SUMO3-C173-UBA2 3730623 with interchain crosslink 2993788
  UBA2-G92-SUMO3 3730627 with interchain crosslink 3465553

SUMO1:C93-UBE2I [cytosol] 4656922 contains
  SUMO1-C93-UBE2I 4656909 with interchain crosslink 2993792
  UBE2I-G97-SUMO1 4656941 with interchain crosslink 3465539

SUMO1:C93-UBE2I [nucleoplasm] 2993783 contains
  SUMO1-C93-UBE2I 3730617 with interchain crosslink 2993792
  UBE2I-G97-SUMO1 3730622 with interchain crosslink 3465539

SUMO2:UBE2I [cytosol] 4656951 contains
  SUMO2-C93-UBE2I 4656944 with interchain crosslink 2993755
  UBE2I-G93-SUMO2 4656928 with interchain crosslink 3465542

SUMO2:UBE2I [nucleoplasm] 2993778 contains
  SUMO2-C93-UBE2I 3730619 with interchain crosslink 2993755
  UBE2I-G93-SUMO2 3730625 with interchain crosslink 3465542

SUMO3:UBE2I [cytosol] 4656955 contains
  SUMO3-C93-UBE2I 4656938 with interchain crosslink 2993770
  UBE2I-G92-SUMO3 4656920 with interchain crosslink 3465536

SUMO3:UBE2I [nucleoplasm] 2993782 contains
  SUMO3-C93-UBE2I 3730629 with interchain crosslink 2993770
  UBE2I-G92-SUMO3 3730611 with interchain crosslink 3465536

Sumo1:C93-Ube2i mouse [nucleoplasm] 3232135 contains
  Sumo1-C93-Ube2i 3730632 with interchain crosslink 3232132
  Ube2i-G97-Sumo1 3730634 with interchain crosslink 3730633

Sumo1:Ube2i rat [nucleoplasm] 3927914 contains
  Sumo1-C93-Ube2i 3927885 with interchain crosslink 3927905
  Ube2i-G97-Sumo1 3927951 with interchain crosslink 3927951

Sumo2:Ube2i mouse [nucleoplasm] 3215019 contains
  Sumo2-C93-Ube2i 3730635 with interchain crosslink 3214919
  Ube2i-G93-Sumo2 3730638 with interchain crosslink 3730636

Sumo3:Ube2i mouse [nucleoplasm] 3232151 contains
  Sumo3-C93-Ube2i 3730779 with interchain crosslink 3232125
  Ube2i-G92-Sumo3 3730772 with interchain crosslink 3730773

Best Regards, Bruce

nataled commented 2 years ago

My response, via email:

Understood. I presume these are treated differently than other SUMO-modified proteins because, in this case, SUMO is actually the target (for activation by UBA2) rather than the modifier.

Peter, if that's the case, then for these I should actually avoid making a PRO term for R-HSA-3730628 (modification on the UBA2). Can you confirm?

broose-may commented 2 years ago

Hi Darren and Peter Yes, that's correct. The UBA2 and UBE2I reactions and complexes activate SUMO1,2,3 via a thioester linkage prior to conjugation of SUMO1,2,3 to the lysine of the target protein.

nataled commented 2 years ago

@broose-may I assume this is how it works for other ubl modifiers as well, such as NEDDylation? I do see some cases like that as well. Would it be possible to get a list of UniProtKB accessions for all such activators? I can then use that list to figure out which version of activator+modifier should be represented in PRO.

broose-may commented 2 years ago

Yes, this is how other ubl modifiers and activators will be annotated. Currently we have a mix of ModifiedResidues and GroupModifiedResidues for ubiquitinylated lysines. They will all be converted to GroupModifiedResidues. The ubiquitin activators (E1 and E2 enzymes) are currently in complexes with ubiquitin. The crosslinks (InterchainCrosslinkedResidues) must be added. The crosslinks for the NEDD8 and ISG15 E1 and E2 enzymes are already present. Here is a list of our current UniProt entities:

Ubiquitin E1 enzymes
  UniProt:P22314 UBA1
  UniProt:A0AVT1 UBA6

Ubiquitin E2 enzymes
  UniProt:P49427 CDC34
  UniProt:P49459 UBE2A
  UniProt:P63146 UBE2B
  UniProt:O00762 UBE2C
  UniProt:P51668 UBE2D1
  UniProt:P62837 UBE2D2
  UniProt:P51965 UBE2E1
  UniProt:Q969T4 UBE2E3
  UniProt:P62253 UBE2G1
  UniProt:P60604 UBE2G2
  UniProt:P62256 UBE2H
  UniProt:P61086 UBE2K
  UniProt:P68036 UBE2L3
  UniProt:Q8WVN8 UBE2Q2
  UniProt:Q712K3 UBE2R2
  UniProt:Q16763 UBE2S
  UniProt:Q9NPD8 UBE2T
  UniProt:Q96B02 UBE2W
  UniProt:Q9H832 UBE2Z

NEDD8 E1 enzyme UniProt:Q8TBC4 UBA3

NEDD8 E2 enzymes UniProt:P61081 UBE2M UniProt:Q969M7 UBE2F

ISG15 E1 enzyme UniProt:P41226 UBA7

ISG15 E2 enzyme UniProt:O14933 UBE2L6 (UBCH8)

nataled commented 2 years ago

(Note to self: Q9UBT2 UBA2 belongs under E1 enzymes above)

broose-may commented 2 years ago

Yes, UBA2 is the E1 enzyme for activating SUMO1,2,3 (though not ubiquitin). UBA2 is in a complex with SAE1.

nataled commented 2 years ago

@broose-may thank you. If I understand all this correctly, then there is some mostly-specific interaction between activator protein (UBA1, UBA2, etc) and activated protein (SUMO, NEDD8, etc), and the specificity is thus:

Activated modifier  E1 activator(s)  E2 activator(s)
Ubiquitin           P22314=UBA1      P49427=CDC34
                    A0AVT1=UBA6      P49459=UBE2A
                                     P63146=UBE2B
                                     O00762=UBE2C
                                     P51668=UBE2D1
                                     P62837=UBE2D2
                                     P51965=UBE2E1
                                     Q969T4=UBE2E3
                                     P62253=UBE2G1
                                     P60604=UBE2G2
                                     P62256=UBE2H
                                     P61086=UBE2K
                                     P68036=UBE2L3
                                     Q8WVN8=UBE2Q2
                                     Q712K3=UBE2R2
                                     Q16763=UBE2S
                                     Q9NPD8=UBE2T
                                     Q96B02=UBE2W
                                     Q9H832=UBE2Z

SUMO1               Q9UBT2=UBA2      P63279=UBE2I
SUMO2               Q9UBT2=UBA2      P63279=UBE2I
SUMO3               Q9UBT2=UBA2      P63279=UBE2I

NEDD8               Q8TBC4=UBA3      P61081=UBE2M
                                     Q969M7=UBE2F

ISG15               P41226=UBA7      O14933=UBE2L6 (UBCH8)

Thus, activation of SUMO1 would not happen with, say, UBE2Z.
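For my own bookkeeping, the table can be held as a lookup so that a given activator+modifier pair can be checked when deciding which version to represent in PRO. This is a sketch only: the dictionary layout and function name are assumptions, and the accessions are copied from the table above.

```python
SUMO_ACTIVATORS = {"E1": {"Q9UBT2"}, "E2": {"P63279"}}

# Modifier -> E1/E2 activator accessions, transcribed from the table above.
ACTIVATORS = {
    "Ubiquitin": {
        "E1": {"P22314", "A0AVT1"},
        "E2": {
            "P49427", "P49459", "P63146", "O00762", "P51668", "P62837",
            "P51965", "Q969T4", "P62253", "P60604", "P62256", "P61086",
            "P68036", "Q8WVN8", "Q712K3", "Q16763", "Q9NPD8", "Q96B02",
            "Q9H832",
        },
    },
    "SUMO1": SUMO_ACTIVATORS,
    "SUMO2": SUMO_ACTIVATORS,
    "SUMO3": SUMO_ACTIVATORS,
    "NEDD8": {"E1": {"Q8TBC4"}, "E2": {"P61081", "Q969M7"}},
    "ISG15": {"E1": {"P41226"}, "E2": {"O14933"}},
}


def can_activate(modifier, accession):
    """True if the accession is a listed E1 or E2 activator for the modifier."""
    roles = ACTIVATORS.get(modifier, {})
    return any(accession in members for members in roles.values())


print(can_activate("SUMO1", "Q9H832"))      # False: UBE2Z is listed for ubiquitin only
print(can_activate("Ubiquitin", "Q9H832"))  # True
print(can_activate("SUMO1", "P63279"))      # True: UBE2I
```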

broose-may commented 2 years ago

Hi Darren Exactly.

nataled commented 2 years ago

With the additional information provided regarding the activators, I've revisited my 'can delete' list. I knew that after the first round there were some missed cases (I failed to account for one type of variation). There are only a few handfuls on this new list: reactome_find_ubl_complements_more.txt (https://github.com/geneontology/pathways2GO/files/8036640/reactome_find_ubl_complements_more.txt)

Once accounted for, this ticket can be closed.

broose-may commented 2 years ago

Hi Darren The 7 entities have been deleted. I think the ticket can be closed.
