geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Reactome: normalization of GPI-anchored proteins #153

Closed nataled closed 2 years ago

nataled commented 2 years ago

My review of GPI-anchored proteins reveals four methods used to specify the anchor:

  1. MOD only, using MOD:00818 "glycosylphosphatidylinositolated residue"
  2. MOD only, using a MOD for an (amino acid)-yl-glycosylphosphatidylinositolethanolamine (e.g., N-seryl-glycosylphosphatidylinositolethanolamine); I'll abbreviate this as aa-yl-GPI-ethanolamine.
  3. MOD+CHEBI, using MOD:00818 "glycosylphosphatidylinositolated residue" + CHEBI:24410 "glycosylphosphatidylinositol"
  4. MOD+CHEBI, using a MOD for amino-acid amide + CHEBI:63419 "acyl-GPI" (mentioned in issue #149)
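
As a rough illustration of the four patterns listed above, here is a minimal Python sketch that sorts a modification annotation into groups 1 to 4. The `Annotation` class, the `classify` function, and the label-suffix matching are assumptions made for this example only; they are not part of pathways2GO or Reactome. The identifiers MOD:00818, CHEBI:24410, and CHEBI:63419 are the ones quoted in the list.

```python
from dataclasses import dataclass
from typing import Optional

# Identifiers quoted in the list above.
GPI_RESIDUE_MOD = "MOD:00818"   # glycosylphosphatidylinositolated residue
GPI_CHEBI = "CHEBI:24410"       # glycosylphosphatidylinositol
ACYL_GPI_CHEBI = "CHEBI:63419"  # acyl-GPI


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified view of one modified-residue annotation."""
    mod_id: str
    mod_label: str
    chebi_id: Optional[str] = None


def classify(a: Annotation) -> Optional[int]:
    """Return the group number (1-4) from the list above, or None if no match."""
    if a.mod_id == GPI_RESIDUE_MOD:
        if a.chebi_id is None:
            return 1  # MOD only, generic GPI residue
        if a.chebi_id == GPI_CHEBI:
            return 3  # generic GPI residue MOD + GPI CHEBI
        return None
    label = a.mod_label.lower()
    # Assumption: aa-yl-GPI-ethanolamine terms are recognized by label suffix.
    if a.chebi_id is None and label.endswith("glycosylphosphatidylinositolethanolamine"):
        return 2
    # Assumption: amino-acid amide terms are recognized by a trailing " amide".
    if a.chebi_id == ACYL_GPI_CHEBI and label.endswith(" amide"):
        return 4
    return None
```

Matching on label suffixes is only a stand-in; a real check would walk the MOD hierarchy rather than inspect names.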

The EWAS in group 1 don't specify which amino acid is GPI-attached. I would argue that they could be made more specific. If they stay under the MOD:00818 hierarchy, they would resemble group 2, since all the aa-yl-GPI-ethanolamine types are children of MOD:00818. However, I don't actually recommend this (see below).

The EWAS in group 2 have the requisite specificity in terms of amino acid, but as a standard mechanism it is too inflexible, because not all GPI anchors are structurally the same. I initially had a difficult time trying to figure out if these differed in some way from 'regular' GPI, as the name implies GPI+ethanolamine. However, given that these are under the MOD for 'regular' GPI, I believe that these are actually just GPI attached to whichever amino acid, and that the ethanolamine part of the name is not intended as an extra moiety, but rather as the specific part of the GPI that's attached. That is to say, group 2 should be read as (for example) "serine that is attached to the ethanolamine moiety of GPI." Thus, I believe group 2 does refer to 'regular' GPI.

The EWAS in group 3 are the ones I called 'redundant' in issue #151. These have the same issue as group 1, in the sense that they too can be made more specific with respect to the amino acid bearing the anchor. These could be revised to include the more specific amino acid as described for group 1, and the CHEBI part would provide the GPI specificity.

The EWAS in group 4 use a mechanism that potentially provides specificity at both the amino acid and GPI levels. GPI anchors are attached to the C terminus of proteins using an amide linkage, and this mechanism captures that aspect. Note that the CHEBI part could refer to 'regular' GPI (CHEBI:24410 "glycosylphosphatidylinositol"), or to the acylated form (CHEBI:63419 "acyl-GPI") commonly seen in erythrocytes, or something else. Indeed, there is a class of cleaved-GPI proteins that also use the amino-acid amide + CHEBI notation, making this the predominant form used in Reactome.

I'm not really sure which mechanism to recommend. I would certainly suggest using a MOD+CHEBI mechanism for flexibility. Beyond that, I think it comes down to two possibilities. Both possibilities rely on the CHEBI part to provide the GPI specificity (either GPI, acyl-GPI, or other). The difference lies within the MOD part used to provide amino acid specificity:

  1. a MOD for aa-yl-GPI-ethanolamine (as in group 2)
  2. a MOD for amino-acid amide (as in group 4)

The benefit to the first version is that it more easily conveys to the user that there's a GPI attachment, and is thus consistent with the way most other MOD+CHEBI EWASes are described; that is, you know the amino acid that's modified, and you know (at least in general) what type of modification it is. The CHEBI part then indicates the specific modification of that type. The pitfall is that, as happened to me, there could be confusion as to what exactly that modification type is (recall that I initially interpreted it to mean GPI with an extra ethanolamine). The GPI-ethanolamine nomenclature does not seem to have widespread use. Note, however, that this form more closely matches the form used for, say, SUMOylation, where the specific amino acid modification is given via MOD, then the group is given via CHEBI, then the specificity of what that group is a part of is given via (usually) UniProt. In this case the modification and group are given by the MOD, and the specificity of what that group is part of is given via CHEBI.

The benefit to the second version is that it is widely known that GPI anchors attach to proteins using an amide linkage. The pitfall is that there's no indication in the MOD part what the modification type is. This version also lacks information as to the group involved in the attachment.

Thoughts?

deustp01 commented 2 years ago

Thought more about the issues raised here and in GitHub tickets #152 and #149, and have a unified fix for all three sets of problems, centered on ChEBI:143797, GPI-anchor amidated amino acid carboxyl end residue(1−).

[Image: Screen Shot 2022-03-14 at 4 49 50 PM]

The biochemistry of all of the covalently modified proteins in these tickets is that a protein is cleaved near its carboxyterminus and the moiety shown here is added to the new carboxyterminal residue in an amide linkage: the protein is at position R9 at the bottom of the chemical structure as drawn here. This moiety anchors the protein in a lipid bilayer membrane (typically Golgi membrane initially, which may be translocated via a vesicle to the plasma membrane). The membrane is above the top of the chemical structure as drawn here; the structure is associated with the membrane via acyl groups at one or more of positions R1, R2, and R3.

To annotate these proteoforms in Reactome, we can make groupModifiedResidue instances whose attributes are the number of the residue in the cleaved protein that is modified, ChEBI:143797 as the modification, and the appropriate amino acid amide as the psiMOD term (e.g., psiMOD:00105 serine amide). This annotation omits information about the exact number and identity of the anchoring fatty acid chains but should always be correct and, except for this one feature, should be complete. The lost information is a problem (to be solved in the future) if we want to annotate the final steps of the generation of a membrane-associated GPI-anchored protein as described in the paper Darren cited by Kinoshita (2020, PMID: 32156170), but it should be sufficient for now to allow us to correctly identify and distinguish GPI-modified proteoforms.
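
To make the proposed scheme concrete, here is a minimal Python sketch of the three attributes described above. The class and field names (`GroupModifiedResidue`, `coordinate`, `modification`, `psi_mod`) and the helper `make_gpi_residue` are illustrative assumptions, not Reactome's actual schema or curation tooling. The identifiers ChEBI:143797 and psiMOD:00105 (serine amide) are the ones named in this comment, and the residue number in the usage line is made up.

```python
from dataclasses import dataclass

# ChEBI term proposed in this comment as the single modification for all GPI anchors.
GPI_ANCHOR_CHEBI = "CHEBI:143797"


@dataclass(frozen=True)
class GroupModifiedResidue:
    """Hypothetical stand-in for a groupModifiedResidue instance."""
    coordinate: int    # number of the modified residue in the cleaved protein
    modification: str  # ChEBI term for the attached group
    psi_mod: str       # psiMOD term for the amino acid amide


def make_gpi_residue(coordinate: int, amide_psi_mod: str) -> GroupModifiedResidue:
    """Build a GPI-anchor annotation following the scheme outlined above."""
    if coordinate < 1:
        raise ValueError("residue coordinate must be a positive integer")
    return GroupModifiedResidue(
        coordinate=coordinate,
        modification=GPI_ANCHOR_CHEBI,
        psi_mod=amide_psi_mod,
    )


# Example: a protein whose new C-terminal residue is a serine at a made-up position.
example = make_gpi_residue(coordinate=100, amide_psi_mod="MOD:00105")
```

Fixing the ChEBI term in one place reflects the point of the proposal: only the residue number and the amino acid amide vary between proteoforms.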

So if all of this is OK, then the uniform fix is to create groupModifiedResidue instances following the scheme outlined here to replace all of the modified residues flagged as irregular or incomplete in all of these tickets. That is now done so I am tentatively closing all three tickets.