geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Reactome: indistinguishable EWAS (discussion needed) #157

Closed nataled closed 10 months ago

nataled commented 2 years ago

The processing done to convert Reactome EWAS to PRO makes the assumption that the 'same' proteoform will differ only with respect to location. There is, however, another assumption: that each EWAS is accurately described by its UniProtKB identifier plus sequence range plus amino acid modifications (that is, the features that PRO uses to distinguish one proteoform from another). Put another way, two proteoforms are considered 'different' if they differ in any one of those features and, conversely, are considered the 'same' if no such difference exists. I therefore attempted to find cases where two or more EWASes 'look the same' with respect to those distinguishing features but are actually different. Such cases will need to be discussed to determine, for example, whether or not different PRO terms are desired.
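For illustration only, the search described above can be sketched as a group-by on the distinguishing features. This is not the actual pathways2GO or PRO code. The `Ewas` record, `base_name`, and `indistinguishable_sets` are hypothetical names, and comparing display names with the compartment suffix removed is an assumed way to spot sets whose members might really differ.

```python
from collections import defaultdict
from typing import NamedTuple


class Ewas(NamedTuple):
    """Hypothetical record holding the fields shown in the attached file."""
    reactome_id: str
    location: str          # GO cellular component ID
    uniprot: str
    start: int
    end: int
    modifications: tuple   # e.g. sorted (MOD ID, position) pairs; empty if none
    name: str


def base_name(name: str) -> str:
    """Drop a trailing ' [compartment]' suffix from a Reactome display name."""
    return name.rsplit(" [", 1)[0] if name.endswith("]") else name


def indistinguishable_sets(entities):
    """Group EWAS by UniProt ID + range + modifications.

    Only groups whose members carry different base names are returned,
    since those are the ones that may hide a real difference.
    """
    groups = defaultdict(list)
    for e in entities:
        groups[(e.uniprot, e.start, e.end, e.modifications)].append(e)
    return {
        key: members
        for key, members in groups.items()
        if len({base_name(m.name) for m in members}) > 1
    }


EXAMPLES = [
    Ewas("R-HSA-419058", "GO:0005829", "O75116", 1, 1388, (), "ROCK2 [cytosol]"),
    Ewas("R-HSA-4687775", "GO:0005829", "O75116", 1, 1388, (), "Activated ROCK2 [cytosol]"),
    Ewas("R-HSA-6814416", "GO:0005829", "O14775", 1, 395, (), "GNB5 [cytosol]"),
]

for key, members in indistinguishable_sets(EXAMPLES).items():
    print(key, [m.reactome_id for m in members])
```

With these three records only the ROCK2 pair is reported, because the single GNB5 entry has nothing to be confused with.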

In the attached file reactome_EWAS_indistinguishable.discuss.txt each line will start with either ACTIVATION or FOLDING. An explanation of each type follows. Remember, based on UniProt+range+modifications, all members of each set are indistinguishable to PRO (so far).

ACTIVATION: one or more EWAS has a name that indicates the proteoform is activated in some way, or is the active form. We need to discuss whether or not it is important for PRO to capture that information and, if so, how to ensure it can happen.

R-HSA-419058    GO:0005829  O75116  1   1388            ROCK2 [cytosol]
R-HSA-4687775   GO:0005829  O75116  1   1388            Activated ROCK2 [cytosol]

FOLDING: one or more EWAS has a name that indicates differences in folding or conformation. Again, this needs to be discussed as to importance of representation.

R-HSA-6814204   GO:0005829  O14775  1   395         Unfolded GNB5 [cytosol]
R-HSA-6814416   GO:0005829  O14775  1   395         GNB5 [cytosol]
R-HSA-8850546   GO:0005829  O14775  1   395         Partially folded GNB5 [cytosol]
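As a rough sketch of how a set might be labeled ACTIVATION or FOLDING from names alone, the snippet below matches keywords in the display names. The keyword lists are assumptions: only "Activated", "Unfolded", and "Partially folded" appear in the examples above. `CATEGORY_PATTERNS` and `categorize` are hypothetical names, not part of any existing tool.

```python
import re

# Assumed keyword patterns. Word boundaries keep "inactive" from matching "active".
CATEGORY_PATTERNS = {
    "ACTIVATION": re.compile(r"\b(activated|active)\b", re.IGNORECASE),
    "FOLDING": re.compile(r"\b(unfolded|partially folded|misfolded|folded)\b", re.IGNORECASE),
}


def categorize(names):
    """Return the sorted categories suggested by any name in an indistinguishable set."""
    found = set()
    for name in names:
        for category, pattern in CATEGORY_PATTERNS.items():
            if pattern.search(name):
                found.add(category)
    return sorted(found)


print(categorize(["ROCK2 [cytosol]", "Activated ROCK2 [cytosol]"]))
print(categorize(["Unfolded GNB5 [cytosol]", "GNB5 [cytosol]", "Partially folded GNB5 [cytosol]"]))
```

Because this depends entirely on free-text names, it shares the weakness of any name-based approach: a state described with an unexpected word is missed.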

At a minimum I think we'll need input from @deustp01 and @ukemi. Please tag others as you see fit.

nataled commented 2 years ago

My take: these should be different PRO terms (for example, 'unfolded GNB5' is not the same entity as 'GNB5'). I say this because, in all the years I've looked, I've found only a single example (which, unfortunately, I cannot remember) where a conformational change or an activated form did NOT result from some other change. That is, usually something else causes the different conformation or activation state, typically a post-translational modification or binding of a ligand. Whatever the cause, there is a change of state to the protein. This is in contrast to a change in location of a protein whose state remains the same from one location to another (which we've already decided to treat as a single entity). If we decide to take this approach, we'll need to answer at least the following questions:

1) What if we know there's a conformational change (for example), but don't know the cause? Opinion: Let's say what we call 'unfolded GNB5' is actually GNB5 with some specific phosphorylation, but we don't know that. If we make 'unfolded GNB5' as a child of GNB5, we can always revise later with more information. So, if we later on discover the phosphorylation connection, we call the entity 'GNB5 phosphorylated ', define it as we would for any other phosphorylated proteoform, and make 'unfolded GNB5' a synonym.

2) How do we recognize that there are different terms needed? Opinion: I'd like to not have to rely on names to impart important information. One possible way to obtain the needed information in a structured way is to make use of an ontology term, similar to what is currently done with MOD and CHEBI. Unclear which ontology should house the term. For the term itself, as a straw man, I would suggest "modification that causes an unfolded conformation" (with, of course, related terms for folded, activated, etc). I start with this suggestion because it will allow Reactome to use a slot already dedicated to modifications. Declaring it a modification is definitely too strong, however, since we already know conformations can change for other reasons, so this needs work and further thought.

3) What to do about transitional states? While I don't know if 'partially folded GNB5' represents a transition state or something stable, for the sake of argument let us say it's an unstable transition state. Do we represent it as a distinct form? Opinion: No, we should only represent stable states, in the same way that GO (for example) doesn't represent transient protein associations as 'complexes'.

nataled commented 1 year ago

As I clear out other issues, the remaining ones loom larger. That's the case with this issue. I need to figure out what to do with these. The biggest problem is that there is currently no standardized way to find them; finding them relies purely on the entity name. The outcome, at the moment, is that the following will all map to a single entity, PR:O14775 (canonical GNB5):

Reactome:R-HSA-6814204 "Unfolded GNB5 [cytosol]"
Reactome:R-HSA-6814416 "GNB5 [cytosol]" <-- uncontroversial mapping to canonical
Reactome:R-HSA-8850546 "Partially folded GNB5 [cytosol]"

ukemi commented 1 year ago

Until we start distinguishing topologically differentiated proteins at the GO-CAM level, they should all map to the same identifier.

deustp01 commented 1 year ago

> The outcome, at the moment, is that the following will all map to a single entity, PR:O14775 (canonical GNB5):

I think that's the only choice - all map to a single PRO ID. There is nothing in the Reactome annotation except those free text phrases, "unfolded" and "partially folded", and that's not the basis of a PRO term definition. I hope the many-to-one mapping will behave in the same way as the many-to-ones that result from locating an EWAS defined by its UniProt ID and covalent modifications, in more than one compartment. And Noctua / GO-CAM will tolerate this because there is no requirement for the input and output of a reaction to be somehow different - the reaction in which we say that P12345 unfolds will transform into P12345 => P12345 which looks silly to us but is logically OK.

Is this OK?
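A minimal sketch of the many-to-one collapse described above, assuming a plain dictionary from Reactome EWAS ID to PRO ID. The three GNB5 mappings come from this thread. The folding reaction used as input is hypothetical, and `map_reaction` is an illustrative name, not a function in pathways2GO.

```python
# EWAS -> PRO mappings as stated in this thread (all three collapse to canonical GNB5).
EWAS_TO_PRO = {
    "R-HSA-6814204": "PR:O14775",  # Unfolded GNB5 [cytosol]
    "R-HSA-6814416": "PR:O14775",  # GNB5 [cytosol]
    "R-HSA-8850546": "PR:O14775",  # Partially folded GNB5 [cytosol]
}


def map_reaction(inputs, outputs, mapping=EWAS_TO_PRO):
    """Map reaction participants to PRO IDs.

    Returns (mapped_inputs, mapped_outputs, is_identity), where is_identity
    is True when both sides collapse to the same set of PRO terms.
    """
    mapped_in = sorted({mapping[i] for i in inputs})
    mapped_out = sorted({mapping[o] for o in outputs})
    return mapped_in, mapped_out, mapped_in == mapped_out


# Hypothetical folding reaction: unfolded GNB5 -> GNB5
print(map_reaction(["R-HSA-6814204"], ["R-HSA-6814416"]))
```

The `is_identity` flag marks reactions of the P12345 => P12345 kind, which are logically fine but may be worth listing so curators know where the folding information was lost.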

nataled commented 1 year ago

I'm fine either way. Mapping all to the one PRO term is what currently happens, but I could suppress that if it was desired. I'm satisfied that we're all on the same page, at least with respect to the folded/unfolded stuff. I'm also fine with folding activated/non-activated into a single PRO term if the difference is given only as free text within the entity name, though I'm fairly sure these will prove to be different entities at the molecular level at some point. I've already coded the ability to find when a Reactome EWAS changes its definition (that is, becomes a different PRO term), though it's unclear to me how that will play out in the context of GO-CAMs.

deustp01 commented 1 year ago

> I've already coded the ability to find when a Reactome EWAS changes its definition (that is, becomes a different PRO term), though it's unclear to me how that will play out in the context of GO-CAMs.

Best guess is that the new PRO term will propagate fine into the GO-CAM, but may create Reactome - GO-CAM discrepancies. If your code can send me a message whenever it detects a Reactome EWAS with a changed definition, we will work that into the Reactome QC process to keep everything consistent. Can that work? (I can also imagine unintended EWAS changes, which this would also help us detect and correct.)
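One way the change notification might look, sketched under the assumption that the EWAS-to-PRO mapping from each run is kept as a dictionary. The existing detection code is not shown in this thread, so `changed_definitions` and `format_report` are hypothetical, and `PR:EXAMPLE_NEW` is a made-up placeholder, not a real PRO ID.

```python
def changed_definitions(previous, current):
    """List EWAS present in both snapshots whose PRO term differs.

    Returns (ewas_id, old_pro_id, new_pro_id) tuples sorted by EWAS ID.
    """
    changes = []
    for ewas_id in sorted(previous.keys() & current.keys()):
        if previous[ewas_id] != current[ewas_id]:
            changes.append((ewas_id, previous[ewas_id], current[ewas_id]))
    return changes


def format_report(changes):
    """Render the changes as plain text suitable for an email or issue comment."""
    return "\n".join(f"{ewas_id}: {old} -> {new}" for ewas_id, old, new in changes)


# Placeholder snapshots; PR:EXAMPLE_NEW is not a real identifier.
previous_run = {"R-HSA-6814204": "PR:O14775", "R-HSA-6814416": "PR:O14775"}
current_run = {"R-HSA-6814204": "PR:EXAMPLE_NEW", "R-HSA-6814416": "PR:O14775"}

print(format_report(changed_definitions(previous_run, current_run)))
```

An empty report means no EWAS changed its PRO term between the two runs. EWAS that are new or deleted are ignored here and would need separate handling.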