geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Reactome: indistinguishable EWAS (errors/incomplete representations) #156

Closed nataled closed 2 years ago

nataled commented 2 years ago

The processing that converts Reactome EWAS to PRO assumes that the 'same' proteoform will differ only with respect to location. There is, however, a second assumption: that each EWAS is accurately described by its UniProtKB identifier plus sequence range plus amino acid modifications (these are the features PRO uses to distinguish one proteoform from another). Put another way, two proteoforms are considered 'different' if they differ in any one of those features and, conversely, are considered the 'same' if no such difference exists. I therefore attempted to find cases where two EWASes 'look the same' with respect to those distinguishing features but are actually different. The attempt ended up serving as a kind of sanity check for EWASes, because some sets showed no differences in those features when they clearly should have. The attached file presents the results of this check, focusing on cases that might be errors in representation or are in some way incomplete representations. A following issue will present cases that will require discussion.

Each line will start with either PT-MUTANT, RANGE_DISCREPANCY, TERMINUS, or EWAS. An explanation of each type follows. I also provide an example or two for each type. Remember, based on UniProt+range+modifications, all members of each set are indistinguishable to PRO (so far).
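For readers who want to reproduce the grouping, here is a minimal sketch of the idea, not the script actually used for the attached file. It assumes each EWAS has already been parsed into a record; the `Ewas` type, its field names, and the helper functions are hypothetical. The key follows the description above (UniProt + range + modifications), so location is deliberately left out of it.

```python
from collections import defaultdict
from typing import NamedTuple


class Ewas(NamedTuple):
    # Hypothetical record layout; field names are illustrative only.
    reactome_id: str
    location: str       # GO cellular component ID
    uniprot: str
    start: str          # kept as text because "?" occurs in the data
    end: str
    modifications: str  # e.g. "+21=MOD:00085", or "" when there are none
    name: str


def pro_key(e: Ewas) -> tuple:
    """Features assumed to distinguish one proteoform from another."""
    return (e.uniprot, e.start, e.end, e.modifications)


def indistinguishable_sets(records):
    """Return groups of two or more EWAS records sharing the same key."""
    groups = defaultdict(list)
    for e in records:
        groups[pro_key(e)].append(e)
    return [g for g in groups.values() if len(g) > 1]


records = [
    Ewas("R-HSA-9021349", "GO:0005654", "P78545", "1", "371", "",
         "p-S68-ELF3 [nucleoplasm]"),
    Ewas("R-HSA-9021353", "GO:0005654", "P78545", "1", "371", "",
         "ELF3 [nucleoplasm]"),
    Ewas("R-HSA-173672", "GO:0005576", "P01031", "752", "1676", "",
         "C5b alpha' [extracellular region]"),
]
sets = indistinguishable_sets(records)
```

With these three records, only the two ELF3 entries end up in a shared set; the C5 entry has no partner in this small sample.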

PT-MUTANT: one or more EWAS has a name that indicates a point mutation, but the modifications don't reflect the name. These are, in my view, errors that need repair.

R-HSA-9670033   GO:0005576  P00451  20  391 +303=MOD:01631+303=MOD:00015        F8(20-391) A303E [extracellular region]
R-HSA-9670059   GO:0005576  P00451  20  391 +303=MOD:01631+303=MOD:00015        F8(20-391) S308L [extracellular region]
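A rough way to flag PT-MUTANT candidates is to compare the position in the name with the positions listed in the modifications column. The sketch below assumes point mutations appear in names as letter-digits-letter tokens (such as A303E) and that modifications use the `+position=MOD:id` form seen above; the function name and both patterns are illustrative. It checks positions only and does not verify that the MOD identifiers encode the named substitution.

```python
import re

# Assumed name pattern for a point mutation, e.g. "A303E" or "S308L".
POINT_MUTATION = re.compile(r"\b([A-Z])(\d+)([A-Z])\b")
# Assumed modification pattern, e.g. "+303=MOD:01631".
MOD_POSITION = re.compile(r"\+(\d+)=")


def unreflected_mutations(name: str, modifications: str) -> list:
    """List mutations named in `name` whose position has no modification."""
    modified_positions = {int(p) for p in MOD_POSITION.findall(modifications)}
    return [
        f"{ref}{pos}{alt}"
        for ref, pos, alt in POINT_MUTATION.findall(name)
        if int(pos) not in modified_positions
    ]


mods = "+303=MOD:01631+303=MOD:00015"
ok = unreflected_mutations("F8(20-391) A303E [extracellular region]", mods)
flagged = unreflected_mutations("F8(20-391) S308L [extracellular region]", mods)
```

For the two rows above, the A303E entry passes the position check, while S308L is flagged because nothing in its modifications refers to position 308.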

RANGE_DISCREPANCY: one or more EWAS in the set has a sequence range that differs from the one indicated in its name. These, again, are errors, I think.

R-HSA-173672    GO:0005576  P01031  752 1676            C5b alpha' [extracellular region]
R-HSA-8852713   GO:0005576  P01031  752 1676            C5(965-1676) [extracellular region]

R-HSA-215926    GO:0005576  P08572  184 1712            COL4A2(184-1712) [extracellular region]
R-HSA-4085047   GO:0005576  P08572  184 1712            COL4A2(1486-1712) [extracellular region]
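The RANGE_DISCREPANCY comparison can be sketched in the same spirit. This assumes the range in a name is written as `(start-end)` and that the annotated start and end arrive as text (because "?" occurs elsewhere in the data); the function name and pattern are hypothetical.

```python
import re

# Assumed form of a sequence range embedded in an EWAS name, e.g. "(965-1676)".
NAME_RANGE = re.compile(r"\((\d+)-(\d+)\)")


def range_discrepancy(name: str, start: str, end: str):
    """Return (named, annotated) ranges when they differ, else None.

    None is also returned when the name carries no range or when the
    annotated start/end are not numeric (for example "?").
    """
    m = NAME_RANGE.search(name)
    if m is None or not (start.isdigit() and end.isdigit()):
        return None
    named = (int(m.group(1)), int(m.group(2)))
    annotated = (int(start), int(end))
    return None if named == annotated else (named, annotated)


c5 = range_discrepancy("C5(965-1676) [extracellular region]", "752", "1676")
col_a = range_discrepancy("COL4A2(184-1712) [extracellular region]", "184", "1712")
col_b = range_discrepancy("COL4A2(1486-1712) [extracellular region]", "184", "1712")
```

Only the entries whose name carries a range different from the annotated one are reported; names without a range, such as C5b alpha', are skipped rather than guessed at.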

TERMINUS: one or more EWAS in the set is indicated as being one or the other terminus of a protein, but it is currently not possible to tell this based on distinguishing features. I suspect some of these cases (most or all of which have at least one ambiguous start/end position) can be updated with numeric sequence termini. reactome_EWAS_indistinguishable.errors.txt

R-HSA-2470618   GO:0005576  P12107  ?   ?   +=MOD:01914     5-Gal-Hyl-collagen alpha-1(XI) chain N-term fragment [extracellular region]
R-HSA-2470595   GO:0005576  P12107  ?   ?   +=MOD:01914     5-Gal-Hyl-collagen alpha-1(XI) chain C-term fragment [extracellular region]

EWAS: some other difference is indicated in the name, but that difference is not captured by the PRO distinguishing features. Without going through these one by one, I cannot tell whether they are errors or incomplete representations, or whether they are examples of additional types needing discussion. I did a quick check and tried to keep only those that are likely to be of the error/incomplete type.

R-HSA-2172676   GO:0005654  P62805  2   103 +21=MOD:00085       MeK-HIST1H4A [nucleoplasm]
R-HSA-5423110   GO:0005654  P62805  2   103 +21=MOD:00085       MeK21-HIST1H4 [nucleoplasm]

R-HSA-9021349   GO:0005654  P78545  1   371         p-S68-ELF3 [nucleoplasm]
R-HSA-9021353   GO:0005654  P78545  1   371         ELF3 [nucleoplasm]

R-HSA-66344 GO:0005886  P19438  22  455         TNFRSF1A(22-455) [plasma membrane]
R-HSA-5675738   GO:0005886  P19438  22  455         TNFRSF1A [plasma membrane]

deustp01 commented 2 years ago

Two pairs (R-HSA-66344 and R-HSA-5675738; R-HSA-2172676 and R-HSA-5423110) were duplicates and have now been merged into single instances. All other pairs are intended to represent differently covalently modified versions of single canonical proteins; the annotations have been fixed to show these differences (details here: david_and_peter_notes > PRO-Reactome > GitHub156_reactome_EWAS_indistinguishable). Done, so closed. (Actually, not done yet.)

nataled commented 2 years ago

It appears that you addressed the examples. However, there was a file attached. It looks like the file link appeared somewhere in the middle of all my text, so it was easily missed. This is the file: reactome_EWAS_indistinguishable.errors.txt

deustp01 commented 2 years ago

Progress: all but four duplication cases have been resolved, and those four have been handed over to the curator who created them for checking. Results are tabulated in PRO_Reactome > GitHub156_reactome_EWAS_indistinguishable.

deustp01 commented 2 years ago

The last four have been resolved, so I'm closing this one again.