Closed nataled closed 2 years ago
Two pairs - R-HSA-66344 and R-HSA-5675738, R-HSA-2172676 and R-HSA-5423110 - were duplicates and have now been merged into single instances. All other pairs are intended to represent differently covalently modified versions of single canonical proteins; annotation details have been fixed to show these differences (details here: david_and_peter_notes > PRO-Reactome > GitHub156_reactome_EWAS_indistinguishable). Done, so closed. (Actually, not done yet)
It appears that you addressed the examples. However, there was a file attached. It looks like the file link appeared somewhere in the middle of all my text, so it was easily missed. This is the file: reactome_EWAS_indistinguishable.errors.txt
Progress - all but four duplication cases have been resolved, and those four have been handed over to the curator who made them to be checked. Results are tabulated in PRO_Reactome > GitHub156_reactome_EWAS_indistinguishable
The last four have been resolved, so I'm closing this one again.
The processing done to convert Reactome EWAS to PRO assumes that the 'same' proteoform will differ only with respect to location. There is, however, another assumption: that each EWAS is accurately described by UniProtKB identifier plus sequence range plus amino acid modifications (that is, these are the features that PRO uses to distinguish one proteoform from another). Put another way, two proteoforms are considered 'different' if they differ in any one of those features and, conversely, are considered the 'same' if no such difference exists. I therefore attempted to find cases where two EWASes 'look the same' with respect to those distinguishing features but are actually different. The attempt ended up serving as a kind of sanity check for EWASes, because some cases failed to show differences but clearly should have. The attached file presents the results of this check, focusing on those that might be errors in representation or are in some way incomplete representations. A following issue will present cases that will require discussion.
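For anyone who wants to reproduce the idea, here is a minimal sketch of the check, not the actual processing code. It assumes each EWAS has already been reduced to a small record; the `Ewas` type, its field names, and the `pro_key` and `indistinguishable_sets` helpers are all hypothetical names made up for illustration. Location is deliberately left out of the key, since that is the one feature allowed to differ.

```python
from collections import defaultdict
from typing import FrozenSet, List, NamedTuple, Optional, Tuple


class Ewas(NamedTuple):
    """Hypothetical, simplified EWAS record (field names are assumptions)."""
    stable_id: str
    name: str
    uniprot: str
    start: Optional[int]   # None stands for an ambiguous/unknown start
    end: Optional[int]     # None stands for an ambiguous/unknown end
    modifications: FrozenSet[Tuple[int, str]]  # (position, modification label)
    compartment: str       # location; NOT part of the distinguishing key


def pro_key(e: Ewas):
    # The distinguishing features described above:
    # UniProtKB identifier + sequence range + amino acid modifications.
    return (e.uniprot, e.start, e.end, e.modifications)


def indistinguishable_sets(ewases: List[Ewas]) -> List[List[Ewas]]:
    """Group records by key and keep only groups with more than one member."""
    groups = defaultdict(list)
    for e in ewases:
        groups[pro_key(e)].append(e)
    return [members for members in groups.values() if len(members) > 1]
```

Each returned set would then need a look at the names of its members: if the names suggest a real difference (a mutation, a different range, a terminus), the set is a candidate for the attached file.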
Each line will start with either PT-MUTANT, RANGE_DISCREPANCY, TERMINUS, or EWAS. An explanation of each type follows. I also provide an example or two for each type. Remember, based on UniProt+range+modifications, all members of each set are indistinguishable to PRO (so far).
PT-MUTANT: one or more EWAS has a name that indicates a point mutation, but the modifications don't reflect the name. These are, in my view, errors that need repair.
RANGE_DISCREPANCY: one or more EWAS in the set has a sequence range that differs from the one indicated in the name. These are, again, errors, I think.
TERMINUS: one or more EWAS in the set is indicated as being one or the other terminus of a protein, but it is currently not possible to tell this based on distinguishing features. I suspect some of these cases (most or all of which have at least one ambiguous start/end position) can be updated with numeric sequence termini. reactome_EWAS_indistinguishable.errors.txt
EWAS: some other difference is indicated in the name, but that difference is not captured by the PRO distinguishing features. Without going through these one by one, I cannot tell if they are errors or incomplete representations, or if they are examples of additional types needing discussion. I did a quick check and tried to keep only those that are likely to be of the error/incomplete type.
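To get a quick overview of how many lines fall under each type, something like the following sketch could be used. It relies only on the stated fact that each line starts with one of the four type labels; what follows the label on a line is not assumed, and the function names `category_of` and `tally` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional

# The four line types described above.
CATEGORIES = ("PT-MUTANT", "RANGE_DISCREPANCY", "TERMINUS", "EWAS")


def category_of(line: str) -> Optional[str]:
    """Return the type label a line starts with, or None if it has none."""
    stripped = line.lstrip()
    for label in CATEGORIES:
        if stripped.startswith(label):
            return label
    return None


def tally(lines: Iterable[str]) -> Counter:
    """Count lines per type, skipping lines that carry no known label."""
    return Counter(c for c in map(category_of, lines) if c is not None)
```

Usage would be along the lines of `tally(open("reactome_EWAS_indistinguishable.errors.txt"))`, assuming the attachment has been saved locally under that name.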