geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Sub-sequence termini revisions/review #309

Open nataled opened 9 months ago

nataled commented 9 months ago

While verifying sub-sequence termini in PRO, I came across a few cases that might need revision in Reactome:

1) R-HSA-114620 and R-HSA-54851: It appears that UniProt has revised the canonical isoform to -15. The C terminus given for these Reactome EWASes (2386) reflects the previous canonical -1. The new canonical ends at 2477. Two possible fixes are (a) specify the isoform as -1; or (b) adjust the number from 2386 to 2477. It is likely the latter should be done, but I can't rule out that the original evidence was specific for the -1 isoform.

2) R-HSA-70607: This is a transit peptide removed form. UniProtKB has updated the transit peptide range from 1-16 to 1-54.

3) R-HSA-5225635: A complicated case. The current canonical isoform is -5, and the numbering shown on the EWAS page uses that numbering. However, when I check the schema page it shows the start and endpoint as 2-2236, which is correct for isoform -1 (the previous canonical).

4) R-HSA-913647: Doesn't seem to correspond to any 'chain' given in UniProtKB, nor to any isoform.

5) The following all show discrepancies between Reactome's main page and schema with respect to sequence endpoints, with the schema endpoints differing from what's shown in UniProtKB:
a) R-HSA-70607
b) R-HSA-1015677
c) R-HSA-3247745, R-HSA-3247746, R-HSA-3247747, R-HSA-3781954, R-HSA-3781960
d) R-HSA-947487
e) R-HSA-197604
f) R-HSA-1299434
g) R-HSA-197931
h) R-HSA-3006746
i) R-HSA-5173101
j) R-HSA-2975974, R-HSA-3132764
k) R-HSA-8863000, R-HSA-8863002
l) R-HSA-5173055
m) R-HSA-74825
n) R-HSA-5696909
o) R-HSA-60026
p) R-HSA-913640
q) R-HSA-5368243
r) R-HSA-5368216

Unfortunately, I'm sure I failed to catch a number of cases similar to case 5, because I did not routinely check the schema endpoints (I initially assumed they corresponded to what was shown on the public page). The schema issues are of concern because (I think) the schema is the source of the proteoform information that is given to PRO.
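The kind of endpoint comparison described in this comment can be sketched roughly as follows. This is only an illustration, not the actual PRO QC test or any Reactome code: the `SubSequence` class and the `compare_termini` function are hypothetical names, and the coordinates marked as placeholders are invented. Only the 2386/2477 end values (case 1) and the 17/55 start values implied by the transit peptide change from 1-16 to 1-54 (case 2) come from the cases above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SubSequence:
    """Start and end residue of a sub-sequence (1-based, inclusive)."""
    start: int
    end: int


def compare_termini(stored: SubSequence, reference: SubSequence) -> List[str]:
    """Describe every terminus where the stored range differs from the reference."""
    problems = []
    if stored.start != reference.start:
        problems.append(f"start {stored.start} != reference {reference.start}")
    if stored.end != reference.end:
        problems.append(f"end {stored.end} != reference {reference.end}")
    return problems


# Each entry pairs the range stored for an EWAS with the range currently
# expected from the reference sequence. Placeholder values are assumptions.
cases: Dict[str, Tuple[SubSequence, SubSequence]] = {
    # Case 1: the start (1) is a placeholder; the ends are from the issue text.
    "case 1": (SubSequence(1, 2386), SubSequence(1, 2477)),
    # Case 2: the end (400) is a placeholder; the starts follow from the
    # transit peptide range changing from 1-16 to 1-54.
    "case 2": (SubSequence(17, 400), SubSequence(55, 400)),
}

for label, (stored, reference) in cases.items():
    for problem in compare_termini(stored, reference):
        print(f"{label}: {problem}")
```

A comparison like this only reports that the two ranges disagree. Deciding whether to pin the isoform or change the coordinate, as in case 1, still needs a curator.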

deustp01 commented 9 months ago

Quick note on item 3. The start and end coordinates you see on the EWAS page are copied directly from our local, updated copy of the UniProt instance and show that we copied their sequence change correctly. The coordinates you see on the schema page show that we have failed to edit our EWAS instances to reflect that change.

Question to help with further planning - why are these discrepancies only being detected now? Are you running new tests that you didn't do previously, or are these new problems due to UniProt or Reactome changes made since the last round of PRO QC checking you did?

nataled commented 9 months ago

This is a new test that I added to PRO quality control. Roughly 50% of the cases I found were reflected in Reactome EWASes (all of these from a Reactome EWAS addition done about 8 years ago).

deustp01 commented 9 months ago

OK. So the new test detects a longstanding problem: a change in the numbering of a UniProt canonical entity is not detected at Reactome, so Reactome EWASs that derive residue numbering from that UniProt entity are out of sync and therefore wrong.

As the comparison you did above shows, the change is actually detected but we do not recognize its consequences and therefore don't do anything about it. Development work is underway to change that, so that we do identify UniProt changes that could affect existing Reactome EWASs and notify curators to revise the affected EWASs, as a first step. As a more ambitious second step, I hope we will be able to identify a subset of cases where the corrections can be made automatically, or, less ambitious, that the script can suggest the exact edits that are needed so all the human curator should need to do is approve them.

Meanwhile, the list that started this ticket will go out to curators for manual review and revision.
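To make the two-step plan above concrete, here is a minimal sketch of how a script might triage affected EWASs. It is not the development work mentioned in this comment. All class and function names, the `ACC_X` accession, and the rule used to pick "suggestable" cases (an EWAS that ends exactly at the old canonical length) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CanonicalChange:
    """A change in the length of a UniProt canonical sequence (hypothetical record)."""
    accession: str
    old_length: int
    new_length: int


@dataclass(frozen=True)
class Ewas:
    """Minimal stand-in for an EWAS: identifier, reference accession, coordinates."""
    stable_id: str
    accession: str
    start: int
    end: int


@dataclass(frozen=True)
class Suggestion:
    stable_id: str
    action: str  # "update_end" or "manual_review"
    new_end: Optional[int] = None


def triage(ewas_list: List[Ewas], changes: List[CanonicalChange]) -> List[Suggestion]:
    """Flag EWASs whose reference sequence changed length.

    Assumed rule: if the EWAS ended at the old C terminus, propose the new
    length as the end coordinate; anything else goes to manual review.
    """
    by_accession = {c.accession: c for c in changes}
    suggestions = []
    for ewas in ewas_list:
        change = by_accession.get(ewas.accession)
        if change is None or change.old_length == change.new_length:
            continue  # reference unchanged, nothing to report
        if ewas.end == change.old_length:
            suggestions.append(Suggestion(ewas.stable_id, "update_end", change.new_length))
        else:
            suggestions.append(Suggestion(ewas.stable_id, "manual_review"))
    return suggestions


# Lengths 2386 and 2477 are taken from case 1; identifiers are placeholders.
changes = [CanonicalChange("ACC_X", 2386, 2477)]
ewas_list = [Ewas("EWAS-1", "ACC_X", 1, 2386), Ewas("EWAS-2", "ACC_X", 100, 200)]
for suggestion in triage(ewas_list, changes):
    print(suggestion)
```

Even an "update_end" suggestion should stay a proposal for a curator to approve, since the original evidence may have been specific to the previous canonical isoform.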

nataled commented 9 months ago

Was able to recover a majority of the ones I didn't do the schema check for:

R-HSA-1008241 R-HSA-1236771 R-HSA-1236813 R-HSA-1456459 R-HSA-1456468 R-HSA-1498767 R-HSA-1524106 R-HSA-1524113 R-HSA-158545 R-HSA-1614525 R-HSA-164312 R-HSA-166251 R-HSA-166394 R-HSA-167703 R-HSA-170805 R-HSA-170818 R-HSA-180528 R-HSA-191348 R-HSA-192146 R-HSA-193975 R-HSA-1964509 R-HSA-197984 R-HSA-199305 R-HSA-211013 R-HSA-212281 R-HSA-2130340 R-HSA-2142706 R-HSA-2142768 R-HSA-264956 R-HSA-265084 R-HSA-2980863 R-HSA-3004516 R-HSA-3301975 R-HSA-3302050 R-HSA-350587 R-HSA-350610 R-HSA-350699 R-HSA-350736 R-HSA-350752 R-HSA-350753 R-HSA-350808 R-HSA-351194 R-HSA-375320 R-HSA-3828056 R-HSA-3878117 R-HSA-390925 R-HSA-418465 R-HSA-418501 R-HSA-420043 R-HSA-429010 R-HSA-432693 R-HSA-444585 R-HSA-444591 R-HSA-448770 R-HSA-448798 R-HSA-508581 R-HSA-5250557 R-HSA-5250576 R-HSA-5334660 R-HSA-53845 R-HSA-54049 R-HSA-5433075 R-HSA-5490321 R-HSA-55875 R-HSA-5610414 R-HSA-5626960 R-HSA-5626972 R-HSA-5687032 R-HSA-5690054 R-HSA-5696151 R-HSA-57842 R-HSA-59502 R-HSA-60022 R-HSA-60024 R-HSA-60136 R-HSA-67421 R-HSA-6787806 R-HSA-6791197 R-HSA-6798510 R-HSA-6798716 R-HSA-6804793 R-HSA-6804799 R-HSA-6804817 R-HSA-6807524 R-HSA-6809577 R-HSA-6809628 R-HSA-6809857 R-HSA-6809868 R-HSA-68652 R-HSA-70455 R-HSA-71066 R-HSA-71297 R-HSA-71565 R-HSA-71713 R-HSA-71798 R-HSA-72377 R-HSA-74841 R-HSA-8864566 R-HSA-8868799 R-HSA-8868849 R-HSA-8869106 R-HSA-8932647 R-HSA-8942610 R-HSA-8943831 R-HSA-8953432 R-HSA-8957213 R-HSA-8985251 R-HSA-917732 R-HSA-937309 R-HSA-975603 R-HSA-9756394 R-HSA-977578

All of these are cases where the initiator methionine is no longer annotated as removed in UniProtKB.
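A check for this specific pattern might look something like the sketch below. It assumes that each EWAS of interest is a full-length form whose start is either 1 or 2, and that a per-EWAS flag is available saying whether the reference currently annotates the initiator methionine as removed. The function names, the input dictionaries, and the example identifiers are hypothetical.

```python
from typing import Dict, List


def expected_start(init_met_removed: bool) -> int:
    """A full-length form starts at residue 2 if the initiator methionine is removed, else at 1."""
    return 2 if init_met_removed else 1


def flag_stale_starts(ewas_starts: Dict[str, int],
                      init_met_removed: Dict[str, bool]) -> List[str]:
    """Return identifiers whose stored start disagrees with the current annotation.

    Only starts of 1 or 2 are considered; other starts (for example after a
    signal or transit peptide) are outside the scope of this check.
    """
    flagged = []
    for stable_id, start in ewas_starts.items():
        if start not in (1, 2) or stable_id not in init_met_removed:
            continue
        if start != expected_start(init_met_removed[stable_id]):
            flagged.append(stable_id)
    return sorted(flagged)


# Placeholder identifiers and values, for illustration only.
starts = {"EWAS-A": 2, "EWAS-B": 2, "EWAS-C": 55}
removed = {"EWAS-A": False, "EWAS-B": True, "EWAS-C": False}
print(flag_stale_starts(starts, removed))
```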