cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0

mutations disappear with change in transcript #9517

Closed tmazor closed 2 years ago

tmazor commented 2 years ago

User reported: https://groups.google.com/g/cbioportal/c/Am93Ex23cKw

Query KRAS in GENIE v11: https://genie.cbioportal.org/results/mutations?cancer_study_list=genie_public&Z_SCORE_THRESHOLD=2.0&RPPA_SCORE_THRESHOLD=2.0&profileFilter=mutations&case_set_id=genie_public_all&gene_list=KRAS&geneset_list=%20&tab_index=tab_visualize&Action=Submit

There are 20270 mutations (screenshot omitted).

If I filter to G12V, there are 4457 mutations (screenshot omitted).

Now switch to the second transcript in the list (NM_004985): the total number of mutations drops to 14937 (screenshot omitted).

Filter to G12V again and there are only 56 mutations (screenshot omitted).

I don't know how much these 2 transcripts differ, so maybe this is expected, but I did notice that the 56 G12V mutations after switching transcripts are almost exclusively from GRCC & UHN, and there's only ~3 from DFCI/MSK -- given the overall numbers of samples per site, this struck me as very odd. Could there be a data-related reason for that bias in mutations after switching transcripts?

tmazor commented 2 years ago

User identified an additional oddity that supports there being some underlying issues:

Querying all samples results in 20284 mutations on the default transcript and 14951 mutations with the alternate transcript

Querying samples with CNA & mutation results in 14755 mutations on the default transcript and 14076 mutations with the alternate transcript

This suggests to me that most of the mutations that get lost in the transcript conversion are from samples without CNA data, which seems to support perhaps a data/center related issue
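The arithmetic behind that reading can be spelled out as a quick check. This sketch assumes the "CNA & mutation" sample set is a subset of "all samples" and that mutation counts are additive across the two groups; the variable names are made up for illustration.

```python
# Mutation counts reported above (default transcript, alternate transcript).
all_default, all_alt = 20284, 14951   # all samples
cna_default, cna_alt = 14755, 14076   # samples with CNA & mutation data

lost_all = all_default - all_alt      # mutations lost across all samples
lost_cna = cna_default - cna_alt      # mutations lost in samples with CNA data
lost_no_cna = lost_all - lost_cna     # remainder: samples without CNA data

share_no_cna = lost_no_cna / lost_all
print(lost_all, lost_cna, lost_no_cna, f"{share_no_cna:.1%}")
# prints: 5333 679 4654 87.3%
```

Under those assumptions, roughly 87% of the mutations lost in the transcript switch come from samples without CNA data.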

tmazor commented 2 years ago

Sorry, realized I left out a key detail - this doesn't happen if you query for KRAS in curated non-redundant studies in public portal - for public portal query, changing transcripts results in basically the same number of mutations: https://www.cbioportal.org/results/mutations?tab_index=tab_visualize&Action=Submit&session_id=627e5a200934121b56df4047&plots_horz_selection=%7B%7D&plots_vert_selection=%7B%7D&plots_coloring_selection=%7B%7D&mutations_transcript_id=ENST00000311936

leexgh commented 2 years ago

Query KRAS in GENIE v11: https://genie.cbioportal.org/results/mutations?cancer_study_list=genie_public&Z_SCORE_THRESHOLD=2.0&RPPA_SCORE_THRESHOLD=2.0&profileFilter=mutations&case_set_id=genie_public_all&gene_list=KRAS&geneset_list=%20&tab_index=tab_visualize&Action=Submit

Comparing the canonical transcript (ENST00000256078) with the second, non-canonical transcript (ENST00000311936).

Compare without filtering:

(two screenshots omitted)

Compare with filtering:

(two screenshots omitted)

Reasons:

On the canonical transcript we show the original mutation data. On a non-canonical transcript we overwrite some fields with Genome Nexus annotation results. To request those annotations, we extract the unique genomic locations from the mutations and send them to Genome Nexus. Two of these genomic locations actually point to the same variant:

  1. 12-25398284-25398284-C-A (extracted from 4401 mutations)
  2. 12-25398283-25398284-AC-AA (extracted from 1 mutation)

The second location carries a redundant leading "A" in both the reference and the variant allele.
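To make the equivalence concrete, here is a minimal sketch of trimming bases shared by the reference and variant alleles. It is not Genome Nexus's actual normalization code: the function name `normalize` is made up, it keeps at least one base on each side, and it does not handle indels or "-" alleles.

```python
def normalize(location: str) -> str:
    """Trim bases shared by ref and alt, shifting coordinates to match.

    Expects "chromosome-start-end-ref-alt" with non-empty ref and alt.
    """
    chrom, start, end, ref, alt = location.split("-")
    start, end = int(start), int(end)
    # Drop shared leading bases; each one moves the start right by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    # Drop shared trailing bases; each one moves the end left by one.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
        end -= 1
    return f"{chrom}-{start}-{end}-{ref}-{alt}"


print(normalize("12-25398283-25398284-AC-AA"))
# prints: 12-25398284-25398284-C-A
```

With the shared "A" trimmed, the second location becomes identical to the first.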

Genome Nexus normalizes both to a single genomic location (12-25398284-25398284-C-A) and then queries for its annotation. When it sends the annotation back to cBioPortal, Genome Nexus sets the key to only one of the original queries:

12-25398283-25398284-AC-AA

So only the second mutation, whose key is "12-25398283-25398284-AC-AA", finds its annotation and appears in the table. The other 4401 mutations cannot be mapped back because their key is missing from the response. That is how we lose 4401 "G12V" mutations on the non-canonical transcript.
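The key mismatch can be reproduced with a toy lookup. This is only an illustration of the mechanism, not the frontend code: the dictionary shapes and the `protein_change` field name are hypothetical, and the counts are the ones quoted in this comment.

```python
# Number of mutations per genomic-location key, as reported above.
mutation_counts = {
    "12-25398284-25398284-C-A": 4401,
    "12-25398283-25398284-AC-AA": 1,
}

# Hypothetical response: both queries were normalized to one location,
# and the single annotation came back keyed by only one original query.
annotations_by_key = {
    "12-25398283-25398284-AC-AA": {"protein_change": "G12V"},
}

shown = sum(n for key, n in mutation_counts.items() if key in annotations_by_key)
lost = sum(n for key, n in mutation_counts.items() if key not in annotations_by_key)
print(shown, lost)
# prints: 1 4401
```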

I believe "12-25398283-25398284-AC-AA" should actually be represented as "12-25398284-25398284-C-A". @inodb @ritikakundra Do you think this is a data issue?

inodb commented 2 years ago

@leexgh it seems like this is both a data issue and an issue with how we handle Genome Nexus queries on the frontend. The mapping back shouldn't fail after the genomic change is normalized. Is there any way we can adjust the code to accommodate this? I believe we had a solution for this because we hit the same issue with the command line annotator.

leexgh commented 2 years ago

This PR should fix the problem: https://github.com/genome-nexus/genome-nexus/pull/620

After merging, Genome Nexus should return two annotation objects, one for

12-25398283-25398284-AC-AA

and

12-25398284-25398284-C-A

Both keys can then find a matching annotation, so the 4401 mutations at 12-25398284-25398284-C-A are no longer lost.
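The intended behavior can be sketched as follows. This is not the code from the PR: `normalize`, `fetch_annotation`, and `annotate_each_query` are hypothetical stand-ins, and the normalizer only trims shared leading bases. The point is that every original query key gets its own entry in the response, even when several queries normalize to the same location.

```python
def normalize(location: str) -> str:
    """Toy normalizer: trim leading bases shared by ref and alt."""
    chrom, start, end, ref, alt = location.split("-")
    start = int(start)
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, start = ref[1:], alt[1:], start + 1
    return f"{chrom}-{start}-{end}-{ref}-{alt}"


def fetch_annotation(normalized: str) -> dict:
    """Hypothetical stand-in for the real annotation lookup."""
    return {"normalized_location": normalized}


def annotate_each_query(queries: list[str]) -> dict[str, dict]:
    """Return one annotation per original query, annotating each location once."""
    cache: dict[str, dict] = {}
    result: dict[str, dict] = {}
    for query in queries:
        normalized = normalize(query)
        if normalized not in cache:
            cache[normalized] = fetch_annotation(normalized)
        # Key the response by the original query, not the normalized form.
        result[query] = cache[normalized]
    return result


queries = ["12-25398283-25398284-AC-AA", "12-25398284-25398284-C-A"]
annotations = annotate_each_query(queries)
print(len(annotations))
# prints: 2
```

Keying the result by the original query string means the caller can always map annotations back to its mutations, regardless of how the location was normalized internally.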