AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Proposed Analysis: add scavenging of cancer hotspots to consensus SNV calls #819

Closed jharenza closed 3 years ago

jharenza commented 4 years ago

What analysis are you proposing and why?

Create a new MAF that contains the consensus SNV calls plus cancer hotspot calls missed by consensus calling, as described below.

We previously noticed that by taking a 3/3 approach for consensus calls, we are inevitably missing some cancer hotspot mutations. We got around that for one specific cancer (DMGs) because we have clinical reports containing histone variant calls that we can add into the molecular subtyping pathology module (#735 and #751). However, we are likely still missing some cancer hotspot mutations, so I propose that we add a final step in which we scavenge back cancer hotspot mutations using a well-curated, downloadable list of them.

What changes need to be made? Please provide enough detail for another participant to make the update.

The next step would be to assess whether any of these hotspot mutations are being missed by the 3/3 method and then determine a set of rules for adding these mutations back to the consensus SNV file. For example:

  1. If the hotspot is present in 2/3, then retain
  2. If the hotspot is present in 1/3, and has X reads supporting the tumor allele (5?), then retain
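
A minimal sketch of how these two example rules could be expressed, assuming each candidate hotspot call comes with the number of callers that reported it and the tumor alt-allele read count. The function name `retain_hotspot` and the parameter `min_alt_reads` are hypothetical, and the default of 5 is only the placeholder value floated above ("X reads ... (5?)"), not an agreed threshold.

```python
def retain_hotspot(n_callers, tumor_alt_reads, min_alt_reads=5):
    """Return True if a known-hotspot call should be kept.

    n_callers       -- number of callers that reported the variant
    tumor_alt_reads -- reads supporting the tumor (alt) allele
    min_alt_reads   -- hypothetical threshold for single-caller rescue
    """
    if n_callers >= 2:
        # Rule 1: present in 2/3 callers (3/3 is already in the consensus).
        return True
    if n_callers == 1:
        # Rule 2: present in 1/3 callers with enough supporting reads.
        return tumor_alt_reads >= min_alt_reads
    return False


# Made-up (n_callers, tumor_alt_reads) pairs, for illustration only.
for n_callers, alt_reads in [(3, 40), (2, 3), (1, 8), (1, 2), (0, 0)]:
    print(n_callers, alt_reads, retain_hotspot(n_callers, alt_reads))
```

Keeping the threshold as a parameter makes it easy to revisit once the number of missed hotspots has actually been assessed.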

Perhaps the new file can be called pbta-consensus-snvs-plus-hotspot.maf.gz

What input data should be used? Which data were used in the version being updated?

Cancer hotspots table, downloadable here: https://www.cancerhotspots.org/#/download plus TERT promoter mutations, noted from this paper.

> In 2013, two hotspot point mutations were found in the TERT promoter in 71% of melanomas (32,33). The mutations were located 124bp and 146bp upstream of the translation start site and referred to as C228T and C250T, respectively, based on their hg19 genomic coordinates.

pbta-snv-consensus-mutation.maf.tsv.gz
pbta-snv-lancet.vep.maf.gz
pbta-snv-mutect2.vep.maf.gz
pbta-snv-strelka2.vep.maf.gz
pbta-snv-vardict.vep.maf.gz (maybe? TCGA was not run with VarDict, so perhaps we should add a separate step to assess which hotspots are detected only by VarDict)
pbta-tcga-snv-lancet.vep.maf.gz
pbta-tcga-snv-mutect2.vep.maf.gz
pbta-tcga-snv-strelka2.vep.maf.gz
tcga-snv-consensus-snv.maf.tsv.gz

When do you expect the analysis will be completed?

not sure

Who will complete the updated analysis?

~~@migbro~~ @kgaonkar6

jashapiro commented 4 years ago

I think this is a good idea, but I would keep it as a separate analysis from the general consensus. In other words, I would not "scavenge" back mutations into the consensus, but rather include an entirely separate analysis that evaluates known mutations. This would keep the standards clear and separate de novo analysis from analysis with outside influence.

jharenza commented 4 years ago

> I think this is a good idea, but I would keep it as a separate analysis from the general consensus. In other words, I would not "scavenge" back mutations into the consensus, but rather include an entirely separate analysis that evaluates known mutations. This would keep the standards clear and separate de novo analysis from analysis with outside influence.

Ok - yeah I went back and forth on that. Thanks!

jharenza commented 3 years ago

@kgaonkar6, after our internal discussion, I think we need to first determine our hotspot list:

  1. Look into using both versions of the hotspot table linked above. There are 470 hotspots in V1, 1110 in V2, and 221 of them from V1 are not in V2. I initially was thinking we would want a union, but I don't really recognize the list of V1 only, so maybe they were removed because they were FP. So, we may go with V2. I was thinking we could first assess how many of those V1 only hotspots are being missed in our dataset and if it makes sense to keep them or not.
  2. Add the TERT promoter mutations above.
  3. Download the latest version of COSMIC mutations and determine whether we are missing any of these from V2 - these could also possibly be added.
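
The V1/V2 comparison described here is plain set arithmetic. If the quoted counts refer to distinct hotspots, 470 - 221 = 249 hotspots would be shared and a union would hold 1110 + 221 = 1331. Below is a rough sketch, assuming each hotspot can be keyed as a (gene, amino acid position) pair. The function name and the toy entries are placeholders, not real rows from either table.

```python
def compare_hotspot_lists(v1, v2, observed=()):
    """Summarise two hotspot lists and which V1-only sites appear in our data.

    v1, v2   -- iterables of (gene, position) keys from each table version
    observed -- (gene, position) keys seen in our own caller output
    """
    v1, v2, observed = set(v1), set(v2), set(observed)
    v1_only = v1 - v2
    return {
        "shared": v1 & v2,
        "v1_only": v1_only,
        "union": v1 | v2,
        # V1-only hotspots that would be lost by going with V2 alone.
        "v1_only_observed": v1_only & observed,
    }


# Placeholder keys only; real keys would come from the downloaded tables.
v1 = {("GENE_A", 10), ("GENE_B", 20), ("GENE_C", 30)}
v2 = {("GENE_A", 10), ("GENE_B", 20), ("GENE_D", 40)}
observed = {("GENE_C", 30), ("GENE_D", 40)}

summary = compare_hotspot_lists(v1, v2, observed)
print({name: len(keys) for name, keys in summary.items()})
```

Counting `v1_only_observed` is one way to judge whether keeping the V1-only sites changes anything in practice before deciding between the union and V2 alone.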

If that makes sense, I think that can be the first PR for this series. Thanks!

jharenza commented 3 years ago

We are having a call on Thursday Jan 28 with David Wheeler (St Jude, formerly BCM), who has done this sort of thing while leading the BCM Genomics Lab. We might also want to add pediatric-specific genes such as those from Ma, 2018 and Grobner, 2018.

kgaonkar6 commented 3 years ago

I don't think there are annotations in MAF format that we could filter on using the information in the paper describing the TERT promoter variants. Should I instead filter on the exact genomic sites to capture them?

I believe chr5 | 1295113 | 1295113, which is annotated with existing_variant rs1242535815,COSM1716563,COSM1716558 and is 66 bp away from the TSS, is what we are looking for as C228T.

And chr5 | 1295135 | 1295135, which is 88 bp away from the TSS, is the COSM1716559 variant that corresponds to the C250T promoter variant.

From my google searches :D https://www.slideshare.net/ThermoFisher/taqman-dpcr-liquid-biopsy-assays-targeting-the-tert-promoter-region https://assets.thermofisher.com/TFS-Assets/LSG/posters/taqman-dpcr-tert-promoter-poster.pdf

I checked strelka for upstream variants as a check and we have both these sites (along with others):

| | Chromosome | Start_Position | End_Position | Reference_Allele | Tumor_Seq_Allele2 | Hugo_Symbol | Variant_Classification | IMPACT | Tumor_Sample_Barcode | Protein_position | Existing_variation | DISTANCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | chr5 | 1295113 | 1295113 | G | A | TERT | 5'Flank | MODIFIER | BS_KAZENYZE | NA | rs1242535815,COSM1716563,COSM1716558 | 66 |
| 2 | chr5 | 1299855 | 1299855 | A | C | TERT | 5'Flank | MODIFIER | BS_1RF75MK2 | NA | NA | 4808 |
| 3 | chr5 | 1299236 | 1299237 | - | C | TERT | 5'Flank | MODIFIER | BS_S0T3CQ97 | NA | NA | 4189 |
| 4 | chr5 | 1295088 | 1295088 | A | C | TERT | 5'Flank | MODIFIER | BS_8Q8CAY84 | NA | NA | 41 |
| 5 | chr5 | 1299748 | 1299748 | T | C | TERT | 5'Flank | MODIFIER | BS_4QFSH7C4 | NA | NA | 4701 |
| 6 | chr5 | 1295677 | 1295677 | T | G | TERT | 5'Flank | MODIFIER | BS_4ZKN0WGS | NA | NA | 630 |
| 7 | chr5 | 1295442 | 1295442 | C | T | TERT | 5'Flank | MODIFIER | BS_02YBZSBY | NA | NA | 395 |
| 8 | chr5 | 1295135 | 1295135 | G | A | TERT | 5'Flank | MODIFIER | BS_F8K4VQMF | NA | COSM1716559 | 88 |
| 9 | chr5 | 1295997 | 1295997 | A | C | TERT | 5'Flank | MODIFIER | BS_WH8KWW5J | NA | NA | 950 |
| 10 | chr5 | 1298925 | 1298925 | A | C | TERT | 5'Flank | MODIFIER | BS_VW4XN9Y7 | NA | NA | 3878 |
| 11 | chr5 | 1295113 | 1295113 | G | A | TERT | 5'Flank | MODIFIER | BS_1S2BHJ8K | NA | rs1242535815,COSM1716563,COSM1716558 | 66 |
| 12 | chr5 | 1295146 | 1295146 | A | C | TERT | 5'Flank | MODIFIER | BS_BM95DGCQ | NA | NA | 99 |
| 13 | chr5 | 1295407 | 1295407 | A | C | TERT | 5'Flank | MODIFIER | BS_9ZFXXJPK | NA | NA | 360 |
| 14 | chr5 | 1295113 | 1295113 | G | A | TERT | 5'Flank | MODIFIER | BS_BFDEZK1C | NA | rs1242535815,COSM1716563,COSM1716558 | 66 |
| 15 | chr5 | 1297846 | 1297846 | G | T | TERT | 5'Flank | MODIFIER | BS_VF099E8S | NA | NA | 2799 |
| 16 | chr5 | 1296053 | 1296053 | C | A | TERT | 5'Flank | MODIFIER | BS_0FQKT8EY | NA | NA | 1006 |
| 17 | chr5 | 1295113 | 1295113 | G | A | TERT | 5'Flank | MODIFIER | BS_JSNJZERZ | NA | rs1242535815,COSM1716563,COSM1716558 | 66 |
| 18 | chr5 | 1295113 | 1295113 | G | A | TERT | 5'Flank | MODIFIER | BS_T7WMJ08W | NA | rs1242535815,COSM1716563,COSM1716558 | 66 |
| 19 | chr5 | 1295136 | 1295136 | A | C | TERT | 5'Flank | MODIFIER | BS_QX754ADQ | NA | NA | 89 |
| 20 | chr5 | 1295113 | 1295113 | G | A | TERT | 5'Flank | MODIFIER | BS_MJJZJMTK | NA | rs1242535815,COSM1716563,COSM1716558 | 66 |
| 21 | chr5 | 1295113 | 1295113 | G | A | TERT | 5'Flank | MODIFIER | BS_SK4H5MJQ | NA | rs1242535815,COSM1716563,COSM1716558 | 66 |
| 22 | chr5 | 1295112 | 1295112 | A | C | TERT | 5'Flank | MODIFIER | BS_K3PPH522 | NA | NA | 65 |
| 23 | chr5 | 1295113 | 1295113 | G | A | TERT | 5'Flank | MODIFIER | BS_KAD49R68 | NA | rs1242535815,COSM1716563,COSM1716558 | 66 |
| 24 | chr5 | 1298190 | 1298190 | G | A | TERT | 5'Flank | MODIFIER | BS_AF5D41PD | NA | rs929384767 | 3143 |

kgaonkar6 commented 3 years ago

We still want to filter by IMPACT == 'HIGH|MODERATE|MODIFIER' to remove any LOW impact mutations (like silent mutations) at a given amino acid position in the hotspot database, right?

jharenza commented 3 years ago

> We still want to filter by IMPACT == 'HIGH|MODERATE|MODIFIER' to remove any LOW impact mutations (like silent mutations) in the given amino acid position in hotspot database, right?

Are you saying there are low impact mutations on the MSK list? I would assume they would not be low.

jharenza commented 3 years ago

> I believe chr5 | 1295113 | 1295113, which is annotated with existing_variant rs1242535815,COSM1716563,COSM1716558 and is 66 bp away from the TSS, is what we are looking for as C228T.
>
> And chr5 | 1295135 | 1295135, which is 88 bp away from the TSS, is the COSM1716559 variant that corresponds to the C250T promoter variant.

This looks right to me. The nucleotides are complemented (G>A in the MAF rather than C>T) because TERT is on the reverse strand. So, I think we should match on the genomic coordinates here plus the nucleotides.
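
A minimal sketch of matching on genomic coordinates plus nucleotides, assuming MAF rows are available as dictionaries keyed by the column names shown in the Strelka2 output earlier in this thread. The coordinates and alleles are the ones identified above. The function name `find_tert_promoter_hits` and the lookup table are hypothetical.

```python
# Sites and alleles as discussed in this thread (forward strand, G>A).
TERT_PROMOTER_SITES = {
    ("chr5", 1295113, "G", "A"): "C228T",
    ("chr5", 1295135, "G", "A"): "C250T",
}


def find_tert_promoter_hits(maf_rows):
    """Return (sample, label) pairs for rows matching a promoter hotspot."""
    hits = []
    for row in maf_rows:
        key = (
            row["Chromosome"],
            int(row["Start_Position"]),
            row["Reference_Allele"],
            row["Tumor_Seq_Allele2"],
        )
        label = TERT_PROMOTER_SITES.get(key)
        if label is not None:
            hits.append((row["Tumor_Sample_Barcode"], label))
    return hits


# Three rows copied from the Strelka2 table in this thread.
rows = [
    {"Chromosome": "chr5", "Start_Position": "1295113", "Reference_Allele": "G",
     "Tumor_Seq_Allele2": "A", "Tumor_Sample_Barcode": "BS_KAZENYZE"},
    {"Chromosome": "chr5", "Start_Position": "1295135", "Reference_Allele": "G",
     "Tumor_Seq_Allele2": "A", "Tumor_Sample_Barcode": "BS_F8K4VQMF"},
    {"Chromosome": "chr5", "Start_Position": "1295136", "Reference_Allele": "A",
     "Tumor_Seq_Allele2": "C", "Tumor_Sample_Barcode": "BS_QX754ADQ"},
]
print(find_tert_promoter_hits(rows))
```

Requiring the alleles as well as the position means a neighbouring or different substitution, such as the A>C call at 1295136, is not picked up by accident.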

kgaonkar6 commented 3 years ago

> > We still want to filter by IMPACT == 'HIGH|MODERATE|MODIFIER' to remove any LOW impact mutations (like silent mutations) in the given amino acid position in hotspot database, right?
>
> Are you saying there are low impact mutations on the MSK list? I would assume they would not be low.

There were a few instances where a hotspot amino acid site had a silent mutation. For example, position 644 in SDHA is a hotspot in the MSKCC list, but if we have p.V644= in our dataset we should remove it, right? We would keep a call only if it is actually a higher-impact mutation at the hotspot, like p.V644M.
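
A rough sketch of the filter being discussed, assuming hotspots are keyed by (gene, amino acid position) and that each MAF row carries `Hugo_Symbol`, `Protein_position`, and `IMPACT`. The helper names are hypothetical. The two example rows are illustrative and use the impact levels VEP normally assigns (LOW for synonymous, MODERATE for missense). A set membership test is used here in place of the `'HIGH|MODERATE|MODIFIER'` pattern.

```python
import re

KEPT_IMPACTS = {"HIGH", "MODERATE", "MODIFIER"}


def first_position(protein_position):
    """Extract the first integer from a Protein_position value, else None."""
    match = re.search(r"\d+", protein_position or "")
    return int(match.group()) if match else None


def filter_hotspot_calls(rows, hotspot_sites):
    """Keep rows at a hotspot site whose IMPACT is not LOW."""
    kept = []
    for row in rows:
        site = (row["Hugo_Symbol"], first_position(row["Protein_position"]))
        if site in hotspot_sites and row["IMPACT"] in KEPT_IMPACTS:
            kept.append(row)
    return kept


hotspot_sites = {("SDHA", 644)}
rows = [
    # Silent change at the hotspot position: dropped.
    {"Hugo_Symbol": "SDHA", "Protein_position": "644",
     "HGVSp_Short": "p.V644=", "IMPACT": "LOW"},
    # Missense change at the same position: kept.
    {"Hugo_Symbol": "SDHA", "Protein_position": "644",
     "HGVSp_Short": "p.V644M", "IMPACT": "MODERATE"},
]
kept = filter_hotspot_calls(rows, hotspot_sites)
print([row["HGVSp_Short"] for row in kept])
```

MODIFIER stays in the kept set so that non-coding sites such as the TERT promoter calls (5'Flank, MODIFIER) are not filtered out, although those rows have no protein position and would need the coordinate-based match instead.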

jharenza commented 3 years ago

Closed with #819