Illumina / strelka

Strelka2 germline and somatic small variant caller
GNU General Public License v3.0
357 stars 103 forks source link

What is the correct input bam for Strelka2 #58

Open BilyanaStoilova opened 6 years ago

BilyanaStoilova commented 6 years ago

Hi, I ran Strelka2 with the default parameters for WES and after running Manta: configureStrelkaSomaticWorkflow.py \ --normalBam CM0403_1-control.md.bam \ --tumorBam CM0403_1-tumour.md.bam \ --referenceFasta Homo_sapiens_assembly38.fasta \ --indelCandidates ./Manta_CM0403_1/results/variants/candidateSmallIndels.vcf.gz \ --callRegions SureSelect_Human_All_Exon_V6_r2/S07604514_Padded_full_hg38_sorted.bed.gz \ --exome \ --runDir Strelka_CM0403_1

./Strelka_CM0403_1/runWorkflow.py -m local -j 8

I first ran it on bam files without marking duplicated and without GATK BQSR but then thought that I should probably run it on bam files after marking duplicates without BQSR. I checked for two SNVs I know are present in the sample (they were previously called by targeted sequencing) and I got different calling on one of them (DNMT3A) when using bam files before or after mark duplicates: Before mark duplicates: DNMT3A chr2 25234373 . C T . PASS SOMATIC;QSS=17;TQSS=1;NT=ref;QSS_NT=17;TQSS_NT=1;SGT=CC->CT;DP=124;MQ=60.00;MQ0=0;ReadPosRankSum=-1.75;SNVSB=0.00;SomaticEVS=8.19 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 7:0:0:0:0,0:7,7:0,0:0,0 117:0:0:0:0,0:73,73:0,0:44,44 IDH2 chr15 90088702 . C T . PASS SOMATIC;QSS=15;TQSS=2;NT=ref;QSS_NT=15;TQSS_NT=2;SGT=CC->CT;DP=179;MQ=60.00;MQ0=0;ReadPosRankSum=-0.08;SNVSB=0.00;SomaticEVS=8.91 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 8:0:0:0:0,0:8,8:0,0:0,0 171:7:0:0:0,0:98,98:0,0:66,68

After mark duplicates: DNMT3A chr2 25234373 . C T . LowEVS SOMATIC;QSS=16;TQSS=1;NT=ref;QSS_NT=16;TQSS_NT=1;SGT=CC->CT;DP=97;MQ=60.00;MQ0=0;ReadPosRankSum=-1.46;SNVSB=0.00;SomaticEVS=5.45 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 7:0:0:0:0,0:7,7:0,0:0,0 90:0:0:0:0,0:57,57:0,0:33,33 IDH2 chr15 90088702 . C T . PASS SOMATIC;QSS=13;TQSS=1;NT=ref;QSS_NT=13;TQSS_NT=1;SGT=CC->CT;DP=140;MQ=60.00;MQ0=0;ReadPosRankSum=-0.22;SNVSB=0.00;SomaticEVS=7.78 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 8:0:0:0:0,0:8,8:0,0:0,0 132:0:0:0:0,0:78,78:0,0:54,54

So I get different filters for DNMT3A SNV based on the input bam. To my understanding the reason DNMT3A was assigned LowEVS filter starting from the mark duplicates bam is because of the lower SomaticEVS assigned to it (5.45) and by inspecting the variants which got the PASS filter the SomaticEVS value should be >7 to get a PASS filter. To my understanding the SomaticEVS value is assigned based on random forest. Is the SomaticEVS filter value too stringent? How was the exact value selected? You can see the number of reference and alternative reads for DNMT3A using mark duplicates bam and in my opinion this is a totally legitimate SNV. If I am to use the mark duplicates bam do you think I can take the PASS filter + all variants with LowEVS filter where the SomaticEVS value is, for example, higher than 5? I would really appreciate your advice on this.

ctsa commented 6 years ago

Thanks for the detailed report. You should always have PCR duplicates either marked or filtered before running Strelka. If PCR duplicates are marked, these reads will be ignored by the method. BQSR is not recommended simply because it hasn't been tested recently and has never been shown to provide an improvement when tested. If your BAMs have already gone through BQSR we don't have any evidence that this would hurt.

For the specific case you show above, the issue with both of these calls is the low depth in the normal sample. For the first example, the depth is 7 in the normal sample -- too low to have sufficient confidence that the observed variant is not, in fact, a heterozygous C/T germline SNV rather than a somatic variant.

If you're stuck with the normal depth that you currently have, and you're willing to accept a higher rate of somatic false positive calls that are actually germline variants, you could refilter the somatic VCF file to use a different threshold on the SomaticEVS score. We provide a convenience script for this purpose here:

https://github.com/Illumina/strelka/blob/master/scratch/util/reFilterSomaticVcf.py

BilyanaStoilova commented 6 years ago

Thank you very much for the prompt reply. I do agree with what you wrote that Strelka is probably less confident when there are only 7 reads in the normal sample. However, what confuses me is that there were also 7 reads in the bam before mark duplicates and then Strelka did not have any problem putting the SNV in the PASS filter. The difference between the pre and post mark duplicates is in the number of reads in the tumour, although the VAF in both cases is similar - 37.6% pre and 36.7% post. Also, for the IDH2 variant there were only 8 reads in the normal sample and both pre and post mark duplicates Strelka put the SNV into the PASS filter (VAF 40.2% pre and 40.9% post). That is why I do not see huge difference between the number of reads and VAFs for the two SNVs (DNMT3A and IDH2), however one of them got the LowEVS filter and the other PASS filter. Since I am currently stuck with this normal depth, I might decrease the SomaticEVS score to make the calls. What normal depth do you recommend for variant calling with Strelka? I have sent these sample to be sequenced at 30X and when I checked them after the sequencing they had indeed 30X average coverage but the coverage was very variable along the exome. Thank you again.

ctsa commented 6 years ago

In the 'before' case, the EVS scores are still low and close to the filtration threshold -- so what you're seeing are near-threshold PASS calls being pushed over the edge by a relatively small change to the EVS score itself. At slightly higher depths the somatic calls become more confident primarily because the likelihood of this being a germline het at 50%, but incidentally sampled to 37% becomes lower. Note this behavior of the model was not exactly intended (it does not account for LOH/copy number changes), but naturally falls out of the current model design.

Typically keeping the normal depth at about half of the tumor depth would be best for the purpose of detecting shared systematic error -- for simply detecting germline hets reliably you wouldn't need to keep scaling the normal depth with tumor depth though. Agree there are lots of special challenges to WES depth.

BilyanaStoilova commented 6 years ago

Thank you very much for the detailed reply, it is very helpful.

ggoodstudydaydayup commented 5 years ago

Hi, may I ask a small question ? After marking the PCR duplicate , shall I realign the bam or not ?

sangtaekim commented 5 years ago

@ggoodstudydaydayup Strelka realigns reads internally, so realigning the bam is not necessary.

ggoodstudydaydayup commented 5 years ago

Thanks you very much for your help.

sangtaekim commented 5 years ago

@ggoodstudydaydayup You're welcome.

mortunco commented 5 years ago

Hi, quick question, Is there a way to filter my input file just as strelka considers in their variant calling ? based on my previous issue #119 , I would like to investigate my reads in detailed, so wanna get those reads that are considered in internal strelka2 analysis. -Thanks!