Mutect2 and VarScan2 Filter Assessments

Kcjohnson commented 6 years ago

In the GLASS-WG cohort there are several samples (TCGA GBM/LGG) that have been analyzed by other variant calling pipelines, including multiple callers in the Pan-Cancer Analysis of Whole Genomes (PCAWG) analysis. It would be helpful to benchmark the GLASS-WG Mutect2/VarScan2 calls against these extant data. Additionally, these analyses may also assist in deriving our own Mutect2/VarScan2 consensus calls for the entire GLASS-WG cohort.

SNVs from TCGA samples:

2016 PCAWG consensus
FreeBayes (SpeedSeq, in house)
Mutect2 (in house)
VarScan2 (in house)

To make this analysis generalizable, we are going to run these commands through Snakemake and R to generate reports on variant call overlaps:

[x] Put all common files in shared /projects/verhaak-lab/life_history/, including a revised/descriptive vcf file name and file maps.
[ ] Provide Floris CombineVariants and bcftools commands to link vcf files.
[x] Make sure that only PASS filtered calls are being merged for each caller.
[x] If reasonable, create snakemake pipeline for variant merging in the same samples.
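As a rough illustration of the "PASS only" item above, here is a minimal Python sketch that reads VCF text and keeps only records whose FILTER column is `PASS`, keyed by (chrom, pos, ref, alt) so that callers can be compared as sets. The function name `pass_variant_keys`, the in-memory example records, and the `some_filter` label are all hypothetical and are not part of the GLASS-WG pipeline. It also assumes indels have already been normalized, since unnormalized indels from different callers may not match on these keys.

```python
import io


def pass_variant_keys(vcf_lines):
    """Return the set of (chrom, pos, ref, alt) keys for PASS records."""
    keys = set()
    for line in vcf_lines:
        # Skip blank lines and both kinds of header lines (## and #CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, filt = fields[:7]
        if filt != "PASS":
            continue
        # Split multi-allelic records so each ALT allele is compared on its own.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


# Hypothetical records for illustration only; "some_filter" is a made-up label.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tT\t.\tPASS\t.\n"
    "1\t200\t.\tG\tC\t.\tsome_filter\t.\n"
    "2\t300\t.\tC\tA,G\t.\tPASS\t.\n"
)
pass_keys = pass_variant_keys(example_vcf)
print(sorted(pass_keys))
```

For real files the same function could be fed an open file handle (or `gzip.open(path, "rt")` for compressed VCFs) instead of the `StringIO` example.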

Kcjohnson commented 6 years ago

The PCAWG dataset includes 10 samples that are also in the GLASS-WG dataset. The PCAWG vcf files (INDELs and SNV_MNVs are stored separately) were generated by combining calls made by at least 2 of the following callers: Broad, Sanger, DKFZ, and MuSE. For the purposes of this project, we can treat these samples as a truth set against which to compare our in-house Mutect2 and VarScan2 results. We also intend to use these results to recalibrate the callers depending on SNV overlap.

While I still need to generate precision/accuracy estimates for VarScan2 and Mutect2 with these data, I was able to generate a few Venn diagrams of the overlaps between Mutect2, VarScan2, and PCAWG. Overall, these results indicate that the callers are performing fairly well. We could relax the VarScan2 requirements to increase its sensitivity.

Attachments: TCGA-06-0190-R1-SNVMNV_Venn.pdf, TCGA-14-1034-R2-SNVMNV_Venn.pdf
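For the precision estimates mentioned above, one possible approach is to treat the PCAWG consensus as the truth set and compute precision and recall per caller from set overlaps. The sketch below assumes each callset has already been reduced to a set of (chrom, pos, ref, alt) keys; the function name `benchmark_against_truth` and the toy variants are hypothetical. Since the PCAWG consensus is itself imperfect, a "false positive" here only means "not present in PCAWG".

```python
def benchmark_against_truth(calls, truth):
    """Compare one caller's variant keys against a truth set."""
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but missed by the caller
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy variant keys for illustration only (not real GLASS-WG or PCAWG calls).
pcawg = {("1", 100, "A", "T"), ("1", 200, "G", "C"),
         ("2", 300, "C", "A"), ("3", 400, "T", "G")}
mutect2 = {("1", 100, "A", "T"), ("1", 200, "G", "C"),
           ("2", 300, "C", "A"), ("5", 900, "G", "A")}
varscan2 = {("1", 100, "A", "T"), ("1", 200, "G", "C")}

mutect2_stats = benchmark_against_truth(mutect2, pcawg)
varscan2_stats = benchmark_against_truth(varscan2, pcawg)
print("Mutect2:", mutect2_stats)
print("VarScan2:", varscan2_stats)
```

In this toy example the second caller has no calls outside the truth set but misses half of it, which is the kind of pattern that would motivate relaxing its filters.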

For the rest of the dataset, which does not have PCAWG calls for comparison, we can examine the consensus calls between Mutect2 and VarScan2. For the low-pass cohorts (HF and MD-LP), very few SNVs were called. For HK and TCGA, the main observation was that VarScan2 calls were more conservative. The final callsets for the GLASS-WG data will incorporate the consensus calls between Mutect2 and VarScan2. Thus, we can relax the criteria on VarScan2 in order to recover more true positives.
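For samples without PCAWG calls, the two-caller comparison can be summarized per sample as shared and caller-specific counts plus a Jaccard index. This is only a sketch: the function name `consensus_summary`, the sample label, and the toy variants are hypothetical, and each callset is assumed to be a set of PASS-filtered (chrom, pos, ref, alt) keys.

```python
def consensus_summary(mutect2, varscan2):
    """Summarize the overlap between two callers for one sample."""
    shared = mutect2 & varscan2
    union = mutect2 | varscan2
    return {
        "shared": len(shared),
        "mutect2_only": len(mutect2 - varscan2),
        "varscan2_only": len(varscan2 - mutect2),
        # Jaccard index: shared calls as a fraction of all calls from either caller.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy per-sample callsets for illustration only (hypothetical sample label).
callsets = {
    "sample_A": {
        "mutect2": {("1", 100, "A", "T"), ("1", 200, "G", "C"),
                    ("2", 300, "C", "A"), ("3", 400, "T", "G")},
        "varscan2": {("1", 100, "A", "T"), ("1", 200, "G", "C")},
    },
}

summaries = {
    sample: consensus_summary(calls["mutect2"], calls["varscan2"])
    for sample, calls in callsets.items()
}
for sample, summary in summaries.items():
    print(sample, summary)
```

A low `varscan2_only` count combined with a high `mutect2_only` count would be consistent with VarScan2 being the more conservative caller, though it cannot by itself show which caller-specific calls are true positives.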

Figures: hk-consensus-calls-zoomed, tcga-consensus-calls, mda-lp-consensus-calls