Closed Kcjohnson closed 6 years ago
The PCAWG dataset includes 10 samples that are also in the GLASS-WG dataset. The PCAWG VCF files (indels and SNVs/MNVs are stored separately) were generated by combining calls supported by at least 2 of the following callers: Broad, Sanger, DKFZ, and MuSE. For the purposes of this project, we can treat these samples as a truth set against which to compare our in-house Mutect2 and VarScan2 results. We also intend to use these results to recalibrate the callers depending on SNV overlap.
While I still need to generate precision/accuracy estimates for VarScan2 and Mutect2 with these data, I was able to generate a few Venn diagrams of the overlaps between Mutect2, VarScan2, and PCAWG. Overall, these results indicate that the callers are performing fairly well. We could relax the VarScan2 requirements to increase its sensitivity. Attachments: TCGA-06-0190-R1-SNVMNV_Venn.pdf, TCGA-14-1034-R2-SNVMNV_Venn.pdf
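As a rough illustration of the comparison described here, the following is a minimal sketch of how the Venn region counts and the precision/recall of each caller against the PCAWG calls could be computed. It assumes each callset has already been loaded into a Python set of `(chrom, pos, ref, alt)` tuples. The function names and the toy variants are hypothetical and are not taken from the GLASS pipeline.

```python
from typing import Dict, FrozenSet, Set, Tuple

# A variant is keyed by (chrom, pos, ref, alt); this keying is an assumption.
Variant = Tuple[str, int, str, str]


def venn_regions(callsets: Dict[str, Set[Variant]]) -> Dict[FrozenSet[str], int]:
    """Count variants in each exclusive Venn region.

    Each region is keyed by the frozenset of callset names that contain
    the variant, so every variant is counted exactly once.
    """
    membership: Dict[Variant, Set[str]] = {}
    for name, variants in callsets.items():
        for variant in variants:
            membership.setdefault(variant, set()).add(name)

    counts: Dict[FrozenSet[str], int] = {}
    for names in membership.values():
        key = frozenset(names)
        counts[key] = counts.get(key, 0) + 1
    return counts


def precision_recall(calls: Set[Variant], truth: Set[Variant]) -> Tuple[float, float]:
    """Return (precision, recall) of `calls`, treating `truth` as the truth set."""
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else float("nan")
    recall = true_positives / len(truth) if truth else float("nan")
    return precision, recall


# Toy data for illustration only (not real calls).
pcawg = {("1", 100, "A", "T"), ("1", 200, "G", "C"), ("2", 300, "C", "T")}
mutect2 = {("1", 100, "A", "T"), ("1", 200, "G", "C"), ("3", 400, "T", "G")}
varscan2 = {("1", 100, "A", "T")}

for region, n in venn_regions(
    {"PCAWG": pcawg, "Mutect2": mutect2, "VarScan2": varscan2}
).items():
    print(sorted(region), n)
print("Mutect2:", precision_recall(mutect2, pcawg))
print("VarScan2:", precision_recall(varscan2, pcawg))
```

Because PCAWG is itself a consensus of several callers, a call missing from it is not necessarily a false positive, so the precision computed this way is probably best read as a lower bound.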
For the rest of the dataset, which does not have PCAWG calls for comparison, we can examine the consensus calls between Mutect2 and VarScan2. For the low-pass cohorts (HF and MD-LP), very few SNVs are called. For HK and TCGA, the main observation was that VarScan2 was the more conservative caller. The final callsets for the GLASS-WG data will incorporate the consensus calls between Mutect2 and VarScan2. Thus, we can relax the criteria on VarScan2 in order to recover more true positives.
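A minimal sketch of how such a consensus callset could be derived from two VCFs is shown below. It assumes plain-text, tab-delimited VCF records, keeps only records whose FILTER column is `PASS` or `.` (an assumption about how filtered calls should be treated), and splits multi-allelic ALT fields into separate variants. The helper names are hypothetical.

```python
from typing import Iterable, Set, Tuple

# A variant is keyed by (chrom, pos, ref, alt); this keying is an assumption.
Variant = Tuple[str, int, str, str]


def read_variants(lines: Iterable[str], pass_only: bool = True) -> Set[Variant]:
    """Collect (chrom, pos, ref, alt) keys from the lines of a VCF file."""
    variants: Set[Variant] = set()
    for line in lines:
        # Skip blank lines and header lines (both '##' meta and '#CHROM').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
        # Assumption: only unfiltered records count as calls.
        if pass_only and flt not in ("PASS", "."):
            continue
        # Multi-allelic sites list several ALT alleles separated by commas.
        for allele in alt.split(","):
            variants.add((chrom, int(pos), ref, allele))
    return variants


def consensus(calls_a: Set[Variant], calls_b: Set[Variant]) -> Set[Variant]:
    """Variants reported by both callers."""
    return calls_a & calls_b
```

With two open file handles, `consensus(read_variants(mutect2_fh), read_variants(varscan2_fh))` would give the shared calls. Exact matching on REF/ALT works reasonably for SNVs, but indels would likely need to be left-aligned and normalized first (for example with `bcftools norm`) so that equivalent representations compare equal.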
Update: Examples of filters applied across all cohorts and the number of variants (SNVs and indels) called by Mutect2:
We still need to determine how/whether to merge Mutect2 results with VarScan2.
We need to compare this callset to newer calls generated with a more restrictive cohort-wide panel of normals (PON).
Closing this issue for now, since 1) we concluded that the M2 filters seem to be appropriate and 2) we are going single-caller for the sake of time.
In the GLASS-WG cohort there are several samples (TCGA GBM/LGG) that have been analyzed by other variant calling pipelines, including multiple callers in the Pan-Cancer Analysis of Whole Genomes (PCAWG) analysis. It would be helpful to benchmark the GLASS-WG Mutect2/VarScan2 calls against these extant data. Additionally, these analyses may also assist in deriving our own Mutect2/VarScan2 consensus calls for the entire GLASS-WG cohort.
SNVs from TCGA samples:
To make this analysis generalizable, we are going to run these commands through Snakemake and R to generate reports of variant call overlaps: