fpbarthel closed this issue 6 years ago
Comparison of TCGA SNP6 vs GATK WGS-based calls looks very good visually. There are n=44 samples with both callsets available. The GATK WGS-based calls appear to be of higher resolution: the numbers of breakpoints and segments per sample are much higher than for SNP6. In fact, the WGS-based calls contain roughly 10x as many segments as the SNP6 calls (~60k vs ~6k segments).
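For anyone who wants to reproduce the segment counts, here is a minimal sketch of how the comparison could be tallied. It assumes both callsets are exported as tab-delimited SEG-style text with an `ID` column holding the sample ID (the IGV SEG layout); the function names and the toy rows are made up for illustration and are not the actual pipeline output.

```python
import csv
import io
from collections import Counter


def segments_per_sample(seg_text, sample_col="ID"):
    """Count segments (rows) per sample in tab-delimited SEG-style text."""
    reader = csv.DictReader(io.StringIO(seg_text), delimiter="\t")
    return Counter(row[sample_col] for row in reader)


def fold_change(wgs_counts, snp6_counts):
    """Ratio of total WGS to total SNP6 segments, using shared samples only."""
    shared = sorted(set(wgs_counts) & set(snp6_counts))
    if not shared:
        raise ValueError("no samples are present in both callsets")
    wgs_total = sum(wgs_counts[s] for s in shared)
    snp6_total = sum(snp6_counts[s] for s in shared)
    return shared, wgs_total / snp6_total


# Toy, made-up rows (not real TCGA data); column names are an assumption.
HEADER = "ID\tchrom\tloc.start\tloc.end\tnum.mark\tseg.mean\n"
SNP6 = HEADER + "S1\t1\t1\t1000\t10\t0.01\n" + "S2\t1\t1\t1000\t10\t-0.5\n"
WGS = (
    HEADER
    + "S1\t1\t1\t300\t3\t0.02\n"
    + "S1\t1\t301\t700\t4\t-0.4\n"
    + "S1\t1\t701\t1000\t3\t0.0\n"
    + "S2\t1\t1\t1000\t10\t-0.5\n"
    + "S3\t1\t1\t1000\t10\t0.1\n"
)

shared, ratio = fold_change(segments_per_sample(WGS), segments_per_sample(SNP6))
print(shared, ratio)
```

Restricting to shared samples matters here because only the samples with both callsets available are comparable; the unmatched toy sample `S3` is ignored.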
In this IGV screenshot you can visually compare the resolution of SNP6-based calls (top n=44) versus WGS-based calls (bottom n=44). Samples are sorted by ID within each call set, so the top row of SNP6 calls is the same sample as the top row of GATK calls. Click on the screenshot to zoom.
Here I sorted all tracks by sample ID instead, so each consecutive pair of rows is one sample. The top row of every pair is WGS-based and the bottom row is SNP6-based.
The copy number value density is somewhat different, with the WGS-based data having a much larger left-sided tail. Nevertheless, both distributions look nicely centered around zero.
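The two observations above (center near zero, heavier left tail in WGS) could be put into numbers with something like the sketch below. It assumes the copy number values are log2 copy ratios, one per segment; the `tail_cutoff` of -1.0 is an arbitrary illustrative threshold, and the values are made up rather than taken from the callsets.

```python
import statistics


def summarize(values, tail_cutoff=-1.0):
    """Return the median and the fraction of values below tail_cutoff."""
    left_tail = sum(v < tail_cutoff for v in values) / len(values)
    return {
        "median": statistics.median(values),
        "left_tail_fraction": left_tail,
    }


# Made-up per-segment values, assumed to be log2 copy ratios.
snp6_values = [-0.6, -0.1, 0.0, 0.0, 0.1, 0.4]
wgs_values = [-3.0, -1.5, -0.1, 0.0, 0.0, 0.1, 0.1, 0.5]

snp6_summary = summarize(snp6_values)
wgs_summary = summarize(wgs_values)
print(snp6_summary)
print(wgs_summary)
```

Note that this treats every segment equally. Since the WGS calls have many more (and shorter) segments, weighting by segment length might give a fairer comparison of the two densities.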
@Kcjohnson @sbamin it seems like GATK calls work very well for TCGA whole genome. Would you agree?
Amongst the low-pass CN calls available on Firehose, there is only one sample that overlaps: TCGA-DU-5872-TP. Eyeballing the copy number profile, it looks very similar to the GATK and SNP6 calls.
Based on these findings I conclude that GATK CNV calls are accurate and of high resolution, seemingly more precise than SNP6 and low-pass calls.
It is still not clear how GATK CNV calls are impacted by the choice of panel of normals (PON).
CNV results (GATK pipeline) are incoming and need to be QC'ed.