To-Do: CNV analysis

fpbarthel commented 6 years ago

CNV results (GATK pipeline) are incoming and need to be QC'ed.

[x] Visually inspect CNV calls: How is the noise level? Were sex chromosomes excluded? Can we observe characteristic glioma changes (e.g. chr 7/10, 1p/19q, 19/20, EGFR, CDKN2A)? The GATK guides #11682 and #11683 include some example plots.
[x] Cross-compare the TCGA WGS segmentation produced here with the TCGA SNP6 segmentation (Firehose > Data > SNP6 CopyNum): breakpoint comparison, amplitude comparison, noise comparison, resolution comparison
[x] Cross-compare the TCGA WGS segmentation produced here with the TCGA low-pass segmentation (Firehose > Data > LowPass DNASeq CopyNum) using the same metrics as above
[x] Interpret the cross-comparison: which metric is best and why? Can we fine-tune parameters further?
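
A minimal sketch of how the breakpoint and amplitude comparisons could be computed for one sample. This is not the code used for this issue. The segment tuple layout, the function names, the tolerance, and the toy coordinates are all assumptions made for illustration, and the values are assumed to be log2 copy ratios.

```python
from bisect import bisect_left
from collections import defaultdict

# A segment is (chrom, start, end, log2_ratio). Layout and values are hypothetical.


def breakpoints(segments):
    """Return sorted interior segment boundaries per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end, _ in segments:
        by_chrom[chrom].append((start, end))
    result = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        # A breakpoint is taken as the start of each segment after the first.
        result[chrom] = [spans[i + 1][0] for i in range(len(spans) - 1)]
    return result


def breakpoint_concordance(query, reference, tolerance=10_000):
    """Fraction of query breakpoints with a reference breakpoint within `tolerance` bp."""
    query_bp, reference_bp = breakpoints(query), breakpoints(reference)
    matched = total = 0
    for chrom, positions in query_bp.items():
        ref_positions = reference_bp.get(chrom, [])
        for pos in positions:
            total += 1
            i = bisect_left(ref_positions, pos)
            # Only the nearest reference breakpoints on either side can match.
            neighbours = ref_positions[max(i - 1, 0):i + 1]
            if any(abs(pos - p) <= tolerance for p in neighbours):
                matched += 1
    return matched / total if total else float("nan")


def amplitude_difference(a, b):
    """Length-weighted mean absolute log2-ratio difference over shared regions."""
    weighted = covered = 0
    for chrom_a, start_a, end_a, value_a in a:
        for chrom_b, start_b, end_b, value_b in b:
            if chrom_a != chrom_b:
                continue
            overlap = min(end_a, end_b) - max(start_a, start_b)
            if overlap > 0:
                weighted += overlap * abs(value_a - value_b)
                covered += overlap
    return weighted / covered if covered else float("nan")


# Toy data: the finer call set splits a gain that the coarser one merges.
wgs = [("7", 0, 1000, 0.0), ("7", 1000, 2000, 0.5), ("7", 2000, 3000, 0.0)]
snp6 = [("7", 0, 1050, 0.0), ("7", 1050, 3000, 0.2)]

wgs_vs_snp6 = breakpoint_concordance(wgs, snp6, tolerance=100)   # 0.5
snp6_vs_wgs = breakpoint_concordance(snp6, wgs, tolerance=100)   # 1.0
mean_abs_diff = amplitude_difference(wgs, snp6)
```

Concordance is asymmetric by design: a higher-resolution call set can recover every breakpoint of a coarser one while the reverse is not true. The pairwise overlap loop is quadratic and is only meant for small examples.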

fpbarthel commented 6 years ago

Comparison of TCGA SNP6 vs GATK WGS-based calls looks very good visually. There are n=44 samples with both call sets available. The GATK WGS-based calls appear to be of higher resolution, with many more breakpoints and segments per sample. In fact, the WGS-based calls contain roughly 10x as many segments as the SNP6 calls (~60k vs ~6k segments).
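
A rough sketch of how the per-sample segment counts could be compared between the two call sets. The `Sample` column name, the row format (dicts, as `csv.DictReader` would yield from a SEG-like file), and the function names are assumptions, not the actual files or code used here.

```python
from collections import Counter
from statistics import median


def segments_per_sample(rows, sample_key="Sample"):
    """Count segments per sample; each row is a dict with a sample ID field."""
    return Counter(row[sample_key] for row in rows)


def median_fold_difference(counts_a, counts_b):
    """Median of per-sample count ratios (a / b) over samples present in both."""
    shared = sorted(set(counts_a) & set(counts_b))
    if not shared:
        raise ValueError("no samples shared between the two call sets")
    return median(counts_a[s] / counts_b[s] for s in shared)


# Toy rows standing in for parsed segment files (hypothetical sample IDs).
wgs_rows = [{"Sample": "S1"}] * 20 + [{"Sample": "S2"}] * 30
snp6_rows = [{"Sample": "S1"}] * 2 + [{"Sample": "S2"}] * 3 + [{"Sample": "S3"}] * 5

fold = median_fold_difference(
    segments_per_sample(wgs_rows), segments_per_sample(snp6_rows)
)
```

Restricting the ratio to shared samples keeps the comparison paired, so a sample present in only one call set does not skew the result.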

This IGV screenshot compares the resolution of SNP6-based calls (top, n=44) with WGS-based calls (bottom, n=44). Samples are sorted by ID within each call set, so the top row of SNP6 calls is the same sample as the top row of GATK calls.

igv_snapshot_2

Here I sorted samples from both call sets together by ID, so each pair of rows is a single sample. The top row of every pair is WGS-based and the bottom row is SNP6-based.

igv_snapshot_1

The copy number value density differs somewhat: the WGS-based data has a much larger left-sided tail. Nevertheless, both distributions are nicely centered around zero.
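
One way to put numbers on the center and the left tail of each distribution is sketched below. It assumes the values are log2 copy ratios. The `-0.5` threshold, the function name, and the toy values are illustrative assumptions, not figures from this analysis.

```python
from statistics import median


def tail_summary(log2_ratios, threshold=-0.5):
    """Summarise the center and the left tail of a set of copy number values."""
    values = list(log2_ratios)
    if not values:
        raise ValueError("no copy number values given")
    below = sum(v < threshold for v in values)
    return {"median": median(values), "fraction_below": below / len(values)}


# Toy values only: both sets are centered at zero, one has a heavier left tail.
wgs_summary = tail_summary([-2.0, -1.0, 0.0, 0.0, 0.1])
snp6_summary = tail_summary([-0.6, 0.0, 0.0, 0.1, 0.0])
```

This treats every segment equally. Weighting by segment length may be more appropriate when one call set contains many short segments.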

@Kcjohnson @sbamin it seems like GATK calls work very well for TCGA whole-genome data. Would you agree?

fpbarthel commented 6 years ago

Among the low-pass CN calls available on Firehose, only one sample overlaps: TCGA-DU-5872-TP. Eyeballing its copy number profile, it looks very similar to the GATK and SNP6 calls.

Based on these findings I conclude that GATK CNV calls are accurate and of high resolution, seemingly more precise than SNP6 and low-pass calls.

It is still not clear how GATK CNV calls are affected by the choice of panel of normals (PON).

fpbarthel / GLASS

To-Do: CNV analysis #13