fpbarthel / GLASS

GLASS consortium
MIT License

To-Do: CNV analysis #13

Closed fpbarthel closed 6 years ago

fpbarthel commented 6 years ago

CNV results (GATK pipeline) are incoming and need to be QC'ed.

fpbarthel commented 6 years ago

A visual comparison of the TCGA SNP6 calls against the GATK WGS-based calls looks very good. There are n=44 samples with both callsets available. The GATK WGS-based calls appear to be of higher resolution, with many more breakpoints and segments per sample. In fact, the WGS-based calls contain roughly 10x as many segments as the SNP6 calls (~60k vs ~6k segments).
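The segment-count comparison can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline used here: it assumes both callsets are available as tab-separated SEG-style text with a `Sample` column, and the function names and toy data are hypothetical.

```python
import csv
import io
from collections import Counter

def segments_per_sample(seg_text, sample_col="Sample"):
    """Count segments (rows) per sample in tab-separated SEG-style text."""
    reader = csv.DictReader(io.StringIO(seg_text), delimiter="\t")
    return Counter(row[sample_col] for row in reader)

def segment_count_ratio(counts_a, counts_b):
    """Per-sample ratio of segment counts for samples present in both callsets."""
    shared = sorted(set(counts_a) & set(counts_b))
    return {sample: counts_a[sample] / counts_b[sample] for sample in shared}

# Toy data for illustration only (not GLASS/TCGA results).
HEADER = "Sample\tChromosome\tStart\tEnd\tSegment_Mean"
WGS_SEG = "\n".join([
    HEADER,
    "S1\t1\t1\t100\t0.0",
    "S1\t1\t101\t150\t-0.8",
    "S1\t1\t151\t300\t0.1",
    "S1\t2\t1\t400\t0.5",
    "S2\t1\t1\t300\t0.0",
    "S2\t2\t1\t400\t0.0",
])
SNP6_SEG = "\n".join([
    HEADER,
    "S1\t1\t1\t300\t0.0",
    "S1\t2\t1\t400\t0.5",
    "S3\t1\t1\t300\t0.0",
])

wgs_counts = segments_per_sample(WGS_SEG)
snp6_counts = segments_per_sample(SNP6_SEG)
ratios = segment_count_ratio(wgs_counts, snp6_counts)
print(ratios)
```

Only samples present in both callsets contribute to the ratio, which mirrors restricting the comparison to the overlapping samples.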

This IGV screenshot compares the resolution of the SNP6-based calls (top, n=44) with the WGS-based calls (bottom, n=44). Samples are sorted by ID within each callset, so the top row of SNP6 calls is the same sample as the top row of GATK calls.

igv_snapshot_2

Here I sorted all rows by sample ID, so each consecutive pair of rows is one sample. The top row of every pair is WGS-based and the bottom row is SNP6-based.

igv_snapshot_1

The density of copy number values differs somewhat between the two callsets: the WGS-based data have a much larger left-sided tail. Nevertheless, both distributions look nicely centered around zero.

image
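The two observations about the density (centered around zero, heavier left tail) could also be checked numerically. The sketch below is an assumption-laden illustration: it treats the copy number values as a plain list of log2 ratios, and the `tail_cutoff` of -1.0 is an arbitrary, hypothetical threshold rather than one used in this analysis.

```python
from statistics import median

def summarize_log2_ratios(values, tail_cutoff=-1.0):
    """Return the median and the fraction of values below tail_cutoff."""
    values = list(values)
    if not values:
        raise ValueError("no copy number values given")
    left_tail = sum(v < tail_cutoff for v in values) / len(values)
    return {"median": median(values), "left_tail_fraction": left_tail}

# Toy values for illustration only (not GLASS/TCGA results).
wgs_like = [-2.0, -0.1, 0.0, 0.1, 0.2]
snp6_like = [-0.2, -0.1, 0.0, 0.1, 0.2]

print(summarize_log2_ratios(wgs_like))
print(summarize_log2_ratios(snp6_like))
```

A median near zero with a larger `left_tail_fraction` for one callset would correspond to the pattern described above.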

@Kcjohnson @sbamin it seems like GATK calls work very well for TCGA whole-genome data. Would you agree?

fpbarthel commented 6 years ago

Among the low-pass CN calls available on Firehose, only one sample overlaps: TCGA-DU-5872-TP. Eyeballing its copy number profile, it looks very similar to the GATK and SNP6 calls.
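A by-eye comparison of two profiles could be backed up with a simple concordance number. The following is a rough sketch, not what was done here: it assumes each profile is a list of hypothetical `(start, end, log2_ratio)` tuples on a single chromosome, samples both at shared positions, and computes a Pearson correlation.

```python
from math import sqrt

def value_at(segments, pos):
    """Return the log2 ratio of the segment covering pos, or None if uncovered."""
    for start, end, value in segments:
        if start <= pos <= end:
            return value
    return None

def profile_correlation(segs_a, segs_b, positions):
    """Pearson correlation of two profiles sampled at the given positions."""
    pairs = [(value_at(segs_a, p), value_at(segs_b, p)) for p in positions]
    pairs = [(a, b) for a, b in pairs if a is not None and b is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two positions covered by both profiles")
    xs, ys = zip(*pairs)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a flat profile")
    return cov / sqrt(var_x * var_y)

# Toy profiles for illustration only (not GLASS/TCGA results).
profile_a = [(1, 100, 0.0), (101, 200, -1.0)]
profile_b = [(1, 90, 0.1), (91, 200, -0.9)]
r = profile_correlation(profile_a, profile_b, range(10, 200, 20))
print(round(r, 3))
```

Sampling at fixed positions sidesteps the fact that the callsets have different breakpoints, at the cost of ignoring events smaller than the sampling step.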

Based on these findings, I conclude that GATK CNV calls are accurate and of high resolution, seemingly more precise than the SNP6 and low-pass calls.

It is still unclear how the choice of PON (panel of normals) affects the GATK CNV calls.