CNV calls when comparing sample vs. itself

suhrig commented 6 years ago

Hi Eric,

When CNVkit is run with the same sample as both tumor and reference (i.e., sample vs. itself), it still produces CNV calls. I would have expected no calls at all, since the coverage ratio should be exactly 1 in all targets and antitargets. In reality, however, the calls are similar to those produced with the paired control. For example, when I use the sample EX_11 from the cnvkit-examples repository, CNVkit identifies amplified segments near the centromeres of chr1 and chr3.

This is an artificial scenario, but I would like to understand where those calls come from. Wouldn't this lead to false calls when certain regions of the genome have a systematic coverage bias? As far as I understand, systematic biases are exactly the kind of artifact that a paired control is meant to remove. I also noticed a considerable amount of noise in the coverage ratios, even when comparing a sample against itself. Does CNVkit place bins differently for tumor and reference?

Regards, Sebastian

etal commented 6 years ago

A single normal sample is itself a noisy estimator of the systematic biases in coverage.

CNVkit takes a few extra steps to reduce the impact of noise in the paired control while retaining the signal:

- When compiling the pool of control samples, CNVkit includes a "pseudocount" of flat coverages alongside the control sample coverages. In effect, it assumes that half of the variance in a single control sample is noise or sample-specific bias, and the other half is systematic bias. With more control samples, this converges to fully trusting the aggregated sample coverages as an estimate of systematic bias.
- It filters out bins where the control coverage is below 2^-5 or above 2^5 times the genome-wide average. This filtering happens when test samples are processed in fix, so the .cnr bins are a subset of the reference .cnn bins.
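
To make the effect of these two steps concrete, here is a minimal sketch of why a sample compared against itself does not come out flat. It is not CNVkit's actual code: the function names are hypothetical, the pooling is simplified to a plain mean over the control profiles plus one flat (log2 = 0) pseudo-sample, and the log2 values are assumed to be centered on the genome-wide average already.

```python
def pooled_reference(control_log2_profiles, flat_value=0.0):
    """Average the control profiles together with one flat pseudo-sample.

    Simplifying assumption: a plain per-bin mean stands in for whatever
    robust averaging the real tool applies.
    """
    n_bins = len(control_log2_profiles[0])
    profiles = list(control_log2_profiles) + [[flat_value] * n_bins]
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def fix_against_reference(sample_log2, reference_log2,
                          min_log2=-5.0, max_log2=5.0):
    """Subtract the reference in log2 space, skipping extreme reference bins.

    Returns (bin_index, corrected_log2) pairs, so the output bins are a
    subset of the reference bins.
    """
    return [
        (index, sample - reference)
        for index, (sample, reference)
        in enumerate(zip(sample_log2, reference_log2))
        if min_log2 <= reference <= max_log2
    ]


# One sample used as both "tumor" and its own single control.
sample = [0.0, 0.8, -0.4, -12.0]
reference = pooled_reference([sample])
ratios = fix_against_reference(sample, reference)

print(reference)  # the single control's deviations are halved
print(ratios)     # half of each deviation remains; the last bin is dropped
```

With a single control, half of every deviation from flat survives the subtraction, so a region with strongly biased coverage (such as one near a centromere) can still be segmented as a gain or loss. Passing the same profile nine times, `pooled_reference([sample] * 9)`, shrinks the residual to a tenth of the deviation, which is the convergence described above.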

Remember that GC and other corrections are done to each sample independently before subtracting the reference, so the normal sample coverage profile is not the only mechanism to remove systematic bias.

In somatic SNV calling it's standard to do the equivalent of calling variants in the tumor and normal samples, then subtract normal calls from the tumor calls to get somatic calls. In somatic CNA calling, I don't think the same approach is appropriate -- the CNVs that are normally seen in non-lesional tissue are usually much smaller than CNVkit is designed to detect, and would not overwhelm the somatic calls the way germline SNPs usually overwhelm somatic SNVs. The segments you see in EX_11 should be visible in the tumor sample even if they're also present in the "normal". If necessary, you can then easily flag possible large-scale CNVs in the germline with a post-processing step.
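
As a rough illustration of such a post-processing step, the sketch below flags tumor segments that are mostly covered by same-direction segments called separately in the normal sample. Everything here is an assumption made for the example: the tuple layout (chromosome, start, end, log2), the helper name, and both thresholds are hypothetical, and the normal segments are assumed not to overlap one another.

```python
def flag_germline_candidates(tumor_segments, normal_segments,
                             min_overlap=0.5, min_abs_log2=0.2):
    """Mark tumor segments that largely coincide with normal-sample segments.

    Each segment is a (chrom, start, end, log2) tuple. A tumor segment is
    flagged when at least `min_overlap` of its length is covered by normal
    segments with the same sign and |log2| >= `min_abs_log2`.
    """
    results = []
    for chrom, start, end, log2 in tumor_segments:
        covered = 0
        for n_chrom, n_start, n_end, n_log2 in normal_segments:
            if n_chrom != chrom or abs(n_log2) < min_abs_log2:
                continue  # different chromosome, or normal segment is near neutral
            if (n_log2 > 0) != (log2 > 0):
                continue  # gain vs. loss: not the same event
            covered += max(0, min(end, n_end) - max(start, n_start))
        fraction = covered / (end - start)
        results.append((chrom, start, end, log2, fraction >= min_overlap))
    return results


tumor = [
    ("chr1", 100, 200, 0.6),
    ("chr3", 0, 1000, 0.5),
    ("chr7", 0, 500, -0.7),
]
normal = [
    ("chr1", 120, 220, 0.5),
    ("chr3", 0, 100, 0.4),
]

for segment in flag_germline_candidates(tumor, normal):
    print(segment)
```

Flagging rather than deleting keeps the tumor segments visible, in line with the point above that they should show up in the tumor sample regardless of what the "normal" contains.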

suhrig commented 6 years ago

Thanks for the detailed explanation!

etal / cnvkit

CNV calls when comparing sample vs. itself #280