etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

GC Bias correction #379

Open jgrady-omico opened 6 years ago

jgrady-omico commented 6 years ago

Hi Eric,

We've been using CNVkit for the last couple of years to analyse the capture panel data from the Illumina TST170 cancer panel, and have run around 500 samples through it. We have discovered an issue with GC bias that I thought CNVkit would be taking care of internally, but the outputs seem to indicate it's not working as we would expect.

We noticed the problem because one gene in particular, TSC2, was appearing as homozygously deleted in around 15-20% of samples, but this is not consistent with published data on the gene. We validated a few with FISH and all were false positives. We also noticed that, as well as the false positive deletions, there were frequent calls for amplification of the gene, so it looked as though the CNV calls were perhaps just generally unstable. As a clue to why, we also noticed that the region is GC rich, so we decided to use the GC quintile normalised depths from Picard CollectGcBiasMetrics to see if the GC content of the sample was correlated with the log2 calls from CNVkit. And indeed they are.

I've attached a plot of the segmented log2 calls from CNVkit for each gene vs the normalised depth of the topmost GC quintile for each sample (N~500). Although most genes show no correlation with the GC content of the sample (as we would hope), for TSC2 there is a very strong direct correlation. The same is true to a lesser extent for several other genes, e.g. CCND1, STK11, SLX4, NOTCH1, etc., and some genes show the inverse correlation, e.g. NOTCH2. If we use the third or fourth quintile, the correlations are reversed.
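For reference, the correlation check described above can be sketched roughly as follows. This is a minimal illustration only; `log2_by_gene` (gene → per-sample segmented log2 values) and `gc_q5_depth` (per-sample normalised depth of the top GC quintile from Picard CollectGcBiasMetrics) are hypothetical containers, not CNVkit or Picard outputs.

```python
import numpy as np

def gene_gc_correlations(log2_by_gene, gc_q5_depth):
    """Pearson correlation of each gene's log2 calls vs. sample GC-quintile depth."""
    gc = np.asarray(gc_q5_depth, dtype=float)
    corrs = {}
    for gene, log2_vals in log2_by_gene.items():
        vals = np.asarray(log2_vals, dtype=float)
        # corrcoef returns the 2x2 correlation matrix; take the off-diagonal
        corrs[gene] = np.corrcoef(vals, gc)[0, 1]
    return corrs
```

Genes behaving as described would show correlations near +1 (e.g. TSC2) or -1 (e.g. NOTCH2), while unaffected genes would sit near 0.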

To address this, we intend to correct the CNVkit log2 calls using the GC content of the sample at the gene level post-CNVkit (basically using a linear model of the plots), but my understanding was that CNVkit already corrects each of the probes for GC content, does it not? It may be arising because the range of GC in our samples is different from those used to create the CNVkit reference file - the controls were perhaps run in much more controlled conditions than our patient samples.
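The linear-model correction proposed above might look like the following sketch: fit log2 against sample GC depth per gene across the cohort, then subtract the fitted trend. The function name and baseline choice are illustrative assumptions, not part of CNVkit.

```python
import numpy as np

def gc_correct_log2(log2_vals, gc_vals):
    """Residualize one gene's cohort-wide log2 calls against sample GC depth."""
    x = np.asarray(gc_vals, dtype=float)
    y = np.asarray(log2_vals, dtype=float)
    # Fit a straight line (degree-1 polynomial) across the cohort
    slope, intercept = np.polyfit(x, y, 1)
    # Remove the GC-dependent trend, keeping the cohort mean as the baseline
    return y - (slope * x + intercept) + y.mean()
```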

In any case, we'd appreciate your insight into whether post-CNVkit correction is a sensible approach or whether the more extreme outliers in GC bias are just overall unreliable.

Many thanks,

John.

[scatter_gr_q5.pdf](https://github.com/etal/cnvkit/files/2261133/scatter_gr_q5.pdf)

jgrady-omico commented 6 years ago

Note - TSC2 is right at the bottom of the attached pdf.

etal commented 6 years ago

Hi John, thanks for letting me know! I think your hypothesis is a likely one -- first, the degree of GC bias varies by sample, and while CNVkit's automated correction is meant to capture this, it does not fully succeed; and second, your reference profile is not entirely representative of the samples you're processing with it, which could be due to a difference in lab processes used, or if normal samples were fresh tissue (e.g. blood draws) while tumor samples are FFPE.

For the short term, I see nothing wrong with doing a post-CNVkit correction to the .cnr files, or at the gene level if you're extracting gene-level calls from gainloss/genemetrics. If it noticeably improves your results, please let me know and I'll see about revising the GC correction procedure to match Picard's percentile-binning method.

Another CNVkit feature currently in development that may help your situation is #308, selecting clusters of control samples to better match each input test sample. That feature, when it's ready, may be able to reduce the problem of unrepresentative control samples in the reference profile.

jgrady-omico commented 5 years ago

Hi Eric,

Just to update you on progress with this. The post correction of cnr files didn't work well.

I didn't have enough controls to stratify them by their Picard metrics GC profile, so instead I created around 15 .cnn reference files for various central GC_NC_60_79 values as reported by Picard metrics, using the tumour BAMs (N~800, 40-60 tumours per GC value). GC_NC_60_79 was the best correlate with our false TSC2 deletion calls, though GC_NC_80_100 was also good. This is probably specific to our capture panel.

For CNV calling/segmentation, I then select the closest cnn reference by the GC_NC_60_79 of the tumour sample I am segmenting and use that.
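The reference-selection step just described can be sketched like this. The tuple layout and file naming are assumptions for illustration; CNVkit itself has no such selection feature.

```python
def pick_reference(sample_gc_60_79, references):
    """Pick the .cnn reference whose central GC_NC_60_79 is closest to the sample's.

    references: list of (gc_60_79_value, cnn_path) tuples.
    """
    return min(references, key=lambda ref: abs(ref[0] - sample_gc_60_79))[1]
```

The chosen path is then passed as the `-r`/`--reference` argument for that sample's CNVkit run.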

This has worked extremely well in fixing the GC bias - from around 50 false TSC2 deletion calls, we now have only 5, all of which look genuine. One of these was not originally called because the sample's high GC bias shifted the log2 values upward and obscured the real deletion. So, this is a great result.

Using the tumours as reference is not without problems, though - genuine deletion and amplification calls/segmentation are sometimes missed, I think because the outliers in the CNV data are not being removed as effectively as they could be, or there are simply too many samples with amplified/deleted chromosomal regions included in the reference. The precise issues vary from reference to reference. In the end, I use the CNVkit segmentation, but I do a post-CNVkit check of the segmentation: if the log2 values of the probes for a gene are too far from the segmented line, I flag it as a potential false negative. This is not ideal, but it works in the interim so that we don't miss genuine deletion or amplification calls.
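The post-segmentation sanity check described above might be sketched as follows: flag a gene when the median of its probe-level log2 values strays too far from the segment mean. The threshold and data layout are illustrative assumptions, not a CNVkit feature.

```python
import statistics

def flag_discordant(probe_log2s, segment_log2, threshold=0.5):
    """Return True if probe-level evidence disagrees with the segment call."""
    # Median is robust to a few outlier probes within the gene
    return abs(statistics.median(probe_log2s) - segment_log2) > threshold
```

A gene flagged this way would then be reviewed manually before ruling out a real deletion or amplification.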

In any case - the principle works, we just need a better way of selecting tumours to include in the references, perhaps restricting to just those with low purity.

A solution similar to https://github.com/etal/cnvkit/issues/308 may also help.

etal commented 5 years ago

Thanks for the update. I'll work on these fixes.

How does your post-CNVkit check of probe-level log2 values vs. segment means compare to CNVkit's bootstrapped confidence interval calculation and filter? Is this a manual, visual check?