etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

Log2 ratio of normal reference #208

Closed: leoSeattle closed this issue 7 years ago

leoSeattle commented 7 years ago

I have a general question regarding the log2 ratio of normal reference samples. I followed the suggestion in the documentation to filter out noisy normal samples by running reference, coverage, and segment, and finally I ran scatter using ONLY normal samples. I am a bit confused about the log2 ratio in the resulting .cnr and .cns files and in the scatter plot.

My understanding is that the log2 ratio is the log2 of the ratio between (read counts from normal) and (read counts from disease samples). But when only normal samples are used, what does the log2 ratio mean? How is it calculated? Thanks

etal commented 7 years ago

In any .cnr file the log2 ratio is the ratio of the normalized coverage depth at a bin versus the normalized reference coverage. Normalized means recentered (in log2 scale) so that the genome-wide average bin log2 value is 0. If the reference is a pool of normals, the reference log2 value is the pool's average; if the reference is flat/generic, then the output log2 values are relative to the genome-wide average log2 value.
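As a rough illustration of the description above (not CNVkit's actual code), the sketch below recenters per-bin log2 values so their average is 0 and then subtracts the reference. The function names `recenter` and `log2_ratios` and the toy depths are hypothetical, a plain mean is assumed as the "genome-wide average", and any bias corrections CNVkit applies are left out.

```python
import math
from statistics import fmean


def recenter(log2_values):
    """Shift log2 values so that their average becomes 0 (assumed: plain mean)."""
    center = fmean(log2_values)
    return [v - center for v in log2_values]


def log2_ratios(sample_depths, reference_log2=None):
    """Per-bin log2 ratio of a sample versus a reference (illustrative only).

    sample_depths  -- positive per-bin coverage depths for one sample
    reference_log2 -- per-bin log2 values of a pooled reference, or None
                      to mimic a flat/generic reference
    """
    sample_log2 = recenter([math.log2(d) for d in sample_depths])
    if reference_log2 is None:
        # Flat reference: every bin is expected to sit at the average (0),
        # so the output is simply relative to the sample's own average.
        reference = [0.0] * len(sample_log2)
    else:
        reference = recenter(reference_log2)
    return [s - r for s, r in zip(sample_log2, reference)]


# Toy example with made-up depths for three bins.
toy_depths = [2.0, 4.0, 8.0]
flat_result = log2_ratios(toy_depths)
pooled_result = log2_ratios(toy_depths, reference_log2=[-1.0, 0.0, 1.0])
print(flat_result)
print(pooled_result)
```

With a flat reference the values only say how each bin compares with the sample's own average; with a pooled reference that has the same per-bin profile as the sample, the ratios collapse toward 0, which is what a normal sample compared against a pool of normals is expected to look like.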

andyjslee commented 6 years ago

I have a question similar to leoSeattle's. When I try to calculate the log2 value in the .cnr file using the method etal described above, I don't get matching values. Here is an example:

reference.cnn

chromosome start end gene log2 depth gc spread
chr1 12050 12277 LOC102725121,DDX11L1 -4.01333 3.46351 0.515419 0.541461

targetcoverage.cnn

chromosome start end gene depth log2
chr1 12050 12277 LOC102725121,DDX11L1 3.66079 1.87216

And in the cnr file I get the following:

chromosome start end gene depth log2 weight
chr1 12050 12277 LOC102725121,DDX11L1 3.66079 -0.487125 0.371021

So according to the description above, the log2 value in the cnr file should be obtained by: log2(3.66079/3.46351). However, this results in approximately 0.0796, and not -0.487125 as shown above.
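The arithmetic in this example can be reproduced with a few lines of Python. This is only a check on the numbers quoted above, using the values from the three files; the variable names are made up for illustration, and nothing here reflects how CNVkit itself computes the .cnr value.

```python
import math

# Values copied from the example rows above.
sample_depth = 3.66079        # 'depth' in targetcoverage.cnn
sample_log2 = 1.87216         # 'log2' in targetcoverage.cnn
reference_depth = 3.46351     # 'depth' in reference.cnn
reference_log2 = -4.01333     # 'log2' in reference.cnn
reported_cnr_log2 = -0.487125 # 'log2' in the .cnr file

# The log2 column of targetcoverage.cnn is just log2 of its depth column.
recomputed_sample_log2 = math.log2(sample_depth)

# Naive guess 1: log2 of the ratio of the two 'depth' columns.
naive_from_depths = math.log2(sample_depth / reference_depth)

# Naive guess 2: difference of the two raw 'log2' columns.
naive_from_log2 = sample_log2 - reference_log2

print(f"log2(sample depth):         {recomputed_sample_log2:.5f}")
print(f"log2(depth ratio):          {naive_from_depths:.4f}")
print(f"sample log2 - ref log2:     {naive_from_log2:.4f}")
print(f"reported .cnr log2:         {reported_cnr_log2:.4f}")
```

Neither naive calculation reproduces the reported .cnr value, which suggests that the raw columns of the two .cnn files are not simply divided or subtracted to obtain it.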

Thank you for your time and help in advance.

etal commented 6 years ago

The 'depth' column is there for information and for convenience in filtering out no-coverage bins. CNVkit does this instead:

andyjslee commented 6 years ago

So what log2 value in the cnr file implies a potential gain? Is it a positive value (>0.0) or a value above 1.0?

Eric, I have a problem: the results in the cnr/cns data do not seem to agree with what I see in IGV. For your information, I am using matched tumor/normal WES data.

Here is a screenshot of MYC in one of the samples using cnvkit.py scatter: [image: myc]

Here is an IGV screenshot of MYC in the same sample: [image: igv_myc]

As you can see, the cnvkit.py scatter plot shows positive log2 values in MYC (though below 1), while the same region appears to be a clear gain in IGV (top track is normal and bottom track is tumor; the y-axis maximum is 450 in both tracks). Is this just an over-segmentation problem?

Thank you for your response, as always, Eric!