Log2 ratio of normal reference #208

etal / cnvkit

Copy number variant detection from targeted DNA sequencing

http://cnvkit.readthedocs.org

Closed leoSeattle closed 7 years ago

leoSeattle commented 7 years ago

I have a general question regarding the log2 ratio of normal reference samples. I followed the suggestions in the documentation to filter out noisy normal samples by running reference, coverage, and segment, and finally ran scatter using ONLY normal samples. I am a bit confused about the log2 ratio in the resulting cnr and cns files and in the scatter plot.

My understanding is that the log2 ratio is the log2 of the ratio between (read counts from normal) and (read counts from disease samples). But when only normal samples are used, what does the log2 ratio mean, and how is it calculated? Thanks.

etal commented 7 years ago

In any .cnr file the log2 ratio is the ratio of the normalized coverage depth at a bin versus the normalized reference coverage. Normalized means recentered (in log2 scale) so that the genome-wide average bin log2 value is 0. If the reference is a pool of normals, the reference log2 value is the pool's average; if the reference is flat/generic, then the output log2 values are relative to the genome-wide average log2 value.

andyjslee commented 6 years ago

I have a similar question to leoSeattle's. When I try to calculate the log2 value in the cnr file using the method etal described above, I don't get matching values. Here is an example:

reference.cnn

chromosome	start	end	gene	log2	depth	gc	spread
chr1	12050	12277	LOC102725121,DDX11L1	-4.01333	3.46351	0.515419	0.541461

targetcoverage.cnn

chromosome	start	end	gene	depth	log2
chr1	12050	12277	LOC102725121,DDX11L1	3.66079	1.87216

And in the cnr file I get the following:

chromosome	start	end	gene	depth	log2	weight
chr1	12050	12277	LOC102725121,DDX11L1	3.66079	-0.487125	0.371021

So according to the description above, the log2 value in the cnr file should be log2(3.66079/3.46351). However, this gives approximately 0.0799, not the -0.487125 shown above.

Thank you for your time and help in advance.
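As a quick check of the arithmetic in this example, here is a minimal sketch using only the numbers quoted above (the variable names are made up for illustration). It shows that the ratio of the two 'depth' values gives about 0.08, and that log2 of the sample depth alone reproduces the 'log2' value in targetcoverage.cnn for this bin, which suggests that column is the raw log2 depth before any centering or correction.

```python
import math

# Values copied from the rows quoted above (variable names are illustrative only)
sample_depth = 3.66079      # 'depth' in targetcoverage.cnn and the cnr file
reference_depth = 3.46351   # 'depth' in reference.cnn
sample_log2 = 1.87216       # 'log2' in targetcoverage.cnn

# The naive expectation: log2 of the ratio of the two depth columns
naive_ratio = math.log2(sample_depth / reference_depth)

# log2 of the sample depth by itself, to compare against targetcoverage.cnn
raw_sample_log2 = math.log2(sample_depth)

print(f"log2(sample depth / reference depth) = {naive_ratio:.4f}")
print(f"log2(sample depth)                   = {raw_sample_log2:.4f}")
print(f"targetcoverage.cnn log2              = {sample_log2:.4f}")
```

Neither number matches the -0.487125 in the cnr file, so the cnr value cannot come from the 'depth' columns alone.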

etal commented 6 years ago

The 'depth' column is there for information and for convenience in filtering out no-coverage bins. CNVkit does this instead:

1. Center targetcoverage.cnn 'log2' values so the average is 0.
2. Correct for GC and targeting density biases, subtracting those trendlines from the centered log2 values.
3. Subtract the corresponding bin 'log2' value in reference.cnn, where bias corrections have already been applied.
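The three steps above can be sketched roughly as follows. This is a toy illustration, not CNVkit's code: the function names and the four-bin data are invented, and the bias correction is replaced by a crude stand-in (subtracting the median log2 of bins with similar GC) rather than whatever trendline fitting CNVkit actually performs.

```python
from statistics import mean, median

def center(log2_values):
    """Step 1: shift log2 values so their average is 0."""
    avg = mean(log2_values)
    return [v - avg for v in log2_values]

def subtract_gc_trend(log2_values, gc_values, n_groups=2):
    """Step 2 (stand-in): group bins by GC and subtract each group's median log2.
    This is an assumed simplification, not CNVkit's actual trendline."""
    order = sorted(range(len(gc_values)), key=lambda i: gc_values[i])
    size = -(-len(order) // n_groups)  # ceiling division
    corrected = list(log2_values)
    for start in range(0, len(order), size):
        group = order[start:start + size]
        trend = median(log2_values[i] for i in group)
        for i in group:
            corrected[i] = log2_values[i] - trend
    return corrected

def toy_cnr_log2(sample_log2, gc, reference_log2):
    """Steps 1-3: center, bias-correct, then subtract the reference log2 per bin."""
    centered = center(sample_log2)
    corrected = subtract_gc_trend(centered, gc)
    return [s - r for s, r in zip(corrected, reference_log2)]

# Made-up values for four bins
sample_log2 = [1.9, 2.1, 3.0, 3.2]
gc = [0.35, 0.40, 0.60, 0.65]
reference_log2 = [0.0, 0.1, -0.1, 0.0]

result = toy_cnr_log2(sample_log2, gc, reference_log2)
print([round(v, 3) for v in result])
```

The point of the sketch is that the output depends on every bin in the sample (through the centering and the trend), so a single bin's cnr value cannot be recomputed from that bin's row alone.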

andyjslee commented 6 years ago

So what log2 value in the cnr file implies a potential gain? Is it a positive value (>0.0) or a value above 1.0?

Eric, I have a problem where I cannot reconcile the results in the cnr/cns data with what I see in IGV. For your information, I am using matched tumor/normal WES data.

Here is a screenshot of MYC in one of the samples using cnvkit.py scatter: [image: myc]

Here is an IGV screenshot of MYC in the same sample: [image: igv_myc]

As you can see in the cnvkit.py scatter plot there are positive log2 values in MYC (though below 1), but the same region seems to be a gain on IGV (top track is normal and bottom track is tumor; y-axis is 450 in both tracks). Is this just an over-segmentation problem?

Thank you for your response, as always, Eric!
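On the question of which log2 value implies a gain, a small worked calculation may help frame it. The sketch below assumes a simple mixture model that is not taken from this thread: a diploid baseline, a single tumor clone, and a known tumor purity. The function name and parameters are hypothetical, and this is not CNVkit's calling logic.

```python
import math

def expected_log2(copies, purity=1.0, ploidy=2):
    """Expected log2 ratio for a region with `copies` copies in tumor cells,
    mixed with normal cells that carry `ploidy` copies (assumed simple model)."""
    mixed_copies = purity * copies + (1.0 - purity) * ploidy
    return math.log2(mixed_copies / ploidy)

one_copy_gain_pure = expected_log2(3)              # 3 copies, 100% tumor
two_copy_gain_pure = expected_log2(4)              # 4 copies, 100% tumor
one_copy_gain_half = expected_log2(3, purity=0.5)  # 3 copies, 50% tumor

print(f"{one_copy_gain_pure:.3f} {two_copy_gain_pure:.3f} {one_copy_gain_half:.3f}")
```

Under these assumptions a single-copy gain sits near 0.58 in a pure sample and lower still when normal cells are mixed in, so a real gain can show up as a positive log2 value well below 1.0.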