etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

Inconsistency between cns file and target coverage file #636

Open enes-ak opened 3 years ago

enes-ak commented 3 years ago

Hi,

I am confused about one point in the CNVkit algorithm.

[image: plot of the segmented .cns file, showing the chr6 region]

This picture came from the .cns (segmented) file. There is a deletion in that region of chr6. To understand this result, I checked the reference.cnn and targetcoverage.cnn files.

Related part of the reference.cnn file: [image] As you can see there, the depth over the relevant range is 1, so the log2 value is zero.

Related part of the targetcoverage.cnn file: [image] As you can see there, the depth is greater than 1 in every bin of the relevant range.

So I expected there to be a gain in that region of chr6. Can anyone explain or interpret this result?

Thanks for all! Best, Enes

tetedange13 commented 3 years ago

Hi @enes-ak ,

Not an author of CNVkit, but if I understand correctly, you are showing us a ".cns" file and the corresponding "reference.cnn" and ".targetcoverage.cnn" files?

- As you may know from CNVkit's documentation of its pipeline, there is another step between the ".cnn" files and the ".cns" file: `fix`, with its associated ".cnr" output file (see the sketch below).
- Since this ".cnr" file is the main input given to the `segment` sub-command, it will probably be easier to explain your results if you could also share the relevant portion of this ".cnr" file with us.
- Also, we are missing some parts of your segment here (at the beginning and end), so a scatter plot might show it better: `cnvkit.py scatter -s your_sample.cns -c chr6 your_sample.cnr`
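In other words, the chain from coverage to segments looks like this (a sketch with placeholder filenames; the reference may be pooled or flat):

```sh
# Normalize the binned coverages against the reference -> .cnr
cnvkit.py fix your_sample.targetcoverage.cnn your_sample.antitargetcoverage.cnn \
    reference.cnn -o your_sample.cnr

# Segment the normalized log2 ratios -> .cns
cnvkit.py segment your_sample.cnr -o your_sample.cns
```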

Likewise, we would benefit from knowing your exact CNVkit command line and some extra information about your dataset (hybrid capture, amplicon, or WGS; panel, WES, or WGS; germline or tumor, with or without matched normal; etc.).

Hope this helps. Have a nice day. Felix.

enes-ak commented 3 years ago

Thank you for your answer, dear @tetedange13.

I applied the full CNV pipeline from CNVkit's documentation, so I ran each step individually, one of which was the fix command.

Here are the pipeline steps that I use (sketched below):

I used a BED file as the interval list and access-excludes.hg38.bed as the access file. I didn't use a pooled reference.
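Roughly, the steps follow the flat-reference workflow from the documentation (filenames here are simplified placeholders, not my exact paths):

```sh
# Bin the baited regions and the accessible off-target regions
cnvkit.py target my_panel.bed --split -o my_targets.bed
cnvkit.py antitarget my_targets.bed -g access-excludes.hg38.bed -o my_antitargets.bed

# Per-bin read depths for the sample
cnvkit.py coverage sample.bam my_targets.bed -o sample.targetcoverage.cnn
cnvkit.py coverage sample.bam my_antitargets.bed -o sample.antitargetcoverage.cnn

# Flat reference (no normal samples pooled)
cnvkit.py reference -o flat_reference.cnn -f hg38.fa -t my_targets.bed -a my_antitargets.bed

# Normalize, segment, and call
cnvkit.py fix sample.targetcoverage.cnn sample.antitargetcoverage.cnn flat_reference.cnn -o sample.cnr
cnvkit.py segment sample.cnr -o sample.cns
cnvkit.py call sample.cns -o sample.call.cns
```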

I am trying to analyse WES and clinical exome panels, and I am writing an automated pipeline with Snakemake.

Here are the outputs I mentioned above: reference.cnn, targetcoverage.cnn, the .cnr file, and the call file, respectively.

[screenshot: excerpts of reference.cnn, targetcoverage.cnn, the .cnr file, and the call file]

In the reference file, the depth over the relevant region is 1 and the log2 is zero, while the depth is greater than 1 in the targetcoverage.cnn file. So I was expecting to see a "gain", but when I check the .cnr and segmented files, the log2 ratio is less than zero.

Why?

Thanks for all! Best, Enes

tskir commented 3 years ago

Hi @enes-ak, it looks to me like you're using the flat reference, which is also corroborated by this part of your previous comment:

I didn't use a pooled reference.

When you use a flat reference, the log2 column will always be '0' and the depth will always be '1' (you can check whether this is the case in your reference file).

In most situations, it is recommended that you use a pooled reference, because this allows the algorithm to better account for the underlying variance in region coverage. The flat reference, which you appear to be using, should only be used as a "last resort", for example when no normal samples are available to build a pooled reference from.

Since you're using the flat reference, its log2 and depth columns contain, essentially, dummy values. As @tetedange13 correctly mentioned, there is a normalisation step (`cnvkit.py fix`) that normalises the sample's total coverage and calculates the difference from the expected values (the ones in the reference).
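To illustrate with made-up numbers: if a typical bin in your sample has a depth of around 30, `fix` will centre those bins near log2 = 0; a bin with depth 4 then lands around log2(4/30) ≈ -2.9, which reads as a deep deletion even though its absolute depth is well above 1.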

So, looking at the CNR file in your screenshot, you can see that the log2 values are consistently strongly negative, suggesting that this region is indeed deleted. This is why, consequently, a complete deletion (cn=0) is being called.

Now, this could be a real deletion, or it could mean that for some reason the assay you are using does not cover this region sufficiently well. As I said, the flat reference is a last-resort method, so it can produce suboptimal results.

A final note: HLA regions are notoriously difficult to align and call properly because of their repetitive and highly polymorphic nature.

tskir commented 3 years ago

@enes-ak As a concrete suggestion for improving the results, I suggest switching to a pooled reference. Could you tell me how many samples you usually have in a sequencing batch?

enes-ak commented 3 years ago

Hi @tskir, yes, I am using a flat reference, and I know that analysis with a pooled reference gives more reliable results. I will analyse with a pooled reference in the future, but for now I want to use the flat reference.

You're right; I checked my reference.cnn file again. The log2 column contains 0 and -1, and the depth column contains 1 and 0.5. If I understand correctly, in the normalization step the log2 values are normalized in the .cnr file and the range shifts into negative territory, so it is shown as a deletion.

In this analysis I used only one sample; you can check the commands I used above (in my earlier comment).

If I analyse with a pooled reference, should I use normal samples as the reference? How can I know whether these "normal" references are really normal controls? How can I find control samples?

Thanks!

tetedange13 commented 3 years ago

Hi @enes-ak ,

Constituting a pool of normals ("PoN") is a whole topic in itself, addressed partly here in CNVkit's documentation. I guess you are working on a germline dataset (not tumor samples)?

- Control samples can simply be samples produced by the same wet lab but showing no evidence of CNVs (their negative status could have been confirmed by another technique).
- I personally find `cnvkit.py heatmap` useful for evaluating noise and possible CNVs across a set of samples (though it can be tricky on WES data).
- If you have enough samples, you may also find the `--cluster` approach useful: it creates sub-groups within the reference (based on their log2 profiles), then calls CNVs on your test sample using the sub-group that correlates best with it (see the sketch below).
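Something along these lines, for instance (an untested sketch; filenames are placeholders, and `--cluster` needs a reasonably recent CNVkit release):

```sh
# Pooled reference built from the normals' coverage files,
# grouping similar samples into clusters
cnvkit.py reference *Normal.targetcoverage.cnn *Normal.antitargetcoverage.cnn \
    -f hg38.fa --cluster -o pooled_reference.cnn

# Quick visual check of noise and possible CNVs across segmented samples
cnvkit.py heatmap *.cns
```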

If I analyse with a pooled reference, should I use normal samples as the reference?

Yes, once you have selected your pool of reference samples, use:

`cnvkit.py batch -n *Normal.bam --output-reference my_new_reference.cnn`

Then provide the produced ".cnn" file to every CNV-calling run for your subsequent test samples:

`cnvkit.py batch next_test_sample.bam -r my_new_reference.cnn`

Hope this helps. Felix.