etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

Unable to produce .cnr files #891

Open nithishak opened 5 months ago

nithishak commented 5 months ago

Hello, I am using the latest Docker image of CNVkit to run:

 cnvkit.py batch $CNV_BAMS/*_T.bam \
        --normal $CNV_BAMS/*_G.bam \
        --targets $BEDFILE_0B_BAITS \
        --fasta $REF_GENOME_b37 \
        --access /data/access-5k-mappable.grch37.bed \
        --output-reference $CNV_BAMS/my_reference.cnn \
        --output-dir $CNV_T_RESULTS \
        --diagram \
        --scatter \
        -p 8 \
        --cluster

However, while some runs finish normally, in others the log stops after the .cnn files are written:

Percent reads in regions: 92.708 (of 13203032 mapped)
Wrote sample_G.targetcoverage.cnn with 9583 regions
Processing reads in sample_G.bam
Time: 4.344 seconds (0 reads/sec, 4394 bins/sec)
Summary: #bins=19087, #reads=1, mean=0.0001, min=0.0, max=1.89
Percent reads in regions: 0.000 (of 13203032 mapped)
Wrote sample_G.antitargetcoverage.cnn with 19087 regions
Processing target: sample_G
Keeping 8419 of 9583 bins
Correcting for GC bias...
Correcting for density bias...
Processing antitarget: sample_G
Keeping 1 of 19087 bins
Correcting for GC bias...
ALL DONE

On debugging, it looks like when the do_fix function is called on the anti-target file, the assert statement in the _width2wing function in smoothing.py fails, and the program terminates without printing any error message. Interestingly, if I use a reference.cnn produced by another run (with a different set of normal samples), the .cnr files are produced.
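
To illustrate why a single surviving bin could trip such an assertion, here is a minimal sketch. It is a simplified stand-in and not CNVkit's actual code: the function names `width_to_wing` and `try_wing`, the default `min_wing=3`, and the width `0.1` are all assumptions made for illustration. The idea is that a smoothing half-window ("wing") is capped at one less than the number of bins, so with only 1 bin the cap is 0 and a check that the wing is at least 1 fails.

```python
import math


def width_to_wing(width, n_bins, min_wing=3):
    """Hypothetical stand-in: turn a window width into a half-width ("wing")."""
    if 0 < width < 1:
        # Fractional width: a share of the available bins
        wing = math.ceil(n_bins * width * 0.5)
    else:
        # Absolute width, given in bins
        wing = int(width) // 2
    wing = max(wing, min_wing)
    # The window cannot reach further than the data on either side
    wing = min(wing, n_bins - 1)
    assert wing >= 1, f"wing must be at least 1 (got {wing})"
    return wing


def try_wing(n_bins, width=0.1):
    """Return the wing size, or None if the assertion fails."""
    try:
        return width_to_wing(width, n_bins)
    except AssertionError:
        return None


print(try_wing(8419))  # enough bins, as in the target step of the log
print(try_wing(1))     # a single bin, as in the anti-target step of the log
```

Under these assumptions, the 1-bin case is the only one that fails, which would be consistent with the "Keeping 1 of 19087 bins" line in the failing log.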

Usually, when a run succeeds, the anti-target step keeps 0 of x bins and warns that most bins have low coverage, like this:

Processing target: sample_G
Keeping 8415 of 9583 bins
Correcting for GC bias...
Correcting for density bias...
Processing antitarget: sample_G
Keeping 0 of 19087 bins
WARNING: most bins have no or very low coverage; check that the right BED file was used
Correlations with each cluster:
        log2    : 0.9580718621239102
        log2_1  : 0.9575440273517433
        log2_4  : 0.954962901257862
        log2_2  : 0.764261436668614
        log2_3  : 0.7586570106339138
 -> Choosing columns 'log2' and 'spread'
Wrote sample_G.cnr with 8415 regions

I am unable to debug beyond this point and would appreciate any advice! Thank you.

28rietd commented 4 months ago

CNVkit internally filters out bins (based on GC, log2, spread and depth = 0) that might skew the CNV calling results. My understanding is that the smoothing step, which takes care of the bias corrections, fails when fewer than 2 bins are left after filtering, and that filtering happens separately for target and anti-target bins. Since your problem seems to be with the anti-target bins, you could try disabling them. I am not entirely sure, since I don't usually use the batch command, but I think you can do that with --method amplicon.
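
As a quick way to see how many anti-target bins could survive the depth filter before deciding, something like the sketch below might help. It only checks the depth > 0 criterion, not the GC, log2 or spread filters, and it assumes the .cnn file is tab-separated with a header row containing a `depth` column. The helper name `count_covered_bins` and the example rows are made up for illustration and are not taken from the files in this issue.

```python
import csv
import io


def count_covered_bins(cnn_text, depth_column="depth"):
    """Return (covered, total) bin counts from .cnn-style tab-separated text."""
    reader = csv.DictReader(io.StringIO(cnn_text), delimiter="\t")
    total = covered = 0
    for row in reader:
        total += 1
        if float(row[depth_column]) > 0:
            covered += 1
    return covered, total


# Made-up example rows; the column layout is an assumption
example = (
    "chromosome\tstart\tend\tgene\tdepth\tlog2\n"
    "1\t10000\t160000\tAntitarget\t0.0\t-20.0\n"
    "1\t160000\t310000\tAntitarget\t1.89\t0.9\n"
    "1\t310000\t460000\tAntitarget\t0.0\t-20.0\n"
)
covered, total = count_covered_bins(example)
print(f"{covered} of {total} bins have nonzero depth")
```

For a real file, the text could be read with `open("sample_G.antitargetcoverage.cnn").read()` and passed to the same function. If only one or zero bins have coverage, that would support dropping the anti-target bins altogether.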

nithishak commented 4 months ago

Thank you for your input. Should I expect to see the low-coverage warning for the anti-target analysis, since these are the non-targeted regions? Also, it always seems to be that one bin from the anti-target file that causes the error. Would it be wise to drop that row from reference.cnn so that the analysis can complete?