etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

cnvkit fix doesn't produce cnr file #637

Closed khuhu closed 3 years ago

khuhu commented 3 years ago

Hi,

I originally ran the batch command on my samples, and everything was processed without any error messages. It seems to have stopped working correctly at the cnvkit fix stage: the target and antitarget files are produced for each BAM file, but no .cnr file is.

So I decided to run the fix command directly on one set of target and antitarget files. Again, no output .cnr file was produced, and I saw no error message. I used the following cnvkit command, and I have also included some of what was printed to the terminal while it ran (positions from the reference file). Any idea why no .cnr file is produced?

cnvkit.py fix results/SRR5206578_filt2.targetcoverage.cnn results/SRR5206578_filt2.antitargetcoverage.cnn /mnt/DATA5/tmp/kev/tmpDbs/SRA/iKapMice2/FlatReference.cnn -o results/SRR5206578.cnr

('chrX', 134336084, 134336174) ('chrX', 134336084, 134336174) ('chrX', 134336510, 134336639) ('chrX', 134340347, 134340412) ('chrX', 134340347, 134340412) ('chrX', 134340347, 134340412)

tskir commented 3 years ago

Hi @khuhu, would you be able to share the files you're running this on (SRR5206578_filt2.targetcoverage.cnn, SRR5206578_filt2.antitargetcoverage.cnn, FlatReference.cnn)? That would be the easiest way for me to debug this. You could either share them publicly or email files/links to ktsukanov [at] ebi.ac.uk

tskir commented 3 years ago

Note to self — Files received

tskir commented 3 years ago

@khuhu The reason for the problem is that your input files have duplicate regions. For example:

$ grep $'chr1\t3421701\t3421901' FlatReference.cnn 

Result:

chr1    3421701 3421901 XM_006495550.3_exon_2_0_chr1_3421702_r  0       1       0.51    0       0
chr1    3421701 3421901 NM_001011874.1_exon_1_0_chr1_3421702_r  0       1       0.51    0       0
chr1    3421701 3421901 XM_011238395.2_exon_1_0_chr1_3421702_r  0       1       0.51    0       0

Due to how CNVkit operates, it is required that all regions in the input files are unique and non-overlapping.
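A quick way to check a file for exact duplicate coordinates before running fix is sketched below. This is only an illustration, not part of CNVkit: the function name find_duplicate_regions is hypothetical, and the sketch assumes the file is tab-separated with a header row that names the first three columns chromosome, start and end. It catches exact duplicates only, not partially overlapping regions.

```python
import csv
import io
from collections import Counter


def find_duplicate_regions(handle):
    """Return {(chromosome, start, end): count} for coordinates seen more than once.

    Hypothetical helper; assumes a tab-separated file with a header row
    containing 'chromosome', 'start' and 'end' columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(
        (row["chromosome"], int(row["start"]), int(row["end"])) for row in reader
    )
    return {region: n for region, n in counts.items() if n > 1}


# In-memory example built from the three duplicated rows shown above,
# plus one made-up unique row for contrast.
example = io.StringIO(
    "chromosome\tstart\tend\tgene\n"
    "chr1\t3421701\t3421901\tXM_006495550.3_exon_2_0_chr1_3421702_r\n"
    "chr1\t3421701\t3421901\tNM_001011874.1_exon_1_0_chr1_3421702_r\n"
    "chr1\t3421701\t3421901\tXM_011238395.2_exon_1_0_chr1_3421702_r\n"
    "chr1\t3500000\t3500200\tmade_up_unique_region\n"
)
duplicates = find_duplicate_regions(example)
for (chrom, start, end), n in duplicates.items():
    print(f"{chrom}:{start}-{end} appears {n} times")
```

To run it on a real file, pass an open file handle instead of the StringIO object, for example `with open("FlatReference.cnn") as handle: find_duplicate_regions(handle)`.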

tskir commented 3 years ago

Additionally, because of the very high number of duplicate regions, the error message turned out to be hard to read. It started with “Duplicated genomic coordinates in set: ...” and then proceeded to print thousands of lines listing the regions, obscuring the actual error.

I will modify the error message to be more informative in such cases and submit a pull request shortly. In the meantime, this issue should remain open.
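As a rough illustration of the idea, a long list of duplicated regions could be shortened to the first few entries plus a count of the rest. This is a hypothetical sketch, not CNVkit's code and not necessarily what the pull request does: the function name summarize_duplicates, the max_shown parameter and the exact wording after the colon are all assumptions.

```python
def summarize_duplicates(regions, max_shown=3):
    """Build a short error message from a list of (chromosome, start, end) tuples.

    Hypothetical helper: shows at most max_shown regions and counts the rest,
    so the message stays readable even with thousands of duplicates.
    """
    shown = ", ".join(f"{chrom}:{start}-{end}" for chrom, start, end in regions[:max_shown])
    hidden = len(regions) - max_shown
    suffix = f" (and {hidden} more)" if hidden > 0 else ""
    return f"Duplicated genomic coordinates in set: {shown}{suffix}"


# Coordinates taken from the terminal output and grep result quoted in this thread.
regions = [
    ("chrX", 134336084, 134336174),
    ("chrX", 134336510, 134336639),
    ("chrX", 134340347, 134340412),
    ("chr1", 3421701, 3421901),
]
message = summarize_duplicates(regions, max_shown=3)
print(message)
```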

khuhu commented 3 years ago

@tskir Ah I had suspected that, but like you said the error message was buried somewhere in the standard error output. Thanks again!

tskir commented 3 years ago

Reopening until #638 is merged