etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

cnvkit access fails to add >NC* named regions to the .bed file #679

Open vmukhina opened 2 years ago

vmukhina commented 2 years ago

I ran cnvkit access (0.9.9) on two files containing the same FASTA sequence under different names and got different results: NC001526.4 was skipped, whereas Nt_001526.4 was added to the output .bed file. Here are both logs.

    Nt_001526.4: Scanning for accessible regions
    Accessible region Nt_001526.4:0-7906 (size 7906)
    Nt_001526.4: Joining over small gaps
    Wrote test.bed with 1 regions

and

    NC001526.4: Scanning for accessible regions
    Accessible region NC001526.4:0-7906 (size 7906)
    Wrote test.bed with 0 regions

Same here: cnvkit ignores all NC_ sequences in the RefSeq hg38 assembly, so regions from the primary assembly never appear in the .bed file and no CNVs are called for them.

etal commented 2 years ago

Yes, that's true. CNVkit doesn't tend to give useful calls on alternative contigs; read mapping is inconsistent.

Here's the filter applied to sequence names in the commands access and antitarget: https://github.com/etal/cnvkit/blob/master/cnvlib/antitarget.py#L115-L122
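To illustrate why the two names in the report behave differently, here is a minimal sketch of a name-based exclusion filter. The pattern and the function name `looks_canonical` are hypothetical and chosen only for this example; the actual filter is the one at the linked lines of antitarget.py and may differ in its details.

```python
import re

# Hypothetical exclusion pattern, for illustration only. The real filter
# lives at the linked lines of cnvlib/antitarget.py and may differ.
NONCANONICAL = re.compile(r"^NC|_random$|Un_|_alt$")


def looks_canonical(name: str) -> bool:
    """Return True if the contig name matches none of the exclusion patterns."""
    return NONCANONICAL.search(name) is None


for name in ("NC001526.4", "Nt_001526.4", "NC_000001.11", "chr1"):
    print(name, "kept" if looks_canonical(name) else "skipped")
```

Under this assumed pattern, any name starting with NC is dropped regardless of what follows, which would explain why renaming the same sequence to Nt_001526.4 lets it through, and why RefSeq-style primary chromosome names are excluded along with alternative contigs.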

You could turn off this behavior by calling access.do_access(..., skip_noncanonical=False) through cnvlib: https://github.com/etal/cnvkit/blob/master/cnvlib/access.py#L15
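A rough sketch of that workaround follows. Only `access.do_access` and its `skip_noncanonical` keyword come from the comment above; passing the FASTA path as the first positional argument, writing the result with `skgenome.tabio.write(..., "bed3")`, and the wrapper name `write_full_access_bed` are assumptions to check against your installed CNVkit version.

```python
def write_full_access_bed(fasta_path, bed_path):
    """Write accessible regions to a BED file without the contig-name filter."""
    # Imported lazily so this sketch can be loaded without CNVkit installed.
    from cnvlib import access
    from skgenome import tabio

    # skip_noncanonical=False is the switch mentioned in the comment above.
    # Assumption: the FASTA path is the first positional argument.
    regions = access.do_access(fasta_path, skip_noncanonical=False)

    # Assumption: tabio.write accepts the region table, an output path,
    # and the "bed3" format name.
    tabio.write(regions, bed_path, "bed3")
    return regions
```

With CNVkit installed, calling `write_full_access_bed("ref.fa", "access.bed")` (placeholder file names) should produce a BED file that also covers the NC-named sequences.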