etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
547 stars 165 forks source link

"cnvkit.py call --filter cn" crashes on chrY #202

Closed suhrig closed 7 years ago

suhrig commented 7 years ago

Dear Eric,

The cn-filter of the call module crashes quite a number of my samples when processing chrY. I have not been able to figure out when exactly this happens. All I can say is that I have only observed it on chrY and it must have something to do with the coordinates, because if I play around with coordinates in the VCF file the problem vanishes. Specifically, moving the SNP into one of the segments solves the error, but I have seen situations where there were no SNPs outside segments and all segments had at least one SNP and yet the program crashed. Can you kindly look into this issue?

Thanks, Sebastian

Here is a minimal artificial example to reproduce the error:

This is the command and the error:

$ cnvkit.py call test.cns -o test.cn_filtered.cns -v test.vcf -i tumor --filter cn --purity 0.570837228458
Selected test sample tumor
Using INFO DP
Loaded 1 records; skipped: 0 somatic, 0 depth
Kept 1 heterozygous of 1 VCF records
*WARNING* No X found in probes; check the input
Treating sample gender as male
Rescaling sample with purity 0.570837, ploidy 2
Calling copy number with thresholds: -1.1 => 0, -0.25 => 1, 0.2 => 2, 0.7 => 3
Applying filter 'cn'
Traceback (most recent call last):
  File "/opt/cnvkit_virtualenv/bin/cnvkit.py", line 4, in <module>
    __import__('pkg_resources').run_script('CNVkit==0.8.3.dev0', 'cnvkit.py')
  File "/opt/cnvkit_virtualenv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 743, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/opt/cnvkit_virtualenv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1505, in run_script
    exec(script_code, namespace, namespace)
  File "/opt/cnvkit_virtualenv/lib/python2.7/site-packages/CNVkit-0.8.3.dev0-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 13, in <module>

  File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 848, in _cmd_call

  File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 912, in do_call

  File "build/bdist.linux-x86_64/egg/cnvlib/segfilters.py", line 30, in wrapped_f
  File "build/bdist.linux-x86_64/egg/cnvlib/segfilters.py", line 137, in cn
  File "build/bdist.linux-x86_64/egg/cnvlib/segfilters.py", line 53, in squash_by_groups
  File "/opt/cnvkit_virtualenv/lib/python2.7/site-packages/pandas/core/groupby.py", line 694, in apply
    return self._python_apply_general(f)
  File "/opt/cnvkit_virtualenv/lib/python2.7/site-packages/pandas/core/groupby.py", line 698, in _python_apply_general
    self.axis)
  File "/opt/cnvkit_virtualenv/lib/python2.7/site-packages/pandas/core/groupby.py", line 1601, in apply
    res = f(group)
  File "build/bdist.linux-x86_64/egg/cnvlib/segfilters.py", line 86, in squash_region
  File "build/bdist.linux-x86_64/egg/cnvlib/descriptives.py", line 124, in weighted_median
IndexError: index 0 is out of bounds for axis 0 with size 0

Contents of test.cns:

chromosome  start   end gene    log2    depth   probes  weight
Y   2655076 10104053    DUMMY1  0.0829199   31.8973 114 56.0337
Y   13105052    28657037    DUMMY2  0.0687064   33.4669 334 154.999

Contents of test.vcf:

##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
##contig=<ID=Y,length=59373566>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Read Position Bias">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. Note: this version may be broken.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="B-allele count">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  tumor
Y   57185053    .   G   T   220 .   DP=24;VDB=1.587579e-01;RPB=-1.039402e+00;AF1=0.5;AC1=1;DP4=6,3,9,5;MQ=41;FQ=148;PV4=1,0.3,0.38,1    GT:PL:GQ:AD 0/1:250,0,175:99:14
etal commented 7 years ago

Thanks for reporting. I'm working on a fix now.

etal commented 7 years ago

This is fixed in the development version now, if you'd like to try it out. Thanks again for the helpful test files.