bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

CNVkit segmentation exception; KeyError: 'weight' #1590

Closed tkoomar closed 7 years ago

tkoomar commented 8 years ago

New error during creation of the sorted segmentation file, where CNVkit complains of KeyError: 'weight'. I initially thought this might be similar to issue #1441, but after more poking around I do not believe it is related to coverage at all. Unfortunately, my relative lack of Python experience is making it difficult to determine whether the error is with CNVkit or bcbio.

Traceback is below, gist of full debug log here.

[2016-10-04T23:30Z] Traceback (most recent call last):
[2016-10-04T23:30Z]   File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/bin/cnvkit.py", line 13, in <module>
[2016-10-04T23:30Z]     args.func(args)
[2016-10-04T23:30Z]   File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 726, in _cmd_segment
[2016-10-04T23:30Z]     processes=args.processes)
[2016-10-04T23:30Z]   File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/cnvlib/segmentation/__init__.py", line 34, in do_segmentation
[2016-10-04T23:30Z]     save_dataframe, rlibpath)
[2016-10-04T23:30Z]   File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/cnvlib/segmentation/__init__.py", line 130, in _do_segmentation
[2016-10-04T23:30Z]     transfer_fields(segarr, cnarr)
[2016-10-04T23:30Z]   File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/cnvlib/segmentation/__init__.py", line 179, in transfer_fields
[2016-10-04T23:30Z]     segweights[i] = subprobes['weight'].sum()
[2016-10-04T23:30Z]   File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/cnvlib/gary.py", line 130, in __getitem__
[2016-10-04T23:30Z]     return self.data[index]
[2016-10-04T23:30Z]   File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 1997, in __getitem__
[2016-10-04T23:30Z]     return self._getitem_column(key)
[2016-10-04T23:30Z]   File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 2004, in _getitem_column
[2016-10-04T23:30Z]     return self._get_item_cache(key)
[2016-10-04T23:30Z]   File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 1350, in _get_item_cache
[2016-10-04T23:30Z]     values = self._data.get(item)
[2016-10-04T23:30Z]   File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 3290, in get
[2016-10-04T23:30Z]     loc = self.items.get_loc(item)
[2016-10-04T23:30Z]   File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/pandas/indexes/base.py", line 1947, in get_loc
[2016-10-04T23:30Z]     return self._engine.get_loc(self._maybe_cast_indexer(key))
[2016-10-04T23:30Z]   File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
[2016-10-04T23:30Z]   File "pandas/index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)
[2016-10-04T23:30Z]   File "pandas/hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)
[2016-10-04T23:30Z]   File "pandas/hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)
[2016-10-04T23:30Z] KeyError: 'weight'
[2016-10-04T23:32Z] Uncaught exception occurred
Traceback (most recent call last):
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
     _do_run(cmd, checks, log_stdout)
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
     raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; unset R_HOME && export PATH=/Dedicated/jmichaelson-wdata/bcbio/anaconda/bin:$PATH && /Dedicated/jmichaelson-wdata/bcbio/anaconda/bin/cnvkit.py segment -o /Dedicated/jmichaelson-sdata/SLI_WGS/batch10/fq/sample_51/structural/sample51/cnvkit/raw/tx/tmpLVW5Ul/sample51-sort.cns /Dedicated/jmichaelson-sdata/SLI_WGS/batch10/fq/sample_51/structural/sample51/cnvkit/raw/sample51-sort.cnr -v /Dedicated/jmichaelson-sdata/SLI_WGS/batch10/fq/sample_51/freebayes/sample51-effects-filter-sample51.vcf.gz --threshold 0.00001
Segmented on allele freqs in 7:100619404-100637928
Segmented on allele freqs in 7:136519357-136521870
Segmented on allele freqs in 7:143397459-143420484
Segmented on allele freqs in 7:143550765-143576777
Segmented on allele freqs in 8:101570-124545
Segmented on allele freqs in 8:6503638-6549656
Segmented on allele freqs in 8:7018071-7026653
Segmented on allele freqs in 8:7803343-7873310
Segmented on allele freqs in 8:7873309-7879807
Segmented on allele freqs in 8:12206517-12233494
Segmented on allele freqs in 8:12235491-12251477
Segmented on allele freqs in 8:35386267-35389788
Segmented on allele freqs in 8:37334422-37383903
Segmented on allele freqs in 8:74381272-74476829
Segmented on allele freqs in 8:86554900-86556392
Segmented on allele freqs in 9:39260803-39399278
Segmented on allele freqs in 9:39444360-39472012
Segmented on allele freqs in 9:39472011-39492368
Segmented on allele freqs in 9:40798375-40818392
Segmented on allele freqs in 9:40818391-40838408
Segmented on allele freqs in 9:42719806-42755328
Segmented on allele freqs in 9:43064426-43127419
Segmented on allele freqs in 9:44411449-44473971
Segmented on allele freqs in 9:44473970-44676889
Segmented on allele freqs in 9:46999708-47009706
Segmented on allele freqs in 9:97105289-97109291
Segmented on allele freqs in 9:138931332-138947315
Segmented on allele freqs in 9:141121999-141124005
Segmented on allele freqs in 10:31117653-31246782
Segmented on allele freqs in 10:46371106-46560475
Segmented on allele freqs in 10:48726873-48744889
Segmented on allele freqs in 10:51596590-51605590
Segmented on allele freqs in 10:51836557-51847057
Segmented on allele freqs in 10:51902548-52000562
Segmented on allele freqs in 10:102027254-102068734
Segmented on allele freqs in 11:4968335-4976833
Segmented on allele freqs in 11:55431728-55457653
Segmented on allele freqs in 12:6040389-6041891
Segmented on allele freqs in 12:129572596-129574087
Segmented on allele freqs in 15:21312735-21318466
Segmented on allele freqs in 15:23673463-23678948
Segmented on allele freqs in 15:28661636-28668167
Segmented on allele freqs in 15:83181855-83213852
Segmented on allele freqs in 15:102310961-102327974
Segmented on allele freqs in 15:102432756-102445801
Segmented on allele freqs in 16:15200398-15206397
Segmented on allele freqs in 16:15206396-15239885
Segmented on allele freqs in 16:21779042-21808412
Segmented on allele freqs in 16:21808411-21849400
Segmented on allele freqs in 16:21849399-21877392
Segmented on allele freqs in 16:21877391-21941374
Segmented on allele freqs in 16:28683604-28722605
Segmented on allele freqs in 16:90173587-90184588
Segmented on allele freqs in 17:5625588-5631531
Segmented on allele freqs in 17:34593535-34651846
Segmented on allele freqs in 17:36323638-36333158
Segmented on allele freqs in 17:75992746-76103181
Segmented on allele freqs in 17:76282073-76283573
Segmented on allele freqs in 17:76416376-76450397
Segmented on allele freqs in 17:77462714-77623550
Segmented on allele freqs in 19:843093-863094
Segmented on allele freqs in 19:8793868-8851875
Segmented on allele freqs in 19:55307346-55329840
Segmented on allele freqs in 19:56241253-56264380
Segmented on allele freqs in 21:11000029-11007527
Segmented on allele freqs in 21:11007526-11020521
Segmented on allele freqs in 21:15140596-15192721
Segmented on allele freqs in 22:21516539-21520034
Segmented on allele freqs in X:49176492-49211495
Segmented on allele freqs in X:72068157-72164174
Segmented on allele freqs in X:129309442-129652186
Segmented on allele freqs in X:154588770-154715080
Traceback (most recent call last):
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/bin/cnvkit.py", line 13, in <module>
     args.func(args)
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 726, in _cmd_segment
     processes=args.processes)
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/cnvlib/segmentation/__init__.py", line 34, in do_segmentation
     save_dataframe, rlibpath)
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/cnvlib/segmentation/__init__.py", line 130, in _do_segmentation
     transfer_fields(segarr, cnarr)
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/cnvlib/segmentation/__init__.py", line 179, in transfer_fields
     segweights[i] = subprobes['weight'].sum()
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/cnvlib/gary.py", line 130, in __getitem__
     return self.data[index]
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 1997, in __getitem__
     return self._getitem_column(key)
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 2004, in _getitem_column
     return self._get_item_cache(key)
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 1350, in _get_item_cache
     values = self._data.get(item)
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 3290, in get
     loc = self.items.get_loc(item)
 File "/Dedicated/jmichaelson-wdata/bcbio/anaconda/lib/python2.7/site-packages/pandas/indexes/base.py", line 1947, in get_loc
     return self._engine.get_loc(self._maybe_cast_indexer(key))
 File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
 File "pandas/index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)
 File "pandas/hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)
 File "pandas/hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)
KeyError: 'weight'
' returned non-zero exit status 1

etal commented 8 years ago

The weight column might be getting dropped when CNVkit segments the allele frequencies -- is there a way to disable that step in bcbio, i.e. not pass the VCF to the segment command?
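One way to narrow this down is to check whether the weight column is already missing from the input .cnr file, or only disappears during segmentation. The sketch below assumes a .cnr file is tab-separated text with a header row. The helper name `missing_columns` and the list of required columns are hypothetical choices for illustration, not part of CNVkit, and the sample data is made up.

```python
import csv
import io


def missing_columns(handle, required=("chromosome", "start", "end", "log2", "weight")):
    """Return the required column names absent from a tab-separated header.

    Hypothetical helper; the default `required` tuple is an assumption about
    which columns segmentation needs, not taken from CNVkit's source.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in required if name not in header]


# Made-up two-line example standing in for a real .cnr file without weights.
example = io.StringIO(
    "chromosome\tstart\tend\tgene\tlog2\tdepth\n"
    "7\t100619404\t100637928\t-\t0.12\t31.5\n"
)
print(missing_columns(example))
```

For a real file, pass `open(path, newline="")` instead of the in-memory example. An empty result would suggest the column is present in the input and is lost later, which is consistent with the hypothesis above.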

etal commented 8 years ago

I added a unit test for segmentation with a VCF, and it passes on my machine under the current CNVkit and pandas versions 0.18.1 and 0.19 on Python 3. That suggests the error may depend on the environment, system configuration, or versions of installed dependencies. In particular, pandas v0.19 was released earlier this week and may have changed how DataFrame columns are added or filled during some operations. Does any of this sound plausible?
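To compare environments, it may help to report the installed versions of the relevant packages. This is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`) and assuming the distributions are registered under the names `pandas` and `CNVkit`. The Python 2.7 environment shown in the traceback would need a different mechanism. The function name `report_versions` is hypothetical.

```python
from importlib import metadata


def report_versions(names=("pandas", "CNVkit")):
    """Map each distribution name to its installed version, or None if absent.

    The default names are assumptions about how the packages are registered.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in report_versions().items():
    print(name, version or "not installed")
```

Running this with the same interpreter that bcbio uses, rather than a system Python, is what makes the comparison meaningful.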

tkoomar commented 8 years ago

Yes, it seems more and more likely that the error is not in the cnvkit.py segment command itself:

etal commented 8 years ago

Could you try running the original CNVkit command by itself, outside of bcbio but using the same cnvkit.py? It was:

set -o pipefail
unset R_HOME && \
  export PATH=/Dedicated/jmichaelson-wdata/bcbio/anaconda/bin:$PATH && \
  /Dedicated/jmichaelson-wdata/bcbio/anaconda/bin/cnvkit.py segment \
    /Dedicated/jmichaelson-sdata/SLI_WGS/batch10/fq/sample_51/structural/sample51/cnvkit/raw/sample51-sort.cnr \
    -v /Dedicated/jmichaelson-sdata/SLI_WGS/batch10/fq/sample_51/freebayes/sample51-effects-filter-sample51.vcf.gz \
    -o /Dedicated/jmichaelson-sdata/SLI_WGS/batch10/fq/sample_51/structural/sample51/cnvkit/raw/tx/tmpLVW5Ul/sample51-sort.cns \
    --threshold 0.00001

This might emit some more warning messages to indicate what happened.

tkoomar commented 8 years ago

Here is a gist of output

Not a lot more detail than the bcbio debug log provides by itself, but hopefully there's something I'm just not picking up.

etal commented 8 years ago

I think I've identified the problem and fixed it in the development version of CNVkit. Are you able to test that directly? If not, I'll roll another release soon so it can be included in bcbio.

tkoomar commented 8 years ago

For the time being, I have removed CNVkit from my bcbio pipeline, but I will try to get a standalone development version of CNVkit running to do a bit more testing.

etal commented 8 years ago

I've released a new version of CNVkit with this putative fix. The conda build should be available in a few minutes or hours; care to update and try it out once it lands?

etal commented 7 years ago

The conda build for CNVkit 0.8.1 on Linux should be available now.

chapmanb commented 7 years ago

Thanks, Eric, for the fix. Based on feedback in #1647, it looks like 0.8.1 resolves the issue, so I'll close this and we can re-open if anyone runs into additional problems.