bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
983 stars 354 forks source link

cnvkit segmentation fault #2127

Closed MBeyens closed 6 years ago

MBeyens commented 6 years ago

Hallo

I'm trying out the newest development code, and got an segmentation fault error when running the cnvkit segmentation functionality. Running on single core, gives practically the same error output. https://gist.github.com/MBeyens/f5869c46d42d824429078afffebb9c30

Further, I was wondering if it was possible to set more than normal sample for comparison with the tumor (grouping normals for CNV calling is stronger), as this functionality is working on the given "batch" naming. This is problematic for paired variant calling, as only one normal is given by the "batch" and you do not want to include none patient-specific normal tissue.

Thank you in advance

Matthias

chapmanb commented 6 years ago

Matthias; Sorry about the CNVkit issues. I'm cc'ing in @etal to see if he has any ideas but just a segfault doesn't give much to go on. Is it possible to reproduce the error outside of bcbio:

export PATH=/mnt/home_nfs1/mbeyens/local/share/bcbio/anaconda/bin:$PATH && /mnt/home_nfs1/mbeyens/local/share/bcbio/anaconda/bin/cnvkit.py segment -p 1 -o P00132893_tumor-sort-1.cns /mnt/home_nfs1/shared_data_core/PNET/10PNET_FF_gitta/raw_data/paired_analysis/work/structural/P00132893_tumor/cnvkit/raw/P00132893_tumor-sort-1.cnr -v /mnt/home_nfs1/shared_data_core/PNET/10PNET_FF_gitta/raw_data/paired_analysis/work/mutect2/1-effects-annotated.vcf.gz

If so, can you try removing the -v variant option to see if that might be causing the problem? If so and you could share this file we could try and reproduce and fix from there.

We're definitely planning to add more flexibility around defining the background samples. We've been working on generalizing the coverage estimation in this release and this is something we hope to tackle in the next couple of weeks. Right now it's just tumor/normal based on the batch but we'll plan to introduce new ways to manage this so you can specify both variant calling and different batching for CNV calling. Thanks for the suggestion and discussion.

MBeyens commented 6 years ago

Hi Brad, I ran the command without the -v option, and it generated the segmentation file. So the -v option/MuTect2 VCF seems the problem at the moment. I also added the MuTect2 VCF and the P00132893_tumor-sort-1.cnr. Let me know if you need more information!

etal commented 6 years ago

If you can reproduce the error outside bcbio you'll see more log messages, I think. As it is the segfault could be happening inside DNAcopy (R package), pysam, or pandas, theoretically. Could hitting a memory limit result in a segfault?

chapmanb commented 6 years ago

Matthias -- thanks for the example and files. Eric, thanks much for the thoughts. I wasn't able to reproduce this example files provided, things processed and correctly segmented without a segfault. I'm not sure what is different about our systems but here are the versions of the relevant libraries I have:

$ bcbio_conda list | grep pysam
pysam                     0.11.2.2                 py27_1    bioconda
$ bcbio_conda list | grep cnvkit
cnvkit                    0.9.0                    py27_0    bioconda
$ bcbio_conda list | grep pscbs
r-pscbs                   0.63.0                 r3.4.1_0    bioconda

If those match what you're seeing, is it possible to paste the output of the standalong run when it segfaults? That would help see exactly where it dies and might help isolate the issue. Thanks for the help debugging.

MBeyens commented 6 years ago

I have the same library versions. I don't get your point of running the code as a "standalone" or "outside bcbio"... I do not restart the bcbio run on itself, just the single command cnvkit.py segment -p 1 -o P00132893_tumor-sort-1.cns P00132893_tumor-sort-1.cnr -v 1-effects-annotated.vcf.gz. What I tried was running it with different segmentation algorithms -m option: cbs (default), haar, flasso, and keep on getting the Segmentation fault. If you clarify what you mean with running "outside bcbio", I hope a can help you guys.

chapmanb commented 6 years ago

Matthias; Sorry for the confusion. By outside bcbio I meant just running the cnvkit.py segment command line standalone which you're doing. Does it generate any output at all or just segfault immediately when you run it? The output I see is:

$ cnvkit.py segment -o out.cns P00132893_tumor-sort-1.cnr -v variants.vcf.gz
Selected test sample P00132893_tumor and control sample P00132893_normal
Loaded 3863 records; skipped: 0 somatic, 609 depth
VCF normal sample's genotypes are all 0/0 or missing; inferring genotypes from allele frequency instead
Skipping 203 additional somatic record based on T/N genotypes
Kept 18 heterozygous of 3660 VCF records
Segmenting with method 'cbs', significance threshold 0.0001
Dropped 6 outlier bins:
  chromosome     start       end               gene     log2  depth    weight
0       chr8   4852166   4852268              CSMD1 -24.0052    0.0  0.210000
1       chr8  57906215  57906317             IMPAD1 -23.8567    0.0  0.126040
2      chr10  17496143  17496295            ST8SIA6 -23.9927    0.0  0.207660
3      chr11   1578131   1578377              DUSP8 -23.6502    0.0  0.221612
4      chr15  45479899  45480149  RP11-519G16.2,SHF -24.0507    0.0  0.340701
5      chr17  21280113  21280225             KCNJ12 -24.0147    0.0  0.279605
Wrote out.cns with 113 regions

Sorry for all the back and forth, I don't have a good idea right now why it fails on your system so trying to come up with a reproducible example here so we can help fix. Thanks much for the help debugging.

MBeyens commented 6 years ago

I get only the following: Selected test sample P00132893_tumor and control sample P00132893_normal Segmentation fault That's it actually =/

etal commented 6 years ago

The function skgenome.tabio.vcfio.read_vcf prints with the message you saw after parsing the VCF header, but before parsing the VCF body with pysam. After parsing is finished, you would see another message: "Loaded records; skipped: somatic, _ depth" -- but you didn't, so it looks like the segfault happened while parsing the VCF body, i.e. within pysam.VariantFile at one of its calls into htslib.

Could the linked htslib shared library have been built with a different compiler/platform/environment?

MBeyens commented 6 years ago

I do not have any clue. The running htslib version is 1.6. Is this correct? Maybe an option would be recompiling the htslib?

chapmanb commented 6 years ago

Matthias; Apologies, but I'm stumped on what is going on here and why we can't reproduce the error. I pushed some speculative fixes that I hope might avoid the problem -- updating pysam to explicitly compile against htslib 1.6 and improving how bcbio passes tumor/normal pairs to CNVkit segmentation. Could you try updating with bcbio_nextgen.py upgrade -u development and re-running to see if that helps with the problem?

If not, is it possible the file you sent is somehow different from the input file? I know you had unzipped it, to submit as a gist. Maybe there is something problematic with the bgzip preparation as well. If you could send to me directly (e-mail in GitHub profile) or post the original file in Dropbox or Google Drive maybe that would provide some insight.

Sorry to not know exactly what is going on. Fingers crossed these fixes help.

MBeyens commented 6 years ago

Hi Brad, I checked the bgzip functionality, seems not the problem. Segmentation fault keeps on the uncompressed vcf. Also, no errors occurred during the compression and rebuilding it did not solved it either. Next, I updated to the latest development version, and now everything works well. I saw you now explicitly specify unset R path and R lib, so maybe the issue is in there... I'll dig into this later on and if I've more information I'll update you about this.

Thank you for the problem solving @chapmanb and @etal

Biobarani22 commented 6 years ago

Hi guys, I need your valuable help. After running CNVKIT gain-loss function by setting CN cutoff of 0.6 in the segmentation file, GISTIC showing following error message "1472 segment overlaps detected in the file, First overlap detected between segments at lines 9888 and 9889". Your help is highly appreciated. Thank you very much in advance.

chapmanb commented 6 years ago

Thanks for the report -- is this a problem running CNVkit through bcbio, or with a by hand run of CNVkit? If the later, you'd be best off reporting this on the CNVkit issue tracker (https://github.com/etal/cnvkit/issues) since they'll have a better ability to debug. Either way, we could use additional information about the problem to help reproduce identify the issue:

Hope this helps.

Biobarani22 commented 6 years ago

@chapmanb, Thanks for your timely reply. I ran CNVKIT, in by hand run only. As u suggested I also posted on the CNVkit issue tracker. Hope I will get some suggestions from there. Thank you once again.

chapmanb commented 6 years ago

Matthias; Following up on your initial question about pooling normal samples, the latest development version of bcbio will now use all normal samples in a project for calculating the background for tumor normalization. It batches these by variant regions BEDs but you can also specify more specifically by adding a prep_method definition to the metadata if it shouldn't batch some samples:

http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#sample-information

Happy for any feedback if you have time to test this on your samples. Thanks again for the discussion and ideas.

MBeyens commented 6 years ago

That's great news! Thank you for the implementation @chapmanb.

So to recapitulate, I need to set the prep_method parameter both in tumor/disease and normal/control sample metadata, and I can put a text like prep_method: pool1 for tumor/diseased CNV calling to the pooled normal/control samples background defined as pool1? Like the added code groups[(cnv_file, dd.get_prep_method(data))].append((items, data, work_dir)) .

By default it is [empty], I assume the CNV calling is done only patient-wise, e.g. tumor/disease1 vs normal/control1 (and thus not a "pooled" normal)?

If I'm interpreting this correctly, I'll test this right away! This functionality is available in the latest development version, right?

chapmanb commented 6 years ago

Matthias; All sounds right, except if you leave prep_method empty then it will pool all normals together with the same capture/region BEDs. So it'll try to do the right thing without any changes, but you can define more specifically into pools as needed. It is in the latest development (bcbio_nextgen.py upgrade -u development). Please let us know if you run into any problems and thanks again for testing.

MBeyens commented 6 years ago

Hi Brad; I'm testing the newest functionality you implemented and got the following error.

[2017-11-17T20:55Z] CNVkit fix [2017-11-17T20:55Z] usage: cnvkit.py [-h] [2017-11-17T20:55Z] {batch,target,access,antitarget,autobin,coverage,reference,fix,segment,call,diagram,scatter,heatmap,breaks,gainloss,sex,gender,metrics,segmetrics,import-picard,import-seg,import-theta,export,version}\n [2017-11-17T20:55Z] ... [2017-11-17T20:55Z] cnvkit.py: error: unrecognized arguments: --sample-id P00138533_tumor [2017-11-17T20:55Z] Uncaught exception occurred Traceback (most recent call last): File "/home/mbeyens/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 23, in run _do_run(cmd, checks, log_stdout, env=env) File "/home/mbeyens/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 103, in _do_run raise subprocess.CalledProcessError(exitcode, error_msg) CalledProcessError: Command 'set -o pipefail; export TMPDIR=/mnt/home_nfs1/paired_analysis/work/bcbiotx/tmplUC5db && /mnt/home_nfs1/local/share/bcbio/anaconda/bin/cnvkit.py fix -o /mnt/home_nfs1/paired_analysis/work/bcbiotx/tmplUC5db/P00138533_tumor-normalized.cnr /mnt/home_nfs1/paired_analysis/work/structural/P00138533_tumor/bins/P00138533_tumor-target-coverage.cnn /mnt/home_nfs1/paired_analysis/work/structural/P00138533_tumor/bins/P00138533_tumor-antitarget-coverage.cnn /mnt/home_nfs1/paired_analysis/work/structural/P00138533_tumor/bins/background-0-cnvkit.cnn --sample-id P00138533_tumor usage: cnvkit.py [-h] {batch,target,access,antitarget,autobin,coverage,reference,fix,segment,call,diagram,scatter,heatmap,breaks,gainloss,sex,gender,metrics,segmetrics,import-picard,import-seg,import-theta,export,version} ... cnvkit.py: error: unrecognized arguments: --sample-id P00138533_tumor ' returned non-zero exit status 2

It appears --sample-id is a none existing/available argument under the cnvkit.py fix. Currently running cnvkit version 0.9.0. Hopefully this helps you to debug, otherwise I provide more information.

etal commented 6 years ago

I think that option is only in the recent CNVkit 0.9.1 release, which is not yet in the bioconda repo: bioconda/bioconda-recipes#6837

chapmanb commented 6 years ago

Matthias -- Eric is exactly right, you need 0.9.1 to have this functionality. If you're using bioconda, it has the latest 0.9.1a0 version which I built just prior to the release so updating to that will work right away. bcbio should install this version if you update with bcbio_nexgen.py upgrade -u development, bcbio_nextgen.py upgrade --tools or just bcbio_conda install -c conda-forge -c bioconda cnvkit. Hope this fixes it for you.