Sorry about the issue with your CNVkit run. I haven't seen this before, so I'm not totally sure what happened. It looks like CNVkit is complaining about mismatches between the cnr and target regions files, but I'm not sure under what conditions this might happen. @etal, do you have any pointers on where we might look to dig further into what happened?
Hi, sorry for the trouble. Was the reference file 10030Germ_background.cnn generated by the bcbio pipeline automatically during this run, or supplied in some other way where a mismatch would be possible? The error is saying that the on-target (1_2015-11-18_project-merged-sort.targetcoverage.cnn) and off-target (1_2015-11-18_project-merged-sort.antitargetcoverage.cnn) bin coverages include more genomic regions than the provided reference (10030Germ_background.cnn), which should have been constructed from similar files covering the same bins.
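If it helps with debugging, here is a minimal sketch of how to see which bins disagree, assuming the standard CNVkit .cnn layout (tab-separated, chromosome/start/end as the first three columns, one header line) and using the file names from this run:
cut -f1-3 1_2015-11-18_project-merged-sort.targetcoverage.cnn | tail -n +2 | sort > sample_target_bins.txt
cut -f1-3 10030Germ_background.cnn | tail -n +2 | sort > reference_bins.txt
comm -23 sample_target_bins.txt reference_bins.txt | head   # bins present in the sample coverage but absent from the reference
The antitarget coverage file can be checked the same way.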
Possible clues:
I'm happy to look at the three .cnn files that caused this error if you're able to share them -- you can attach them to this issue.
Hi @pengxiao78
I am closing this because it seems to be an old issue. Come back if you find other issues or want to continue with this one.
cheers
Hi, may I reopen this issue, as I have faced the same problem (but I am missing 1767 bins instead)? Otherwise, let me know if I should open a new issue.
Here are your requested files: cnn.zip To answer your questions:
Thank you!
Thanks for the details, @phu5ion. I've confirmed the issue on CNVkit 0.8.4: it works with the flat reference, but not the pooled reference you provided. I'm looking into the details now.
It looks like the antitargets are the same across both references, but the targets are binned a little differently in the pooled reference versus the sample target coverages and the flat reference. The pooled reference has 6707 on-target bins with an average bin size of 55 bp, whereas the latter two have 6500 bins at 57 bp.
Descriptive statistics for bin sizes:
        | reference-flat.fg.bed | reference-pool.fg.bed | Sample.targetcoverage.bed
count   | 6500                  | 6707                  | 6500
mean    | 56.950769             | 55.193082             | 56.950769
std     | 6.566856              | 6.263345              | 6.566856
min     | 5                     | 5                     | 5
25%     | 54                    | 52                    | 54
50%     | 57                    | 55                    | 57
75%     | 59.25                 | 58                    | 59.25
max     | 85                    | 82                    | 85
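These bin sizes are just interval lengths (end minus start) from each BED file; a rough sketch of how the count and mean columns could be reproduced, using one of the file names above:
awk '{ n += 1; s += $3 - $2 } END { print "count", n; print "mean", s / n }' reference-pool.fg.bed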
In CNVkit, the on-target bin boundaries are defined by the target command, and these boundaries are kept by the subsequent coverage, reference and fix commands that create the sample .cnn, reference .cnn (pooled or flat), and .cnr files. As I understand it, bcbio only runs target once. So in your bcbio run, some file(s) must have existed from a previous run and were reused -- the target BED, sample .cnn files used to create the pooled reference, or the pooled reference itself. The CNVkit binning algorithm changed slightly in version 0.8.3, so the previous run must have been done before that upgrade.
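For reference, the stages described above map onto CNVkit's command line roughly like this -- a minimal sketch with made-up file names, not the exact invocations bcbio constructs:
cnvkit.py target baits.bed --split -o targets.bed                  # defines the on-target bin boundaries
cnvkit.py antitarget targets.bed -o antitargets.bed                # off-target bins between the targets
cnvkit.py coverage tumor.bam targets.bed -o tumor.targetcoverage.cnn
cnvkit.py coverage tumor.bam antitargets.bed -o tumor.antitargetcoverage.cnn
cnvkit.py coverage normal.bam targets.bed -o normal.targetcoverage.cnn
cnvkit.py coverage normal.bam antitargets.bed -o normal.antitargetcoverage.cnn
cnvkit.py reference normal.targetcoverage.cnn normal.antitargetcoverage.cnn -f genome.fa -o reference.cnn
cnvkit.py fix tumor.targetcoverage.cnn tumor.antitargetcoverage.cnn reference.cnn -o tumor.cnr
Every file downstream of target inherits its bin boundaries, which is why mixing a reference built from one targets.bed with coverages built from another leads to the "Reference is missing N bins" mismatch.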
Alternatively: bcbio now uses CNVkit's coverage_bin_size.py script to choose bin sizes. For hybrid capture, this script uses random sampling, i.e. it doesn't always give exactly the same bin size for the same input. Maybe this script is being used again in each bcbio run to recalculate the bin size for splitting the targets?
As of exactly one commit after the latest stable release, coverage_bin_size.py now gives repeatable results. Sorry for the trouble.
Eric -- thanks so much for tracking this down and for the fix. I'll bump the bioconda cnvkit version bcbio installs to 0.8.5.dev0 with this fix to avoid the problem. Thanks again.
Oops, I tried upgrading CNVkit to 0.8.5.dev0, upgrading bcbio (it's still stated as 1.0.2a0 but I'm sure I got the newest upgrade), deleting the whole "structural" folder, and rerunning bcbio, and there was an error (below). It seems like there's something wrong with the coverage_bin_size.py script?
[2017-02-20T02:57Z] CNVkit coverage bin estimation
[2017-02-20T02:57Z] Estimated read length 150.0
[2017-02-20T02:57Z] Traceback (most recent call last):
[2017-02-20T02:57Z] File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/bin/coverage_bin_size.py", line 158, in
The method is_monotonic_increasing is in pandas 0.19.1 and later; can you update pandas?
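To confirm which pandas the bcbio environment is actually using, something like this should work (path taken from the traceback above):
/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/bin/python -c "import pandas; print(pandas.__version__)"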
Thanks Eric! After I upgraded pandas with the command bcbio_conda install pandas, CNVkit seems to be running fine for now.
Hi,
The same problem is back although cnvkit was updated ([dlho@n010] cnvkit.py version: 0.8.5.dev0):
Processing target: 986_100215_Ovary_M-sort
Traceback (most recent call last):
File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/bin/cnvkit.py", line 13, in
It seems to be the same bin size problem. Kindly see the attached .cnn files: cnn.zip
I have noticed that in both cases where this happened (when I first reported this issue, and now), the failing sample was the second tumour of a multi-tumour batch (2 tumours to 1 normal). Is it possible that new bins were somehow generated for this second sample, yet the old normal files (used for the first sample) were used for the comparison?
Thanks much for the discussion and detailed debugging. I pushed a fix to the latest development version which I hope will resolve this by assigning batch names to the target/antitarget files, preventing re-use of different bins when we have multiple normals attached to a single tumor. It's written to be backwards compatible, so you'll need to remove the problem target/antitarget BEDs that are currently present and then re-run. Thank you again, and please let us know if you run into any other issues.
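A sketch of that clean-up step before re-running -- the path here is an assumption (the exact location of the target/antitarget BEDs depends on your bcbio version and work directory layout), so inspect before deleting anything:
find work/structural -name "*target*.bed"    # hypothetical location; matches both target and antitarget BEDs
# once you've confirmed these are the stale files, remove them and re-run bcbio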
I may have the same issue for tumor/normal batches where in each batch exactly one of a set of three normals is shared between batches. @phu5ion, did the patch fix your problem?
Also, cnvkit 0.8.5 is out now with changes to the antitarget code and deprecation of some other parts: https://github.com/etal/cnvkit/releases/tag/v0.8.5 Perhaps that fixes things as well.
Sven-Eric -- I updated the CNVkit bioconda recipe to 0.8.5, but this was a bug in bcbio that the latest development version fixes. We were re-using the normal binning from a previous tumor/normal batch, which could end up having different target bin sizes. Now we explicitly name these by batch to avoid the issue. You will have to remove the existing normal bins so they get re-generated (it tries to be backwards compatible to avoid re-computing these for users not doing multi-tumor one-normal batching), but otherwise it should hopefully fix the issue for you.
@schelhorn Yes the bcbio patch works to solve the problem of the shared normals.
I am afraid I have to reopen this topic. I am getting the same cnvkit error "Reference is missing 713730 bins found in P2_TA" when running multiple tumor:normal batches that share a normal, but I thought that issue was fixed... Using bcbio 1.0.9; cnvkit is 0.9.3. Hope you can help.
Sorry about the issues. Would you be able to post the full traceback of the error you're seeing? That would help me identify the step in the process and start digging into what is going wrong. We've reworked CNVkit binning and preparation, and it should be binning all overlapping batches together. The large number of bin differences you're seeing indicates to me there might be something else happening here. Thanks much for the help debugging.
Here is the tail of my error log where the error occurred. Note that I am analyzing exomes, which could explain the high number of bins (?). Thanks for your help. error.tail.log
Thanks much, this is a big help for getting started. I'm working on reproducing this so I can try to debug what is happening, and a couple of other things could help:
- The CNVkit commands run (grep cnvkit.py log/bcbio-nextgen-commands.log). This would help figure out what went into the background-5-cnvkit.cnn file relative to the P2_TB*coverage.cnn files.
- The portion of your configuration YAML related to the P2_TB batch. This would help me identify the layout of this batch relative to other inputs.
Thanks again for the information and help debugging this.
Thanks for looking into this. The cnvkit commands are attached. The YAML for P2_TB is below:
Thanks much for following up with all these details. The YAML and the command runs for this look correct to me at first glance; this doesn't appear to have any kind of complex batching that could potentially throw things off, and the same input files are used for both the target/antitarget and background generation.
The thing I noticed from the logs is that it looks like there were a couple of runs of this project based on the timestamps. It looks like there may have been some kind of failure during the first run in the middle of background calculation for this step, then it was re-run later and the target/antitarget BEDs got recalculated. Was there something problematic about the run that might have triggered this?
My thought at this point would be to look at the input files to determine if any are truncated:
/scratch/EBSCC/structural/P2_TB/bins/P2_TB-target-coverage.cnn
/scratch/EBSCC/structural/P2_TB/bins/P2_TB-antitarget-coverage.cnn
/scratch/EBSCC/structural/P2_TB/bins/background-5-cnvkit.cnn
If so, removing those and getting them regenerated by a clean re-run of the project might help resolve the issue. Hope this helps some for debugging.
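A quick sketch of the truncation check, using the paths above; with one header line per file, the background's line count should equal the target plus antitarget counts minus one (e.g. for P2_TA below: 702668 + 8170 - 1 = 710837):
wc -l /scratch/EBSCC/structural/P2_TB/bins/*.cnn            # compare bin counts across the three files
tail -c 1 /scratch/EBSCC/structural/P2_TB/bins/background-5-cnvkit.cnn | od -c   # a truncated file often ends without a newline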
Yes, there were several attempts to run. In one of them I deleted the whole structural folder to start from scratch, but it still led to the error. I only gave you part of the YAML, but there are multiple batches in my run and some share the same normals. After counting the number of bins for each sample, the number of bins matches between P2_N and P2_TA (batch P2_TAN), but not between P2_N and P2_TB (batch P2_TBN), which may be the source of the error (see below). I thought this bug was solved (earlier in the thread) and that bins were generated separately for each batch, but seeing this, I am not sure that is true.
[oharismendy@Oncogx-0020 structural]$ wc -l */bins/*
6136 P1_N/bins/P1_N-antitarget-coverage.cnn
741271 P1_N/bins/P1_N-target-coverage.cnn
747406 P1_TA/bins/background-3-cnvkit.cnn
6136 P1_TA/bins/P1_TA-antitarget-coverage.cnn
579141 P1_TA/bins/P1_TA-normalized.cnr
741271 P1_TA/bins/P1_TA-target-coverage.cnn
8170 P2_N/bins/P2_N-antitarget-coverage.cnn
702668 P2_N/bins/P2_N-target-coverage.cnn
710837 P2_TA/bins/background-0-cnvkit.cnn
8170 P2_TA/bins/P2_TA-antitarget-coverage.cnn
548055 P2_TA/bins/P2_TA-normalized.cnr
702668 P2_TA/bins/P2_TA-target-coverage.cnn
710837 P2_TB/bins/background-5-cnvkit.cnn
8732 P2_TB/bins/P2_TB-antitarget-coverage.cnn
714425 P2_TB/bins/P2_TB-target-coverage.cnn
5632 P3_N/bins/P3_N-antitarget-coverage.cnn
864633 P3_N/bins/P3_N-target-coverage.cnn
5632 P3_TA/bins/P3_TA-antitarget-coverage.cnn
864633 P3_TA/bins/P3_TA-target-coverage.cnn
10789 P4_N/bins/P4_N-antitarget-coverage.cnn
556056 P4_N/bins/P4_N-target-coverage.cnn
8468 P4_TA/bins/P4_TA-antitarget-coverage.cnn
615037 P4_TA/bins/P4_TA-target-coverage.cnn
4932 P4_TA_D287/bins/P4_TA_D287-antitarget-coverage.cnn
946653 P4_TA_D287/bins/P4_TA_D287-target-coverage.cnn
566844 P4_TA_EB62/bins/background-1-cnvkit.cnn
10789 P4_TA_EB62/bins/P4_TA_EB62-antitarget-coverage.cnn
521924 P4_TA_EB62/bins/P4_TA_EB62-normalized.cnr
556056 P4_TA_EB62/bins/P4_TA_EB62-target-coverage.cnn
8731 P5_N/bins/P5_N-antitarget-coverage.cnn
621334 P5_N/bins/P5_N-target-coverage.cnn
8731 P5_TA/bins/P5_TA-antitarget-coverage.cnn
621334 P5_TA/bins/P5_TA-target-coverage.cnn
737383 P5_TB_D831/bins/background-4-cnvkit.cnn
3768 P5_TB_D831/bins/P5_TB_D831-antitarget-coverage.cnn
737383 P5_TB_D831/bins/P5_TB_D831-normalized.cnr
733616 P5_TB_D831/bins/P5_TB_D831-target-coverage.cnn
758031 P5_TB_EB53/bins/background-2-cnvkit.cnn
9329 P5_TB_EB53/bins/P5_TB_EB53-antitarget-coverage.cnn
758031 P5_TB_EB53/bins/P5_TB_EB53-normalized.cnr
748703 P5_TB_EB53/bins/P5_TB_EB53-target-coverage.cnn
18220375 total
Sorry for the delay in following up on this, and thanks for the additional details. My guess from your description of the run is that you've got a different set of targets prepared due to the re-runs. The inputs to these coverage calculations also appear under coverage, so I wonder if it's possible that components of this were calculated with different batching during the re-runs, causing the disconnect?
I'm sorry not to have a good clue here, but I have not had any luck reproducing this, so either I don't understand the component of your setup that triggers the issue, or something went wrong during the re-runs. If you're still stuck on this, is re-running from a clean project an option? If you're able to isolate the issue to a smaller number of batches, like the ones you identified, and can reproduce with those, seeing the whole configuration might let me identify what went wrong.
Thanks again for the help debugging and hope this helps.
Thanks Brad. I think one thing I was doing wrong was repeating a sample entry when it belonged to different batches, instead of listing all of its batches at once. That solved my problem.
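For anyone who hits this later, a minimal sketch of that configuration difference as I understand bcbio's sample YAML -- the file and sample names are hypothetical and the batch names are borrowed from earlier in this thread, so adapt to your own config:
cat > shared-normal-example.yaml <<'EOF'
# Rather than repeating the shared normal once per batch as separate entries,
# give it a single entry that lists every batch it belongs to:
- files: P2_N.bam
  description: P2_N
  metadata:
    batch: [P2_TAN, P2_TBN]
    phenotype: normal
EOF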
Great news, I'm glad this helped some and your analysis got finished. Thanks for following up.
Hi, I am running the tumor-normal somatic mutation pipeline and added svcall: cnvkit, but found the following error.
[2015-11-21T23:55Z] .edu: Timing: structural variation initial
[2015-11-21T23:55Z] .edu: ipython: detect_sv
[2015-11-22T09:04Z] *.edu: Uncaught exception occurred
Traceback (most recent call last):
File "./bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
_do_run(cmd, checks, log_stdout)
File "./bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command './bcbio/anaconda/bin/cnvkit.py fix -o /project/work/structural/10030FFPE/cnvkit/raw/tx/tmpUVoi6n/1_2015-11-18_project-merged-sort.cnr /project/work/structural/10030FFPE/cnvkit/raw/1_2015-11-18_project-merged-sort.targetcoverage.cnn /project/work/structural/10030FFPE/cnvkit/raw/1_2015-11-18_project-merged-sort.antitargetcoverage.cnn /project/work/structural/10030FFPE/cnvkit/raw/10030Germ_background.cnn
Processing target: 1_2015-11-18_project-merged-sort
Traceback (most recent call last):
File "./bcbio/anaconda/bin/cnvkit.py", line 9, in
args.func(args)
File "./bcbio/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 564, in _cmd_fix
args.do_gc, args.do_edge, args.do_rmask)
File "./bcbio/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 574, in do_fix
do_gc, do_edge, False)
File "./bcbio/anaconda/lib/python2.7/site-packages/cnvlib/fix.py", line 24, in load_adjust_coverages
ref_matched = match_ref_to_probes(ref_pset, pset)
File "./bcbio/anaconda/lib/python2.7/site-packages/cnvlib/fix.py", line 88, in match_ref_to_probes
% (num_missing, probes.sample_id))
ValueError: Reference is missing 1674 bins found in 1_2015-11-18_project-merged-sort
' returned non-zero exit status 1