Reference missing bins - Githubissues

pengxiao78 commented 8 years ago

Hi, I am running tumor-normal somatic mutation pipeline and added svcall: cnvkit but found the following error.

[2015-11-21T23:55Z] .edu: Timing: structural variation initial [2015-11-21T23:55Z] .edu: ipython: detect_sv [2015-11-22T09:04Z] *.edu: Uncaught exception occurred Traceback (most recent call last): File "./bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run _do_run(cmd, checks, log_stdout) File "./bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run raise subprocess.CalledProcessError(exitcode, error_msg) CalledProcessError: Command './bcbio/anaconda/bin/cnvkit.py fix -o /project/work/structural/10030FFPE/cnvkit/raw/tx/tmpUVoi6n/1_2015-11-18_project-merged-sort.cnr /project/work/structural/10030FFPE/cnvkit/raw/1_2015-11-18_project-merged-sort.targetcoverage.cnn /project/work/structural/10030FFPE/cnvkit/raw/1_2015-11-18_project-merged-sort.antitargetcoverage.cnn /project/work/structural/10030FFPE/cnvkit/raw/10030Germ_background.cnn Processing target: 1_2015-11-18_project-merged-sort Traceback (most recent call last): File "./bcbio/anaconda/bin/cnvkit.py", line 9, in args.func(args) File "./bcbio/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 564, in _cmd_fix args.do_gc, args.do_edge, args.do_rmask) File "./bcbio/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 574, in do_fix do_gc, do_edge, False) File "./bcbio/anaconda/lib/python2.7/site-packages/cnvlib/fix.py", line 24, in load_adjust_coverages ref_matched = match_ref_to_probes(ref_pset, pset) File "./bcbio/anaconda/lib/python2.7/site-packages/cnvlib/fix.py", line 88, in match_ref_to_probes % (num_missing, probes.sample_id)) ValueError: Reference is missing 1674 bins found in 1_2015-11-18_project-merged-sort ' returned non-zero exit status 1

chapmanb commented 8 years ago

Sorry about the issue with your CNVkit run. I haven't seen this before so am not totally sure what happened. It looks like CNVkit is complaining about mismatches between the cnr and target regions files, but I'm not sure under what conditions this might happen. @etal, do you have any pointers/directions we might look to dig further into what happened?

etal commented 8 years ago

Hi, sorry for the trouble. Was the reference file 10030Germ_background.cnn generated by the bcbio pipeline automatically during this run, or supplied in some other way where a mismatch would be possible? The error is saying that the on-target (1_2015-11-18_project-merged-sort.targetcoverage.cnn) and off-target (1_2015-11-18_project-merged-sort.antitargetcoverage.cnn) bin coverages include more genomic regions than the provided reference (10030Germ_background.cnn), which should have been constructed from similar files covering the same bins.

Possible clues:

Are there exactly 1674 bins in any of the input files (.cnn, BED, etc.)? Are any of the input files empty (e.g. antitargets, if this was a targeted amplicon capture protocol)?
When the reference was built, did the log message say how many "target" and "antitarget" bins were dropped due to poor coverage?
Are there duplicated or overlapping bins in the "target" BED file?
Which version of bcbio-nextgen and CNVkit showed this error?

I'm happy to look at the three .cnn files that caused this error if you're able to share them -- either attach the files to this issue.

lpantano commented 7 years ago

Hi @pengxiao78

I am closing this because it seems an old issue. Come back if you find other issues or want to continue with this one.

cheers

phu5ion commented 7 years ago

Hi may I reopen this issue as I have faced the same problems (but I am missing 1767 bins instead)? Else let me know if I should open a new issue.

Here are your requested files: cnn.zip To answer your questions:

The reference file 852_271114_N_background.cnn was generated automatically by the bcbio pipeline.
I don't think there are exactly 1767 bins in any of the files. None of them are empty.
Targets: 125 (1.8637%) bins failed filters; Antitargets: 1968 (13.3424%) bins failed filters
I checked the -annotated.target.bed that bcbio pipeline generated for CNVkit, and there wasn't any duplicates. Besides, I have run cnvkit many times using the same target file, and this is the first time I encounter this issue.
Bcbio version: 1.0.2a0. CNVkit version: 0.8.3.dev0

Thank you!

etal commented 7 years ago

Thanks for the details, @phu5ion. I've confirmed the issue on CNVkit 0.8.4: it works with the flat reference, but not the pooled reference you provided. I'm looking into the details now.

etal commented 7 years ago

It looks like the antitargets are the same across both references, but the targets are binned a little differently in the pooled references versus the sample target coverages and the flat reference. The pooled reference has 6707 on-target bins with an average bin size of 55 bp, whereas the latter two have 6500 bins at 57 bp.

Descriptive statistics for bin sizes:

reference-flat.fg.bed: | reference-pool.fg.bed: | Sample.targetcoverage.bed:
count    6500          | count    6707          | count    6500
mean       56.950769   | mean       55.193082   | mean       56.950769
std         6.566856   | std         6.263345   | std         6.566856
min         5          | min         5          | min         5
25%        54          | 25%        52          | 25%        54
50%        57          | 50%        55          | 50%        57
75%        59.25       | 75%        58          | 75%        59.25
max        85          | max        82          | max        85

In CNVkit, the on-target bin boundaries are defined by the target command, and these boundaries are kept by the subsequent coverage, reference and fix commands that create the sample .cnn, reference .cnn (pooled or flat), and .cnr files. As I understand it, bcbio only runs target once. So in your bcbio run, some file(s) must have existed from a previous run and were reused -- the target BED, sample .cnn files used to create the pooled reference, or the pooled reference itself. The CNVkit binning algorithm changed slightly in version 0.8.3, so the previous run must have been done before that upgrade.

etal commented 7 years ago

Alternatively: bcbio now uses CNVkit's coverage_bin_size.py script to choose bin sizes. For hybrid capture this script uses random sampling, i.e. doesn't always give exactly the same bin size for the same input. Maybe this script is being used again in each bcbio run to recalculate the bin size for splitting the targets?

etal commented 7 years ago

Exactly one commit after a stable release, coverage_bin_size.py now gives repeatable results. Sorry for the trouble.

chapmanb commented 7 years ago

Eric -- thanks so much for tracking this down and for the fix. I'll bump the bioconda cnvkit version bcbio installs to 0.8.5.dev0 with this fix to avoid the problem. Thanks again.

phu5ion commented 7 years ago

Oops, I tried upgrading CNVkit to 0.8,5.dev(), upgrading bcbio (it's still stated as 1.0.2a0 but I'm sure I got the newest upgrade), deleting the whole "structural" folder and rerunning bcbio, and there was an error (below). Seems like there's something wrong with the coverage_bin_size.py script?

[2017-02-20T02:57Z] CNVkit coverage bin estimation [2017-02-20T02:57Z] Estimated read length 150.0 [2017-02-20T02:57Z] Traceback (most recent call last): [2017-02-20T02:57Z] File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/bin/coverage_bin_size.py", line 158, in [2017-02-20T02:57Z] fields = hybrid(rc_table, read_len, args.bam, targets, access) [2017-02-20T02:57Z] File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/bin/coverage_bin_size.py", line 51, in hybrid [2017-02-20T02:57Z] antitargets = access.subtract(targets) [2017-02-20T02:57Z] File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/cnvlib/genome/gary.py", line 583, in subtract [2017-02-20T02:57Z] return self.as_dataframe(subtract(self.data, other.data)) [2017-02-20T02:57Z] File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/cnvlib/genome/subtract.py", line 23, in subtract [2017-02-20T02:57Z] columns=table.columns) [2017-02-20T02:57Z] File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 939, in from_records [2017-02-20T02:57Z] first_row = next(data) [2017-02-20T02:57Z] File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/cnvlib/genome/subtract.py", line 27, in _subtraction [2017-02-20T02:57Z] for keeper, rows_to_exclude in by_ranges(other, table, 'outer', True): [2017-02-20T02:57Z] File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/cnvlib/genome/intersect.py", line 29, in by_ranges [2017-02-20T02:57Z] subranges): [2017-02-20T02:57Z] File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/cnvlib/genome/intersect.py", line 89, in iter_ranges [2017-02-20T02:57Z] ) and not table.end.is_monotonic_increasing: [2017-02-20T02:57Z] File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 2672, in getattr [2017-02-20T02:57Z] return object.getattribute(self, name) [2017-02-20T02:57Z] AttributeError: 'Series' object has no attribute 'is_monotonic_increasing'

etal commented 7 years ago

The method is_monotonic_increasing is in pandas 0.19.1 and later; can you update pandas?

phu5ion commented 7 years ago

Thanks Eric! After I upgraded pandas with the command bcbio_conda install pandas CNVkit seems to be running fine for now.

phu5ion commented 7 years ago

Hi,

The same problem is back although cnvkit was updated: [dlho@n010]cnvkit.py version 0.8.5.dev0

Processing target: 986_100215_Ovary_M-sort Traceback (most recent call last): File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/bin/cnvkit.py", line 13, in args.func(args) File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 447, in _cmd_fix args.do_gc, args.do_edge, args.do_rmask) File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/cnvlib/fix.py", line 19, in do_fix do_gc, do_edge, False) File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/cnvlib/fix.py", line 53, in load_adjust_coverages ref_matched = match_ref_to_sample(ref_cnarr, cnarr) File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/cnvlib/fix.py", line 123, in match_ref_to_sample % (num_missing, samp_cnarr.sample_id)) ValueError: Reference is missing 2561 bins found in 986_100215_Ovary_M-sort ' returned non-zero exit status 1

It seems to be the same bin size problem. Kindly see attached the .cnn files. cnn.zip

I have noticed that in the cases this happened (when I first reported this issue, and now), both times the failing sample was one that belonged to the second one of a multiple batch (2 tumour to 1 normal). Is it possible that somehow new bins were generated for this second sample, yet the old normal files (used for the first sample) was used to compare again?

chapmanb commented 7 years ago

Thanks much for the discussion and detailed debugging. I pushed a fix to the latest development which I hope will resolve this by assigning batch names to the target/antitarget to prevent re-using different bins when we have multiple normals attached to a single tumor. It's written to be back compatible so you need to remove the problem target/antitarget BEDs that are currently present and then can re-run. Thank you again and please let us know if you run into any other issues.

schelhorn commented 7 years ago

I may have the same issue for tumor/normal batches where in each batch exactly one of a set of three normals is shared between batches. @phu5ion, did the patch fix your problem?

schelhorn commented 7 years ago

Also, cnvkit 0.85 is out now with changes to the antitarget code and depreciation of some other parts: https://github.com/etal/cnvkit/releases/tag/v0.8.5 Perhaps that fixes things as well.

chapmanb commented 7 years ago

Sven-Eric -- I updated the CNVkit bioconda recipe to 0.8.5 but this was a bug in bcbio that the latest development fixes. We were re-using the normal binning from a previous tumor/normal batch which could end up having different target bin sizes. Now we explicitly name these by batch to avoid the issue. You will have to remove existing normal bins so it will re-generate them (it tries to be back compatible to avoid re-computing these for users not doing multi-tumor one-normal batching) but otherwise should hopefully fix the issue for you.

phu5ion commented 7 years ago

@schelhorn Yes the bcbio patch works to solve the problem of the shared normals.

oharismendy commented 6 years ago

I am afraid I have to reopen that topic. Getting the same cnvkit error "Reference is missing 713730 bins found in P2_TA". running multiple tumor:normal batches that share normal, but I thought that issue was fixed... Using bcbio 1.0.9. cnvkit is 0.9.3. Hope you can help.

chapmanb commented 6 years ago

Sorry about the issues. Would you be able to post the full traceback of the error you're seeing? That would help me identify the step in the process and try to help dig into debugging what is going wrong. We've reworked CNVkit binning and preparation and it should be binning all overlapping batches together. The large number of bin differences you're seeing indicates to me there might be something else happening here. Thanks much for the help debugging.

oharismendy commented 6 years ago

Here is the tail of my error log where the error occurred. Note that I am analyzing exomes, which could explain the high number of bins (?). Thanks for your help. error.tail.log

chapmanb commented 6 years ago

Thanks much, this is a big help for getting started. I'm working on reproducing this so I can try to debug what is happening and a couple of other things could help:

The CNVkit commands run:
```
grep cnvkit.py log/bcbio-nextgen-commands.log
```
This would help figure out what went into the background-5-cnvkit.cnn file relative to the P2_TB*coverage.cnn files.
The portion of your configuration YAML related to the P2_TB batch. This would help me identify the layout of this batch relative to other inputs.

Thanks again for the information and help debugging this.

oharismendy commented 6 years ago

thanks for looking into this. The cnvkit commands are attached. The yams for P2_TB is below

`

algorithm: aligner: bwa mark_duplicates: true platform: illumina quality_format: standard variantcaller: [freebayes, vardict, varscan, mutect] svcaller: [cnvkit] coverage_interval: /oncogxA/resources/annotations-and-tracks/exome/baits_targets/SureSelectClinicalExome_S06588914/S06588914_Regions.bed ensemble: numpass: 2 analysis: variant2 description: P2_N files: [/scratch/EBSCC/fastq/P2_N.fq1.gz, /scratch/EBSCC/fastq/P2_N.fq2.gz ] genome_build: GRCh37 metadata: batch: P2_TBN phenotype: normal
algorithm: aligner: bwa mark_duplicates: true platform: illumina quality_format: standard variantcaller: [freebayes, vardict, varscan, mutect] svcaller: [cnvkit] coverage_interval: /oncogxA/resources/annotations-and-tracks/exome/baits_targets/SureSelectClinicalExome_S06588914/S06588914_Regions.bed ensemble: numpass: 2 analysis: variant2 description: P2_TB files: [/scratch/EBSCC/fastq/P2_TB.fq1.gz, /scratch/EBSCC/fastq/P2_TB.fq2.gz ] genome_build: GRCh37 metadata: batch: P2_TBN phenotype: tumor ` cnkit.commands.log

chapmanb commented 6 years ago

Thanks much for following up with all these details. The YAML and the command runs for this look correct to me on first glance; this doesn't appear to have any kind of complex batching which could potentially throw things off and the same input files are used for both the target/antitarget and background generation.

The thing I noticed from the logs is that it looks like there were a couple of runs of this project based on the timestamps. It looks like there may have been some kind of failure during the first run in the middle of background calculation for this step, then it was re-run later and the target/antitarget BEDs got recalculated. Was there something problematic about the run that might have triggered this?

My thought at this point would be to look at the input files to determine if any are truncated:

/scratch/EBSCC/structural/P2_TB/bins/P2_TB-target-coverage.cnn
/scratch/EBSCC/structural/P2_TB/bins/P2_TB-antitarget-coverage.cnn
/scratch/EBSCC/structural/P2_TB/bins/background-5-cnvkit.cnn

if so, removing those and getting them regenerated by a clean re-run of the project might help resolve the issue. Hope this helps some for debugging.

oharismendy commented 6 years ago

Yes. there were several attempts to run. in One of them I deleted the whole structural folder to start from scratch, but still lead to the error. I only gave you part of the YAML, but there are multiple batches in my run and some share the same normals. After counting the number of bins for each sample, the number of bins matches between P2_N and P2_TA (batch P2_TAN), but not between P2_N and P2_TB (batch P2_TBN), which may be the source of the error (see below). I thought this bug was solved (earlier in the thread) and that bins were generated separately for each batch. but seeing this, I am not sure this is true.

{[oharismendy@Oncogx-0020 structural]$ wc -l /bins/ 6136 P1_N/bins/P1_N-antitarget-coverage.cnn 741271 P1_N/bins/P1_N-target-coverage.cnn 747406 P1_TA/bins/background-3-cnvkit.cnn 6136 P1_TA/bins/P1_TA-antitarget-coverage.cnn 579141 P1_TA/bins/P1_TA-normalized.cnr 741271 P1_TA/bins/P1_TA-target-coverage.cnn 8170 P2_N/bins/P2_N-antitarget-coverage.cnn 702668 P2_N/bins/P2_N-target-coverage.cnn 710837 P2_TA/bins/background-0-cnvkit.cnn 8170 P2_TA/bins/P2_TA-antitarget-coverage.cnn 548055 P2_TA/bins/P2_TA-normalized.cnr 702668 P2_TA/bins/P2_TA-target-coverage.cnn 710837 P2_TB/bins/background-5-cnvkit.cnn 8732 P2_TB/bins/P2_TB-antitarget-coverage.cnn 714425 P2_TB/bins/P2_TB-target-coverage.cnn 5632 P3_N/bins/P3_N-antitarget-coverage.cnn 864633 P3_N/bins/P3_N-target-coverage.cnn 5632 P3_TA/bins/P3_TA-antitarget-coverage.cnn 864633 P3_TA/bins/P3_TA-target-coverage.cnn 10789 P4_N/bins/P4_N-antitarget-coverage.cnn 556056 P4_N/bins/P4_N-target-coverage.cnn 8468 P4_TA/bins/P4_TA-antitarget-coverage.cnn 615037 P4_TA/bins/P4_TA-target-coverage.cnn 4932 P4_TA_D287/bins/P4_TA_D287-antitarget-coverage.cnn 946653 P4_TA_D287/bins/P4_TA_D287-target-coverage.cnn 566844 P4_TA_EB62/bins/background-1-cnvkit.cnn 10789 P4_TA_EB62/bins/P4_TA_EB62-antitarget-coverage.cnn 521924 P4_TA_EB62/bins/P4_TA_EB62-normalized.cnr 556056 P4_TA_EB62/bins/P4_TA_EB62-target-coverage.cnn 8731 P5_N/bins/P5_N-antitarget-coverage.cnn 621334 P5_N/bins/P5_N-target-coverage.cnn 8731 P5_TA/bins/P5_TA-antitarget-coverage.cnn 621334 P5_TA/bins/P5_TA-target-coverage.cnn 737383 P5_TB_D831/bins/background-4-cnvkit.cnn 3768 P5_TB_D831/bins/P5_TB_D831-antitarget-coverage.cnn 737383 P5_TB_D831/bins/P5_TB_D831-normalized.cnr 733616 P5_TB_D831/bins/P5_TB_D831-target-coverage.cnn 758031 P5_TB_EB53/bins/background-2-cnvkit.cnn 9329 P5_TB_EB53/bins/P5_TB_EB53-antitarget-coverage.cnn 758031 P5_TB_EB53/bins/P5_TB_EB53-normalized.cnr 748703 P5_TB_EB53/bins/P5_TB_EB53-target-coverage.cnn 18220375 total }

chapmanb commented 6 years ago

Sorry for the delay in following up on this, and thanks for the additional details. My guess from your description of the run is that you've got a different set of targets prepared due to the re-runs. The inputs to these coverage calculations also appear under coverage so I wonder if it's possible that components of this were calculated with different batching during the re-runs, causing the disconnect?

I'm sorry to not have a good clue here but have not had any luck reproducing so either don't understand the component of your setup that triggers the issue, or something went wrong during the re-runs. If you're still stuck on this, is re-running from a clean project an option? If you're able to isolate the issue to a smaller number of batches like you identified and can reproduce with those, seeing the whole configuration might let me identify what went wrong.

Thanks again for the help debugging and hope this helps.

oharismendy commented 6 years ago

Thanks Brad. I think one thing I was doing wrong was to repeat a sample when it belonged to different batches, instead of listing all batches at once. It solved my problem.

chapmanb commented 6 years ago

Great news, I'm glad this helped some and your analysis got finished. Thanks for following up.

bcbio / bcbio-nextgen

Reference missing bins #1116