Closed parlar closed 8 years ago
Pär; Sorry about the issue. It looks like something went wrong with the RTG vcfeval comparison and the output files are somehow truncated or corrupt, so the downstream processing failed. It's hard to tell what happened from this traceback, but you might want to look at:
/home/genetik/revalidation/calling_2016.01.13-15.14.13/work/validate/18-11/samtools/rtg/
to see what the tp.vcf.gz
and other output files look like. If they are problematic by inspection, you could try checking in log/bcbio-nextgen-commands.log
for the rtg vcfeval
commands and run these by hand to see if that produces any useful error message. My guess is it doesn't like something about your input VCF but hopefully that process would help identify exactly what is going on.
I just happened across this issue -- am I correct in interpreting from the backtrace that it's happening while validate.py is examining the call VCF for an appropriate score field, before it even gets to invoking vcfeval?
Assuming it does get as far as running vcfeval, if vcfeval creates a file called done
, then it thinks it has completed successfully (and in general we are also quite careful about returning appropriate exit codes in our tools too). (Brad, I'm not sure if you are checking either of those conditions). Anyway if that done file does not exist, the vcfeval.log
file in the output directory should indicate what went wrong.
Len -- thanks much for the information. bcbio does try to check that it finishes but I'm not sure what happens in this case. It seems like the initial validation file might be malformed from the errors that arose. Pär, is that a possibility? Happy to dig into this more with any other details.
Sorry for my disappearance from this issue. Too many things going on here at the same time I'm afraid. Thanks Len and Brad. I'll take a look at the validation file and try to clean it up. It's an ensemble call vcf file from bcbio-nextgen.
Still can't figure this out. I ran it again and it seems that it is the validation of sample 18-11 / ensemble calls that fail:
[2016-01-22T12:25Z] bgzip 18-11-ensemble-single.vt.vcf.gz
[2016-01-22T12:25Z] tabix index 18-11-ensemble-single.vt.vcf.gz
[2016-01-22T12:25Z] Select sample: 18-11
[2016-01-22T12:25Z] tabix index batch1-filter-18-11.vcf.gz
[2016-01-22T12:25Z] Resource requests: ; memory: 1.00; cores: 1
[2016-01-22T12:25Z] Configuring 10 jobs to run, using 1 cores each with 1.00g of memory reserved for each job
[2016-01-22T12:25Z] Intersect callable intervals for rtg vcfeval
Traceback (most recent call last):
File "/home/genetik/bcbio/bin/bcbio_nextgen.py", line 226, in <module>
main(**kwargs)
File "/home/genetik/bcbio/bin/bcbio_nextgen.py", line 43, in main
run_main(**kwargs)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 39, in run_main
fc_dir, run_info_yaml)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 166, in variant2pipeline
samples = run_parallel("compare_to_rm", samples)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
return run_multicore(fn, items, config, parallel=parallel)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 804, in __call__
while self.dispatch_one_batch(iterator):
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 662, in dispatch_one_batch
self._dispatch(tasks)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 570, in _dispatch
job = ImmediateComputeBatch(batch)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 183, in __init__
self.results = batch()
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 51, in wrapper
return apply(f, *args, **kwargs)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 203, in compare_to_rm
return validate.compare_to_rm(*args)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/validate.py", line 108, in compare_to_rm
eval_files = _run_rtg_eval(vrn_file, rm_file, rm_interval_file, base_dir, toval_data)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/validate.py", line 166, in _run_rtg_eval
cmd += ["--vcf-score-field='%s'" % (_pick_best_quality_score(vrn_file))]
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/validate.py", line 193, in _pick_best_quality_score
with VariantFile(vrn_file) as val_in:
File "pysam/cbcf.pyx", line 2359, in pysam.cbcf.VariantFile.__init__ (pysam/cbcf.c:36299)
File "pysam/cbcf.pyx", line 2519, in pysam.cbcf.VariantFile.open (pysam/cbcf.c:38620)
File "pysam/cbcf.pyx", line 2590, in pysam.cbcf.VariantFile.open (pysam/cbcf.c:39699)
File "pysam/cbcf.pyx", line 2113, in pysam.cbcf.makeTabixIndex (pysam/cbcf.c:33771)
File "pysam/cbcf.pyx", line 2093, in pysam.cbcf.TabixIndex.__init__ (pysam/cbcf.c:33312)
ValueError: Cannot retrieve reference sequence names
However, under validation, the 18-11/ensemble directory does not exist. Hence, no vcfeval.log to investigate. Do you need data / configuration files to investigate?
Rerunning the script using -n 1
unfortunately fails to explain which file that is causing the problem. The the only output I get is found below.
[2016-01-22T12:53Z] Timing: validation
[2016-01-22T12:53Z] multiprocessing: compare_to_rm
[2016-01-22T12:53Z] Resource requests: ; memory: 1.00; cores: 1
[2016-01-22T12:53Z] Configuring 1 jobs to run, using 1 cores each with 1.00g of memory reserved for each job
Traceback (most recent call last):
File "/home/genetik/bcbio/bin/bcbio_nextgen.py", line 226, in <module>
main(**kwargs)
File "/home/genetik/bcbio/bin/bcbio_nextgen.py", line 43, in main
run_main(**kwargs)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 39, in run_main
fc_dir, run_info_yaml)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 166, in variant2pipeline
samples = run_parallel("compare_to_rm", samples)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
return run_multicore(fn, items, config, parallel=parallel)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 804, in __call__
while self.dispatch_one_batch(iterator):
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 662, in dispatch_one_batch
self._dispatch(tasks)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 570, in _dispatch
job = ImmediateComputeBatch(batch)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 183, in __init__
self.results = batch()
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 51, in wrapper
return apply(f, *args, **kwargs)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 203, in compare_to_rm
return validate.compare_to_rm(*args)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/validate.py", line 108, in compare_to_rm
eval_files = _run_rtg_eval(vrn_file, rm_file, rm_interval_file, base_dir, toval_data)
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/validate.py", line 166, in _run_rtg_eval
cmd += ["--vcf-score-field='%s'" % (_pick_best_quality_score(vrn_file))]
File "/home/genetik/bcbio/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/validate.py", line 193, in _pick_best_quality_score
with VariantFile(vrn_file) as val_in:
File "pysam/cbcf.pyx", line 2359, in pysam.cbcf.VariantFile.__init__ (pysam/cbcf.c:36299)
File "pysam/cbcf.pyx", line 2519, in pysam.cbcf.VariantFile.open (pysam/cbcf.c:38620)
File "pysam/cbcf.pyx", line 2590, in pysam.cbcf.VariantFile.open (pysam/cbcf.c:39699)
File "pysam/cbcf.pyx", line 2113, in pysam.cbcf.makeTabixIndex (pysam/cbcf.c:33771)
File "pysam/cbcf.pyx", line 2093, in pysam.cbcf.TabixIndex.__init__ (pysam/cbcf.c:33312)
ValueError: Cannot retrieve reference sequence names
Pär;
Thanks for following up. You're right that something is going wrong with parsing the variant file, although it is also tough to tell without the exact file. I pushed a small change to report the file that fails which might help isolate this, if you upgrade to the latest development (bcbio_nextgen.py upgrade -u development
). From your debugging so far it might be the 18-11 ensemble file. Is there anything unusual about that file that might give an idea of how to reproduce? If it's empty or otherwise malformed I can try to replicate and work around the issue. Thanks again for the help debugging.
Thanks. I retried after updating.
The error message this time was Failed to parse input file /home/genetik/revalidation/calling_2016.01.13-15.14.13/work/platypus/batch1-filter-18-11.vcf.gz
.
Except for the header, it turns out that this file is empty. No variants there. I guess that makes sense then that it cannot find any "reference sequence names".
Pär;
Thanks for confirming this. I pushed a fix to the latest development version which skips trying to parse the file if it's empty without variants. Apologies, I hadn't realized pysam would fail in this situation. If you update (bcbio_nextgen.py upgrade -u development
) then it should skip this and continue cleanly. Please let us know if you run into any other problems with this.
Hi again! It seems that the same problem propagates downstream, rtg exits with this message:
subprocess.CalledProcessError: Command 'set -o pipefail; export RTG_JAVA_OPTS='-Xms1g' && export RTG_MEM=5g && rtg vcfeval --threads 6 -b /home/genetik/revalidation/calling_2016.01.13-15.14.13/work/validate/18-11/platypus/18-11-ensemble-single.vt.vcf.gz --bed-regions /home/genetik/revalidation/calling_2016.01.13-15.14.13/work/validate/18-11/platypus/18-11-sort-callable_sample-18-11-platypus-wrm.bed -c /home/genetik/revalidation/calling_2016.01.13-15.14.13/work/platypus/batch1-filter-18-11.vcf.gz -t /home/genetik/bcbio/share/bcbio/genomes/Hsapiens/hg19/rtg/hg19.sdf -o /home/genetik/revalidation/calling_2016.01.13-15.14.13/work/validate/18-11/platypus/rtg --vcf-score-field='DP'
Error: There were no sequence names in common between the supplied baseline and called variant sets. Check they use the same reference and are non-empty.
' returned non-zero exit status 1
Pär; Sorry about the continued problems. I pushed a fix to skip these empty ones entirely during validation, which should hopefully avoid all these problems if you update to development one more time. Hope this works for your analysis.
thanks!
I'm getting an error running a validation against a custom reference call set. Any ideas what might be wrong?
The config file looks like this: