bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
986 stars 354 forks source link

samtools/mpileup error #1199

Closed parlar closed 8 years ago

parlar commented 8 years ago

I believe that I'm getting the same error as https://github.com/chapmanb/bcbio-nextgen/issues/1154 ,see traceback below.

2016-01-22T12:02Z] Varscan
[2016-01-22T12:02Z] [mpileup] 1 samples in 1 input files
[2016-01-22T12:02Z] <mpileup> Set max per-file depth to 8000
[2016-01-22T12:02Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /usr/local/bin/samtools mpileup -f /usr/local/share/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -d 1000 -L 1000 -l /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/H1975-B-7_0_55089310-raw-regions.bed /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/bamprep/H1975-B_tumor/7/H1975-B_tumor-sort-7_0_55089310-prep.bam | grep -v -P '\t0\t\t$' | ifne varscan -Xms681m -Xmx1818m -Duser.language=en -Duser.country=US -XX:+UseSerialGC -Djava.io.tmpdir=/home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/tx/tmpDWLWpA mpileup2cns --min-coverage 5 --p-value 0.98   --vcf-sample-list /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/tx/tmpxlP4EC/H1975-B-7_0_55089310-raw-sample_list.txt --output-vcf --variants | /usr/local/share/bcbio/anaconda/bin/py -x 'bcbio.variation.varscan.fix_varscan_output(x)' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $4) } {print}' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $5) } {print}' | ifne vcfuniqalleles > /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/tx/tmpxlP4EC/H1975-B-7_0_55089310-raw.vcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
' returned non-zero exit status 1
Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 226, in <module>
    main(**kwargs)
  File "/usr/local/bin/bcbio_nextgen.py", line 43, in main
    run_main(**kwargs)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 39, in run_main
    fc_dir, run_info_yaml)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 149, in variant2pipeline
    samples = genotype.parallel_variantcall_region(samples, run_parallel)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 169, in parallel_variantcall_region
    "vrn_file", ["region", "sam_ref", "config"]))
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/split.py", line 32, in grouped_parallel_split_combine
    final_output = parallel_fn(parallel_name, split_args)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 804, in __call__
    while self.dispatch_one_batch(iterator):
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 662, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 570, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 183, in __init__
    self.results = batch()
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 51, in wrapper
    return apply(f, *args, **kwargs)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 151, in variantcall_sample
    return genotype.variantcall_sample(*args)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 254, in variantcall_sample
    call_file = caller_fn(align_bams, items, sam_ref, assoc_files, region, call_file)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/varscan.py", line 34, in run_varscan
    region, out_file)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/samtools.py", line 40, in shared_variantcall
    tx_out_file)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/varscan.py", line 309, in _varscan_work
    [do.file_exists(out_file)])
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; /usr/local/bin/samtools mpileup -f /usr/local/share/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -d 1000 -L 1000 -l /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/H1975-B-7_0_55089310-raw-regions.bed /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/bamprep/H1975-B_tumor/7/H1975-B_tumor-sort-7_0_55089310-prep.bam | grep -v -P '\t0\t\t$' | ifne varscan -Xms681m -Xmx1818m -Duser.language=en -Duser.country=US -XX:+UseSerialGC -Djava.io.tmpdir=/home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/tx/tmpDWLWpA mpileup2cns --min-coverage 5 --p-value 0.98   --vcf-sample-list /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/tx/tmpxlP4EC/H1975-B-7_0_55089310-raw-sample_list.txt --output-vcf --variants | /usr/local/share/bcbio/anaconda/bin/py -x 'bcbio.variation.varscan.fix_varscan_output(x)' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $4) } {print}' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $5) } {print}' | ifne vcfuniqalleles > /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/tx/tmpxlP4EC/H1975-B-7_0_55089310-raw.vcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
' returned non-zero exit status 1

Running

$ /usr/local/bin/samtools mpileup -f /usr/local/share/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -d 1000 -L 1000 -l /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/H1975-B-7_0_55089310-raw-regions.bed /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/bamprep/H1975-B_tumor/7/H1975-B_tumor-sort-7_0_55089310-prep.bam | grep -v -P '\t0\t\t$' | ifne varscan -Xms681m -Xmx1818m -Duser.language=en -Duser.country=US -XX:+UseSerialGC -Djava.io.tmpdir=/home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/tx/tmpDWLWpA mpileup2cns --min-coverage 5 --p-value 0.98   --vcf-sample-list /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/tx/tmpxlP4EC/H1975-B-7_0_55089310-raw-sample_list.txt --output-vcf --variants | /usr/local/share/bcbio/anaconda/bin/py -x 'bcbio.variation.varscan.fix_varscan_output(x)' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $4) } {print}' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $5) } {print}' | ifne vcfuniqalleles > /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/tx/tmpxlP4EC/H1975-B-7_0_55089310-raw.vcf

gives this output:

-bash: /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/tx/tmpxlP4EC/H1975-B-7_0_55089310-raw.vcf: No such file or directory
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

I notice that the temporary directory .../tmpxlP4EC/.., where the vcf output should go, does not exist, i.e. /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/tx/ is empty. Could this be the source of the problem?

chapmanb commented 8 years ago

Pär; Sorry about the problems with VarScan calling. It looks like some process in the pipe fails but it's tough to tell what is wrong as there is not a useful error message (the mpileup message is standard and tell you it's downsampling at 8000 reads).

Re-running the command manually is the best way to do, although as you noticed you have to replace the temporary directories since those are created/removed by bcbio as part of processing. It's a bit tricky with VarScan since bcbio also creates a sample list in the temporary directory: You should be able to do:

/usr/local/bin/samtools mpileup -f /usr/local/share/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -d 1000 -L 1000 -l /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/varscan/7/H1975-B-7_0_55089310-raw-regions.bed /home/lindak/Data/LiquidBiopsies/TruSight15/160115/calling/work/bamprep/H1975-B_tumor/7/H1975-B_tumor-sort-7_0_55089310-prep.bam | grep -v -P '\t0\t\t$' | ifne varscan -Xms681m -Xmx1818m -Duser.language=en -Duser.country=US -XX:+UseSerialGC  mpileup2cns --min-coverage 5 --p-value 0.98  --output-vcf --variants | /usr/local/share/bcbio/anaconda/bin/py -x 'bcbio.variation.varscan.fix_varscan_output(x)' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $4) } {print}' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $5) } {print}' | ifne vcfuniqalleles > H1975-B-7_0_55089310-raw.vcf

Depending on if you can get any useful errors from that, you might have to break the line down to see what is happening. Sorry to not have a better way to debug -- happy to try and offer suggestions if you can identify any other error messages that give us a better idea of what is going wrong. Hope this helps.

parlar commented 8 years ago

the mpileup output generated by samtools is empty.

The regions file (H1975-B-7_0_55089310-raw-regions.bed )contains only one row:

7 55089120 55089310 .

But, if I run samtools mpileup on the bam file (H1975-B_tumor/7/H1975-B_tumor-sort-7_0_55089310-prep.bam), I get the output below. The regions don't overlap, so it makes sense I guess. But looking at the bam file using samtools view shows that there are indeed reads there. I guess that they are removed by some default filtering criterion (qual,min cov ... ?).

I suppose that the empty mpileup output must be to blame here? Any way to allow for empty mpileup outputs without resulting in the error?

$ samtools mpileup H1975-B_tumor/7/H1975-B_tumor-sort-7_0_55089310-prep.bam

7   47671587    C   1   ^]. A
7   47671588    T   1   .   B
7   47671589    G   1   .   A
7   47671590    A   1   .   B
7   47671591    G   1   .+1T    B
7   47671592    C   1   .   F
7   47671593    A   1   .   F
7   47671594    G   1   .   F
7   47671595    G   1   .   F
7   47671596    C   1   .   F
7   47671597    C   1   .   B
7   47671598    C   1   .   G
7   47671599    T   1   .   G
7   47671600    T   1   .   G
7   47671601    A   1   .   G
7   47671602    A   1   .   G
7   47671603    C   1   .   G
7   47671604    T   1   .   G
7   47671605    A   1   .   G
7   47671606    G   1   .   G
7   47671607    C   1   .   G
7   47671608    T   1   .   H
7   47671609    C   1   .   H
7   47671610    G   1   .   G
7   47671611    C   1   .   G
7   47671612    C   1   .   G
7   47671613    T   1   .   G
7   47671614    T   1   .   G
7   47671615    C   1   .   H
7   47671616    C   1   .   H
7   47671617    T   1   .   H
7   47671618    G   1   .   G
7   47671619    C   1   .   H
7   47671620    A   1   .   H
7   47671621    G   1   .   H
7   47671622    G   1   .   H
7   47671623    C   1   .   G
7   47671624    A   1   .   H
7   47671625    T   1   .   H
7   47671626    C   1   .   H
7   47671627    T   1   .   G
7   47671628    T   1   .   H
7   47671629    T   1   .   H
7   47671630    T   1   .   G
7   47671631    G   1   .   H
7   47671632    C   1   .   H
7   47671633    A   1   .   H
7   47671634    C   1   .   H
7   47671635    A   1   .   H
7   47671636    C   1   .   H
7   47671637    C   1   .   G
7   47671638    T   1   .   H
7   47671639    G   1   .   G
7   47671640    C   1   .   H
7   47671641    T   1   .   H
7   47671642    G   1   .   H
7   47671643    T   1   .   H
7   47671644    G   1   .   H
7   47671645    T   1   .   H
7   47671646    G   1   .   H
7   47671647    C   1   .   H
7   47671648    C   1   .   H
7   47671649    G   1   .   G
7   47671650    G   1   .   G
7   47671651    G   1   .   G
7   47671652    T   1   .   1
7   47671653    G   1   .   E
7   47671654    C   1   .   ?
7   47671655    T   1   .   G
7   47671656    G   1   .   G
7   47671657    G   1   .   H
7   47671658    T   1   .   H
7   47671659    T   1   .   H
7   47671660    G   1   .   G
7   47671661    A   1   .   H
7   47671662    A   1   .   E
7   47671663    G   1   .   G
7   47671664    A   1   .   H
7   47671665    T   1   .   H
7   47671666    G   1   .   H
7   47671667    C   1   .   H
7   47671668    A   1   .   H
7   47671669    A   1   .   H
7   47671670    C   1   .   H
7   47671671    C   1   .   H
7   47671672    T   1   .   H
7   47671673    C   1   .   H
7   47671674    T   1   .   H
7   47671675    G   1   .   H
7   47671676    A   1   .   H
7   47671677    C   1   .   H
7   47671678    C   1   .   H
7   47671679    T   1   .   H
7   47671680    T   1   .   H
7   47671681    T   1   .   H
7   47671682    A   1   .   H
7   47671683    G   1   .   H
7   47671684    G   1   .   H
7   47671685    G   1   .   H
7   47671686    A   1   .   H
7   47671687    T   1   .   H
7   47671688    T   1   .   H
7   47671689    G   1   .   H
7   47671690    C   1   .   G
7   47671691    A   1   .   H
7   47671692    A   1   .   H
7   47671693    G   1   .   H
7   47671694    C   1   .   G
7   47671695    T   1   .   H
7   47671696    T   1   .   H
7   47671697    G   1   .   H
7   47671698    A   1   .   H
7   47671699    G   1   .   H
7   47671700    C   1   .   H
7   47671701    T   1   .   H
7   47671702    C   1   .   H
7   47671703    G   1   .   G
7   47671704    G   1   .   G
7   47671705    T   1   .   G
7   47671706    G   1   .   G
7   47671707    G   1   .   G
7   47671708    G   1   .   G
7   47671709    G   1   .   G
7   47671710    A   1   .   G
7   47671711    C   1   .   G
7   47671712    A   1   .   G
7   47671713    C   1   .   H
7   47671714    A   1   .   H
7   47671715    G   1   .   H
7   47671716    A   1   .   H
7   47671717    T   1   .   H
7   47671718    G   1   .   H
7   47671719    A   1   .   H
7   47671720    T   1   .   H
7   47671721    A   1   .   H
7   47671722    G   1   .   e
7   47671723    A   1   .   o
7   47671724    T   1   .   m
7   47671725    A   1   .   n
7   47671726    A   1   .   n
7   47671727    A   1   .   i
7   47671728    T   1   .   Z
7   47671729    A   1   .   n
7   47671730    A   1   .   i
7   47671731    C   1   .   k
7   47671732    A   1   .   n
7   47671733    C   1   .   o
7   47671734    A   1   .   o
7   47671735    G   1   .   n
7   47671736    T   1   .$  X
7   47671737    G   1   ,   H
7   47671738    T   1   ,   H
7   47671739    G   1   ,   H
7   47671740    A   1   ,   H
7   47671741    T   1   ,   G
7   47671742    A   1   ,   H
7   47671743    A   1   ,   B
7   47671744    C   1   ,   H
7   47671745    C   1   ,   H
7   47671746    A   1   ,   H
7   47671747    T   1   ,   F
7   47671748    C   1   ,   H
7   47671749    T   1   ,   H
7   47671750    G   1   ,   H
7   47671751    G   1   ,   G
7   47671752    T   1   ,   H
7   47671753    C   1   ,   H
7   47671754    A   1   ,   H
7   47671755    G   1   ,   H
7   47671756    G   1   ,   H
7   47671757    G   1   ,   H
7   47671758    A   1   ,   H
7   47671759    T   1   ,   H
7   47671760    G   1   ,   H
7   47671761    G   1   ,   G
7   47671762    A   1   ,   G
7   47671763    A   1   ,   F
7   47671764    C   1   ,   H
7   47671765    C   1   ,   H
7   47671766    T   1   ,   H
7   47671767    G   1   ,   G
7   47671768    A   1   ,   H
7   47671769    T   1   ,   H
7   47671770    G   1   ,   H
7   47671771    A   1   ,   H
7   47671772    G   1   ,   H
7   47671773    G   1   ,   H
7   47671774    A   1   ,   H
7   47671775    A   1   ,   H
7   47671776    G   1   ,   H
7   47671777    A   1   ,   H
7   47671778    A   1   ,   G
7   47671779    G   1   ,   H
7   47671780    T   1   ,   H
7   47671781    G   1   ,   G
7   47671782    A   1   ,   F
7   47671783    C   1   ,   4
7   47671784    T   1   ,   H
7   47671785    G   1   ,   H
7   47671786    G   1   ,   F
7   47671787    A   1   ,   H
7   47671788    A   1   ,   H
7   47671789    T   1   ,   G
7   47671790    C   1   ,   G
7   47671791    C   1   ,   G
7   47671792    T   1   ,   H
7   47671793    C   1   ,   F
7   47671794    C   1   ,   H
7   47671795    T   1   ,   F
7   47671796    G   1   ,   E
7   47671797    G   1   ,   E
7   47671798    G   1   ,   G
7   47671799    C   1   ,   H
7   47671800    T   1   ,   H
7   47671801    G   1   ,   H
7   47671802    G   1   ,   H
7   47671803    A   1   ,   H
7   47671804    A   1   ,   H
7   47671805    G   1   ,   G
7   47671806    G   1   ,   G
7   47671807    T   1   ,   G
7   47671808    G   1   ,   G
7   47671809    G   1   ,   G
7   47671810    G   1   ,   G
7   47671811    G   1   ,   H
7   47671812    G   1   ,   H
7   47671813    A   1   ,   H
7   47671814    G   1   ,   F
7   47671815    C   1   ,   H
7   47671816    T   1   ,   G
7   47671817    T   1   ,   G
7   47671818    T   1   ,   H
7   47671819    T   1   ,   H
7   47671820    A   1   ,   H
7   47671821    T   1   ,   H
7   47671822    G   1   ,   H
7   47671823    T   1   ,   H
7   47671824    C   1   ,   H
7   47671825    T   1   ,   H
7   47671826    G   1   ,   H
7   47671827    C   1   ,   H
7   47671828    A   1   ,   G
7   47671829    G   1   ,   H
7   47671830    T   1   ,   H
7   47671831    C   1   ,   H
7   47671832    T   1   ,   H
7   47671833    C   1   ,   B
7   47671834    T   1   ,   H
7   47671835    T   1   ,   G
7   47671836    T   1   ,   H
7   47671837    G   1   ,   H
7   47671838    G   1   ,   F
7   47671839    G   1   ,   H
7   47671840    T   1   ,   G
7   47671841    A   1   ,   G
7   47671842    G   1   ,   G
7   47671843    G   1   ,   H
7   47671844    G   1   ,   H
7   47671845    A   1   ,   H
7   47671846    A   1   ,   F
7   47671847    G   1   ,   G
7   47671848    A   1   ,   H
7   47671849    A   1   ,   F
7   47671850    A   1   ,   H
7   47671851    A   1   ,   G
7   47671852    T   1   ,   G
7   47671853    G   1   ,   G
7   47671854    A   1   ,   G
7   47671855    A   1   ,   G
7   47671856    G   1   ,   C
7   47671857    C   1   ,   E
7   47671858    A   1   ,   G
7   47671859    G   1   ,   C
7   47671860    G   1   ,   C
7   47671861    G   1   ,   B
7   47671862    G   1   ,   F
7   47671863    A   1   ,   F
7   47671864    G   1   ,   4
7   47671865    G   1   ,   @
7   47671866    C   1   ,   B
7   47671867    C   1   ,   A
7   47671868    T   0       
7   47671869    T   0   
chapmanb commented 8 years ago

Pär; Thanks for the debugging information. This helped a lot and I think I identified the underlying issue and pushed a fix to the latest development. This should now correctly pass through empty mpileups without failing. Please let me know if you run into any other issues at all.

parlar commented 8 years ago

great, will try asap

hyong2000 commented 8 years ago

Hi Chapmanb,

I updated bcbio pipeline version from 0.9.6 to 0.9.7 to fix this problem, but I guess I still have the same problem as followings.

[2016-05-11T13:23Z] Timing: structural variation initial [2016-05-11T13:23Z] Timing: hla typing [2016-05-11T13:23Z] run local -- checkpoint passed: full [2016-05-11T13:23Z] Timing: alignment post-processing [2016-05-11T13:23Z] multiprocessing: piped_bamprep [2016-05-11T13:23Z] Timing: variant calling [2016-05-11T13:23Z] multiprocessing: variantcall_sample [2016-05-11T13:42Z] Uncaught exception occurred Traceback (most recent call last): File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run _do_run(cmd, checks, log_stdout) File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run raise subprocess.CalledProcessError(exitcode, error_msg) CalledProcessError: Command 'set -o pipefail; /project/ibilab/tools/bcbio/anaconda/bin/samtools mpileup -f /project/ibilab/tools/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -d 1000 -L 1000 -l /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/varscan/chr14/3918_S8_L001-chr14_36199824_52492964-regions.bed /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/bamprep/3918_S8_L001/chr14/3918_S8_L001-sort-chr14_36199824_52492964-prep.bam | ifne grep -v -P '\t0\t\t$' | ifne varscan -Xms681m -Xmx1818m -Duser.language=en -Duser.country=US -XX:+UseSerialGC -Djava.io.tmpdir=/project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/tx/tmphX3aev mpileup2cns --min-coverage 5 --p-value 0.98 --vcf-sample-list /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/varscan/chr14/tx/tmpeC4VYD/3918_S8_L001-chr14_36199824_52492964-sample_list.txt --output-vcf --variants | /project/ibilab/tools/bcbio/anaconda/bin/py -x 'bcbio.variation.varscan.fix_varscan_output(x)' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $4) } {print}' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $5) } {print}' | ifne vcfuniqalleles > /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/varscan/chr14/tx/tmpeC4VYD/3918_S8_L001-chr14_36199824_52492964.vcf [mpileup] 1 samples in 1 input files

Set max per-file depth to 8000 ' returned non-zero exit status 1 When I run “/project/ibilab/tools/bcbio/anaconda/bin/samtools mpileup -f /project/ibilab/tools/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -d 1000 -L 1000 -l /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/varscan/chr14/3918_S8_L001-chr14_36199824_52492964-regions.bed /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/bamprep/3918_S8_L001/chr14/3918_S8_L001-sort-chr14_36199824_52492964-prep.bam” I got followings. [mpileup] 1 samples in 1 input files Set max per-file depth to 8000 chr14 36199825 g 0 chr14 36199826 a 0 chr14 36199827 g 0 chr14 36199828 t 0 chr14 36199829 t 0 chr14 36199830 c 0 With 'ifne grep -v -P '\t0\t\t$', I have empty string. So, it seems that there is still “empty string” problem? Please let me have your thoughts. Thanks!
chapmanb commented 8 years ago

Sorry about the issue. I'm trying to replicate this by feeding an empty file into the start of the chained post-processing but get an empty file and zero error code so am not sure what part is causing an issue on your run. Would you be able to step through the rest of the command:

| ifne grep -v -P '\t0\t\t$' | ifne varscan -Xms681m -Xmx1818m -Duser.language=en -Duser.country=US -XX:+UseSerialGC -Djava.io.tmpdir=/project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/tx/tmphX3aev mpileup2cns --min-coverage 5 --p-value 0.98 --vcf-sample-list /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/varscan/chr14/tx/tmpeC4VYD/3918_S8_L001-chr14_36199824_52492964-sample_list.txt --output-vcf --variants | /project/ibilab/tools/bcbio/anaconda/bin/py -x 'bcbio.variation.varscan.fix_varscan_output(x)' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $4) } {print}' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $5) } {print}' | ifne vcfuniqalleles

and add each separate command to the end of the run one at a time, and then echo $? afterwards to see which command returns a non-zero exit code. If you can identify the underlying cause we can adjust to fix. It's possible the awk commands do something different on your machine with empty input as those are not protected with ifne.

Thanks much for the help debugging.

hyong2000 commented 8 years ago

Yes. Interestingly, I did not see any error message and also found an empty result file (work/varscan/chr14/3918_S8_L001-chr14_36199824_52492964.vcf) when I run the following script which I copied from the error log (except changing "temp" dir for the result).

/project/ibilab/tools/bcbio/anaconda/bin/samtools mpileup -f /project/ibilab/tools/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -d 1000 -L 1000 -l /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/varscan/chr14/3918_S8_L001-chr14_36199824_52492964-regions.bed /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/bamprep/3918_S8_L001/chr14/3918_S8_L001-sort-chr14_36199824_52492964-prep.bam | ifne grep -v -P '\t0\t\t$' | ifne varscan -Xms681m -Xmx1818m -Duser.language=en -Duser.country=US -XX:+UseSerialGC -Djava.io.tmpdir=/project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/tx/tmphX3aev mpileup2cns --min-coverage 5 --p-value 0.98 --vcf-sample-list /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/varscan/chr14/tx/tmpeC4VYD/3918_S8_L001-chr14_36199824_52492964-sample_list.txt --output-vcf --variants | /project/ibilab/tools/bcbio/anaconda/bin/py -x 'bcbio.variation.varscan.fix_varscan_output(x)' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $4) } {print}' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $5) } {print}' | ifne vcfuniqalleles > /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/varscan/chr14/3918_S8_L001-chr14_36199824_52492964.vcf

But, it kept throwing error when I run it as a whole pipeline as followings. (I ran it serveral times.)

bcbio_nextgen.py /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/config/ensemble_trim.yaml -t local -n 1

I wonder whether differences on i) running on a whole pipeline and ii) running as a chained single script make any difference on the sensitivity of error. It seems like this error only occurs when it run as a whole pipeline, but it seems to work well running as a chained single script in the same cluster. Thank you!

chapmanb commented 8 years ago

Thanks much for helping debug this. My guess is that there won't be an error message but one of the subcommands returns a non-zero exit code and a (correct) empty file, so only running and looking at the output won't help identify it.

If you look at the error code (with echo $?) after running the command externally, is it returning 0 or 1? My guess is 1 based on the bcbio error message, and to debug where this happens you can keep removing commands from the end of the piped process until it returns zero, then we'll identify where the problem occurs and can fix.

Sorry for the debugging process and hopefully this helps identify it.

hyong2000 commented 8 years ago

Got it! I ran the command by adding each separate command to the end of the run one at a time, and did echo $?. I finally found the place that make an error as followings.

$ /project/ibilab/tools/bcbio/anaconda/bin/samtools mpileup -f /project/ibilab/tools/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -d 1000 -L 1000 -l /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/varscan/chr14/3918_S8_L001-chr14_36199824_52492964-regions.bed /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/bamprep/3918_S8_L001/chr14/3918_S8_L001-sort-chr14_36199824_52492964-prep.bam | ifne grep -v -P '\t0\t\t$' [mpileup] 1 samples in 1 input files

Set max per-file depth to 8000 $ echo $? 1 It seems like 'ifne grep -v -P '\t0\t\t$'' returns non-zero exit code. Then, what do I have to do to fix this? Thank you again!!
chapmanb commented 8 years ago

Thanks for debugging and isolating the issue. I'm confused as to what is going on since the ifne is supposed to protect against this -- it should skip the grep step which raises the non-zero output. What type of system are you running this on? I can try to replicate locally.

To help with debugging, does this give the expected return codes:

touch empty.txt
cat empty.txt | grep -v -P '\t0\t\t$'
echo $? # should be 1
cat empty.txt | ifne grep -v -P '\t0\t\t$'
echo $? # should be 0

Sorry about all the back and forth and not having a good handle about what is going on. Hopefully we can isolate and fix soon to get your analysis running.

hyong2000 commented 8 years ago

Yes. I tried the commands and I have 1 and 0 for the first and the second command respectively, which is expected.

I am using a node from a linux cluster for this analysis and here is the system information.

$ cat /proc/version Linux version 2.6.32-504.el6.x86_64 (mockbuild@x86-023.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC) ) #1 SMP Tue Sep 16 01:56:35 EDT 2014

Please let me know if you need any other information. Thank you!

chapmanb commented 8 years ago

Thanks for following up on this. Now I'm really confused as am not sure why the initial input doesn't get caught by the ifne check. Could you try doing:

/project/ibilab/tools/bcbio/anaconda/bin/samtools mpileup -f /project/ibilab/tools/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -d 1000 -L 1000 -l /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/varscan/chr14/3918_S8_L001-chr14_36199824_52492964-regions.bed /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/bamprep/3918_S8_L001/chr14/3918_S8_L001-sort-chr14_36199824_52492964-prep.bam > problem-pileup.txt
cat problem_pileup.txt | ifne grep -v -P '\t0\t\t$'
echo $? # should be 0

If that doesn't give you that, would you be able to upload the problem_pileup.txt file as a gist (https://gist.github.com/) and I can try to replicate from there. Thanks again.

hyong2000 commented 8 years ago

I got exit code 1 for the following scripts. Please find the file 'problem-pileup.txt' at https://gist.github.com/hyong2000/07613b3155e75dddc0305c0d5d16f142 . If you need any other information, please let me know. Thank you!

/project/ibilab/tools/bcbio/anaconda/bin/samtools mpileup -f /project/ibilab/tools/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -d 1000 -L 1000 -l /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/varscan/chr14/3918_S8_L001-chr14_36199824_52492964-regions.bed /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result2/work/bamprep/3918_S8_L001/chr14/3918_S8_L001-sort-chr14_36199824_52492964-prep.bam > problem-pileup.txt cat problem-pileup.txt | ifne grep -v -P '\t0\t\t$' echo $?

hyong2000 commented 8 years ago

I learned that grep returns the exit code ‘1’ if no matches were found. Is there a way to avoid exit code '1' for no matches?

chapmanb commented 8 years ago

Thanks so much for the debugging details. You're exactly right, the issue was due to removing all reads from this input (which is all empty bases). I pushed a fix to avoid the problem, so if you update to the latest development:

bcbio_nextgen.py upgrade -u development

it should hopefully work correctly for you now. Thanks so much for the patience debugging this and hope it gets your analysis finished.

hyong2000 commented 8 years ago

Great! Thank you! I will let you know whether it works correctly.

hyong2000 commented 8 years ago

I worked correctly for this issue, but the new development version (0.9.8a0) fired other error as followings.

[2016-05-16T17:41Z] Timing: hla typing [2016-05-16T17:41Z] Timing: alignment post-processing [2016-05-16T17:41Z] multiprocessing: piped_bamprep [2016-05-16T23:11Z] Timing: variant calling [2016-05-16T23:11Z] multiprocessing: variantcall_sample [2016-05-17T16:43Z] multiprocessing: concat_variant_files [2016-05-17T16:43Z] Uncaught exception occurred Traceback (most recent call last): File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run _do_run(cmd, checks, log_stdout) File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run raise subprocess.CalledProcessError(exitcode, error_msg) CalledProcessError: Command 'set -o pipefail; /project/ibilab/tools/bcbio/anaconda/bin/bcftools concat --allow-overlaps -O z --file-list /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result4/work/mutect/3500_S6_L001-files.list -o /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result4/work/mutect/tx/tmpsKfHk6/3500_S6_L001.vcf.gz Different number of samples in /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result4/work/mutect/chr1/3500_S6_L001-chr1_0_16044553.vcf.gz. Perhaps "bcftools merge" is what you are looking for? ' returned non-zero exit status 255

Is there a way that I can just fix this part based on the stable version of bcbio (0.9.7)?

chapmanb commented 8 years ago

Sorry about the problem, it appears you might have an older version of bcftools that doesn't support the concat operation. What does:

 /project/ibilab/tools/bcbio/anaconda/bin/bcftools --version

report? It should be version 1.3. If you're bcftools is out of date you can update with:

bcbio_nextgen.py upgrade --tools

and hopefully this will get everything working for you.

hyong2000 commented 8 years ago

I have bcftools 1.3. Please see below.

$ /project/ibilab/tools/bcbio/anaconda/bin/bcftools --version bcftools 1.3 Using htslib 1.3 Copyright (C) 2015 Genome Research Ltd. License Expat: The MIT/Expat license This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

chapmanb commented 8 years ago

Taehyong; Apologies for sending you in the wrong direction. After reading the error again, it looks like something is strange about your MuTect output. It's trying to concatenate together the parallelized variant calls over individual regions but finding that the VCF has different samples in some of the regions. I'm not sure how this would happen, but you could start debugging by looking at the samples trying to be concatenated:

zgrep ^#CHROM `cat /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result4/work/mutect/3500_S6_L001-files.list`

Those should all be identical with the same samples. If that doesn't provide any clues for your process you could post the result of that command and I can try to provide more help. Thanks for the work debugging this.

hyong2000 commented 8 years ago

Thanks for helping!

First thing I want to mention is that this worked before with the previous stable version (0.9.7), but it didn't work after updating bcbio to 0.9.8a0. So something must be happened between these versions.

Anyway, here is the result for the script that you gave me. https://gist.github.com/hyong2000/39e57a84c9fcbd4b4a7ec2054922b805. This generated all of list of the intermediate results for mutect variant calling.

zgrep ^#CHROM cat /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result4/work/mutect/3500_S6_L001-files.list

Now I looked at files of the region (chr1) that error occur and I could not find any problem. (all sample files in the generated list were found in the folder.) Here is the list of file in the folder, /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result4/work/mutect/chr1.

$ ls 3500*vcf.gz 3500_S6_L001-chr1_0_16044553-mutect-orig.vcf.gz 3500_S6_L001-chr1_0_16044553-mutect.vcf.gz 3500_S6_L001-chr1_0_16044553-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_0_16044553-somaticIndels.vcf.gz 3500_S6_L001-chr1_0_16044553.vcf.gz 3500_S6_L001-chr1_109217682_143283354-mutect-orig.vcf.gz 3500_S6_L001-chr1_109217682_143283354-mutect.vcf.gz 3500_S6_L001-chr1_109217682_143283354-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_109217682_143283354-somaticIndels.vcf.gz 3500_S6_L001-chr1_109217682_143283354.vcf.gz 3500_S6_L001-chr1_143767309_159883975-mutect-orig.vcf.gz 3500_S6_L001-chr1_143767309_159883975-mutect.vcf.gz 3500_S6_L001-chr1_143767309_159883975-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_143767309_159883975-somaticIndels.vcf.gz 3500_S6_L001-chr1_143767309_159883975.vcf.gz 3500_S6_L001-chr1_159912052_175645097-mutect-orig.vcf.gz 3500_S6_L001-chr1_159912052_175645097-mutect.vcf.gz 3500_S6_L001-chr1_159912052_175645097-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_159912052_175645097-somaticIndels.vcf.gz 3500_S6_L001-chr1_159912052_175645097.vcf.gz 3500_S6_L001-chr1_16044895_32081736-mutect-orig.vcf.gz 3500_S6_L001-chr1_16044895_32081736-mutect.vcf.gz 3500_S6_L001-chr1_16044895_32081736-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_16044895_32081736-somaticIndels.vcf.gz 3500_S6_L001-chr1_16044895_32081736.vcf.gz 3500_S6_L001-chr1_175734859_192880803-mutect-orig.vcf.gz 3500_S6_L001-chr1_175734859_192880803-mutect.vcf.gz 3500_S6_L001-chr1_175734859_192880803-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_175734859_192880803-somaticIndels.vcf.gz 3500_S6_L001-chr1_175734859_192880803.vcf.gz 3500_S6_L001-chr1_199878716_219943090-mutect-orig.vcf.gz 3500_S6_L001-chr1_199878716_219943090-mutect.vcf.gz 3500_S6_L001-chr1_199878716_219943090-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_199878716_219943090-somaticIndels.vcf.gz 3500_S6_L001-chr1_199878716_219943090.vcf.gz 3500_S6_L001-chr1_220500052_236240522-mutect-orig.vcf.gz 3500_S6_L001-chr1_220500052_236240522-mutect.vcf.gz 3500_S6_L001-chr1_220500052_236240522-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_220500052_236240522-somaticIndels.vcf.gz 3500_S6_L001-chr1_220500052_236240522.vcf.gz 3500_S6_L001-chr1_236245066_249250621-mutect-orig.vcf.gz 3500_S6_L001-chr1_236245066_249250621-mutect.vcf.gz 3500_S6_L001-chr1_236245066_249250621-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_236245066_249250621-somaticIndels.vcf.gz 3500_S6_L001-chr1_236245066_249250621.vcf.gz 3500_S6_L001-chr1_32082041_48267374-mutect-orig.vcf.gz 3500_S6_L001-chr1_32082041_48267374-mutect.vcf.gz 3500_S6_L001-chr1_32082041_48267374-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_32082041_48267374-somaticIndels.vcf.gz 3500_S6_L001-chr1_32082041_48267374.vcf.gz 3500_S6_L001-chr1_48691284_65312469-mutect-orig.vcf.gz 3500_S6_L001-chr1_48691284_65312469-mutect.vcf.gz 3500_S6_L001-chr1_48691284_65312469-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_48691284_65312469-somaticIndels.vcf.gz 3500_S6_L001-chr1_48691284_65312469.vcf.gz 3500_S6_L001-chr1_65323283_81025012-mutect-orig.vcf.gz 3500_S6_L001-chr1_65323283_81025012-mutect.vcf.gz 3500_S6_L001-chr1_65323283_81025012-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_65323283_81025012-somaticIndels.vcf.gz 3500_S6_L001-chr1_65323283_81025012.vcf.gz 3500_S6_L001-chr1_82644031_101324027-mutect-orig.vcf.gz 3500_S6_L001-chr1_82644031_101324027-mutect.vcf.gz 3500_S6_L001-chr1_82644031_101324027-somaticIndels-gatkann.vcf.gz 3500_S6_L001-chr1_82644031_101324027-somaticIndels.vcf.gz 3500_S6_L001-chr1_82644031_101324027.vcf.gz

If you need any other information, please let me know. Thanks!

chapmanb commented 8 years ago

Taehyong; Thanks much for the details. It appears as if VCFs where we skipped processing or have no variants get a single sample in the VCF file (3500_S6_L001) while those that were variant called have two sample names (3500_S6_L001 sample).

Seeing your output files it looks like you have somatic indel calling going on -- which caller are you using for this? I'm not sure where the generic sample name is coming from -- is it present in the calls in the somaticIndels.vcf.gz file? I'm guessing this might be an issue with calling using indelcaller with a tumor-only input so might need more details to help isolate the problem.

hyong2000 commented 8 years ago

Hi Chapman,

For mutect, indel calling was done by scalpel since mutect does not have indel variant calling function. Please see my configuration yaml file for this. (But, as I said, it worked well with the previous version (0.9.7) and this configuration.)

Let me know if you need other information. Thanks!

chapmanb commented 8 years ago

Thanks for the details. It looks like scalpel is now using sample for the default name in the FORMAT line when running with a single sample (non-paired). I pushed a fix to correctly translate this to the correct sample name after analysis. If you update to the latest development, you'll need to remove the mutect directory and re-run that analysis and it should give you correct VCFs now. Sorry about the problem and hope this gets your analysis finished.

hyong2000 commented 8 years ago

Got it! I will update to the latest development and try it again. Thank you!

hyong2000 commented 8 years ago

Hi Chapman, Everything worked correctly now. Thank you so much for your help!!!

hyong2000 commented 8 years ago

Hello! It looks like that I have a similar issue with different samples (reads) now. Please see my debug-log below. I am using bcbio version 0.9.8a0 (bcbio_nextgen.py -v).

########################################################################## [2016-07-07T13:54Z] ..##### ERROR ------------------------------------------------------------------------------------------ [2016-07-07T13:54Z] ##### ERROR stack trace [2016-07-07T13:54Z] java.lang.IllegalStateException: Key ADP found in VariantContext field INFO at chr1:3300162 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default. [2016-07-07T13:54Z] at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:177) [2016-07-07T13:54Z] at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115) [2016-07-07T13:54Z] at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:222) [2016-07-07T13:54Z] at org.broadinstitute.gatk.tools.CatVariants.execute(CatVariants.java:296) [2016-07-07T13:54Z] at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) [2016-07-07T13:54Z] at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) [2016-07-07T13:54Z] at org.broadinstitute.gatk.tools.CatVariants.main(CatVariants.java:312) [2016-07-07T13:54Z] ##### ERROR ------------------------------------------------------------------------------------------ [2016-07-07T13:54Z] ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-21-g80158d2): [2016-07-07T13:54Z] ##### ERROR [2016-07-07T13:54Z] ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. [2016-07-07T13:54Z] ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. [2016-07-07T13:54Z] ##### ERROR Visit our website and forum for extensive documentation and answers to [2016-07-07T13:54Z] ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk [2016-07-07T13:54Z] ##### ERROR [2016-07-07T13:54Z] ##### ERROR MESSAGE: Key ADP found in VariantContext field INFO at chr1:3300162 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default. [2016-07-07T13:54Z] ##### ERROR ------------------------------------------------------------------------------------------ [2016-07-07T13:54Z] bcftools concat variants [2016-07-07T13:54Z] Different number of samples in /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result/work/varscan/chr1/4615_S3-chr1_0_16221490.vcf.gz. Perhaps "bcftools merge" is what you are looking for? [2016-07-07T13:54Z] Uncaught exception occurred Traceback (most recent call last): File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run _do_run(cmd, checks, log_stdout) File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run raise subprocess.CalledProcessError(exitcode, error_msg) CalledProcessError: Command 'set -o pipefail; /project/ibilab/tools/bcbio/anaconda/bin/bcftools concat --allow-overlaps -O z --file-list /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result/work/varscan/4615_S3-files.list -o /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result/work/varscan/tx/tmp0lokL4/4615_S3.vcf.gz Different number of samples in /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result/work/varscan/chr1/4615_S3-chr1_0_16221490.vcf.gz. Perhaps "bcftools merge" is what you are looking for? ' returned non-zero exit status 255 ######################################################################

So, I tried to run the following on my command prompt to check whether it works or not.

/project/ibilab/tools/bcbio/anaconda/bin/bcftools concat --allow-overlaps -O z --file-list /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result/work/varscan/4615_S3-files.list -o /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result/work/varscan/testVarscan/4615_S3.vcf.gz Different number of samples in /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result/work/varscan/chr1/4615_S3-chr1_0_16221490.vcf.gz. Perhaps "bcftools merge" is what you are looking for?

I have no result and no detailed error message. I have no idea what to do next. Could you help me out? Thank you!

hyong2000 commented 8 years ago

I found a clue for this error from the error message below.

ERROR MESSAGE: Key ADP found in VariantContext field INFO at chr1:3300162 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

I checked a command for this error.

/project/ibilab/tools/bcbio/anaconda/bin/gatk-framework org.broadinstitute.gatk.tools.CatVariants -R /project/ibilab/tools/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -V /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result/work/varscan/4615_S3-files.list -out /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result/work/varscan/testVarscan/4615_S3.vcf.gz -assumeSorted -Xms750m -Xmx2000m -XX:+UseSerialGC To find the VCF file that generates the error, I checked the list of variants calling files from /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result/work/varscan/4615_S3-files.list. In the list, I manually checked files from the beginning and I found that the very first file does not include VCF headers nor variants (shown below).

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 4615_S3

Other than this file, they include VCF headers and variants as followings.

##fileformat=VCFv4.1
##source=VarScan2
##INFO=<ID=ADP,Number=1,Type=Integer,Description="Average per-sample depth of bases with Phred score >= 15">
##INFO=<ID=WT,Number=1,Type=Integer,Description="Number of samples called reference (wild-type)">
##INFO=<ID=HET,Number=1,Type=Integer,Description="Number of samples called heterozygous-variant">
##INFO=<ID=HOM,Number=1,Type=Integer,Description="Number of samples called homozygous-variant">
##INFO=<ID=NC,Number=1,Type=Integer,Description="Number of samples not called">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Raw Read Depth as reported by SAMtools">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Quality Read Depth of bases with Phred score >= 15">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=Float,Description="Variant allele frequency">
##FORMAT=<ID=PVAL,Number=1,Type=String,Description="P-value from Fisher's Exact Test">
##FORMAT=<ID=RBQ,Number=1,Type=Integer,Description="Average quality of reference-supporting bases (qual1)">
##FORMAT=<ID=ABQ,Number=1,Type=Integer,Description="Average quality of variant-supporting bases (qual2)">
##FORMAT=<ID=RDF,Number=1,Type=Integer,Description="Depth of reference-supporting bases on forward strand (reads1plus)">
##FORMAT=<ID=RDR,Number=1,Type=Integer,Description="Depth of reference-supporting bases on reverse strand (reads1minus)">
##FORMAT=<ID=ADF,Number=1,Type=Integer,Description="Depth of variant-supporting bases on forward strand (reads2plus)">
##FORMAT=<ID=ADR,Number=1,Type=Integer,Description="Depth of variant-supporting bases on reverse strand (reads2minus)">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  4615_S3
chr1    3300162 .       C       T       0       PASS    ADP=22;HET=1;HOM=0;NC=0;WT=0    GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    0/1:6:22:22:20:2:9.09%:2.4419E-1:16:15:20:0:2:0

For a test, I added the VCF header and a fake variant into the first VCF file and I found that the command worked well! So, my guess is that there should not be an empty variant file in the list of the variant file at /project/ibilab/projects/Martin_Carroll_DNAseqOfAML_Anne_2016_2/large_data/bcbio-result/work/varscan/4615_S3-files.list. Please fix this problem and let me know if you have any question. Thank you!

chapmanb commented 8 years ago

Taehyong; Thank you for the report and all the work tracking down the underlying issue. To fix this it looks like we'd need to add a call to vcf_has_variants (https://github.com/chapmanb/bcbio-nextgen/blob/abfa4b323af5d70cf3fde7e7c77308c2646b6216/bcbio/variation/vcfutils.py#L294) and remove files from the merge list that have no variants here:

https://github.com/chapmanb/bcbio-nextgen/blob/abfa4b323af5d70cf3fde7e7c77308c2646b6216/bcbio/variation/vcfutils.py#L206

If you feel up for testing that fixes your issue and sending in a pull request I'd be happy to merge it. Thank you again for all the work tracking down the underlying issue.

hyong2000 commented 8 years ago

Sure! I would like to try it whether it fixes my issue. Should I just upgrade bcbio and rerun it against my samples?

bcbio_nextgen.py upgrade -u development

chapmanb commented 8 years ago

Taehyong -- sorry for the confusion, I was trying to give details hoping you might want to fix and provide a pull request yourself. Happy to also take care of this myself and I pushed a fix now which should resolve it. If you update from development as you describe above you should be able to test. Hope this gets things working for you.

hyong2000 commented 8 years ago

Thank you! But I have similar problem for mutect2. When there is no called variant in a certain region of a chromosome, it makes the problem. I think it would be better to double-check codes of merging steps across different variant callers. Please find the error trace below. Thank you!!

[2016-07-17T14:22Z] Concat variant files
[2016-07-17T14:22Z] INFO  10:22:45,079 HelpFormatter - -------------------------------------------------------
[2016-07-17T14:22Z] INFO  10:22:45,081 HelpFormatter - Program Name: org.broadinstitute.gatk.tools.CatVariants
[2016-07-17T14:22Z] INFO  10:22:45,085 HelpFormatter - Program Args: -R /project/ibilab/tools/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -V /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/720-files.list -out /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/tx/tmp_O87mm/720.vcf.gz -assumeSorted
[2016-07-17T14:22Z] INFO  10:22:45,091 HelpFormatter - Executing as taehyong@node119.hpc.local on Linux 2.6.32-504.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_91-b15.
[2016-07-17T14:22Z] INFO  10:22:45,092 HelpFormatter - Date/Time: 2016/07/17 10:22:45
[2016-07-17T14:22Z] INFO  10:22:45,092 HelpFormatter - -------------------------------------------------------
[2016-07-17T14:22Z] INFO  10:22:45,092 HelpFormatter - -------------------------------------------------------
[2016-07-17T14:22Z] .........10.........20.........30.........40.........50.........60.........70.........80.........90.........100.........110.........120.........130.........140.........150.........160.........170.##### ERROR ------------------------------------------------------------------------------------------
[2016-07-17T14:22Z] ##### ERROR stack trace
[2016-07-17T14:22Z] htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file, for input source: /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/chr17/720-chr17_78458800_81195210.vcf.gz
[2016-07-17T14:22Z]     at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:90)
[2016-07-17T14:22Z]     at htsjdk.tribble.TabixFeatureReader.<init>(TabixFeatureReader.java:74)
[2016-07-17T14:22Z]     at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:85)
[2016-07-17T14:22Z]     at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:66)
[2016-07-17T14:22Z]     at org.broadinstitute.gatk.tools.CatVariants.getFeatureReader(CatVariants.java:188)
[2016-07-17T14:22Z]     at org.broadinstitute.gatk.tools.CatVariants.execute(CatVariants.java:280)
[2016-07-17T14:22Z]     at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
[2016-07-17T14:22Z]     at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
[2016-07-17T14:22Z]     at org.broadinstitute.gatk.tools.CatVariants.main(CatVariants.java:312)
[2016-07-17T14:22Z] Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
[2016-07-17T14:22Z]     at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119)
[2016-07-17T14:22Z]     at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:88)
[2016-07-17T14:22Z]     at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:41)
[2016-07-17T14:22Z]     at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:88)
[2016-07-17T14:22Z]     ... 8 more
[2016-07-17T14:22Z] ##### ERROR ------------------------------------------------------------------------------------------
[2016-07-17T14:22Z] ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-21-g80158d2):
[2016-07-17T14:22Z] ##### ERROR
[2016-07-17T14:22Z] ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
[2016-07-17T14:22Z] ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
[2016-07-17T14:22Z] ##### ERROR Visit our website and forum for extensive documentation and answers to
[2016-07-17T14:22Z] ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
[2016-07-17T14:22Z] ##### ERROR
[2016-07-17T14:22Z] ##### ERROR MESSAGE: Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file, for input source: /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/chr17/720-chr17_78458800_81195210.vcf.gz
[2016-07-17T14:22Z] ##### ERROR ------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/project/ibilab/tools/bcbio/tool/bin/bcbio_nextgen.py", line 226, in <module>
    main(**kwargs)
  File "/project/ibilab/tools/bcbio/tool/bin/bcbio_nextgen.py", line 43, in main
    run_main(**kwargs)
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 43, in run_main
    fc_dir, run_info_yaml)
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 87, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 151, in variant2pipeline
    samples = genotype.parallel_variantcall_region(samples, run_parallel)
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 176, in parallel_variantcall_region
    "variantcall_sample", "concat_variant_files",
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/split.py", line 35, in grouped_parallel_split_combine
    parallel_fn(combine_name, combine_args)
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 180, in __init__
    self.results = batch()
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 51, in wrapper
    return apply(f, *args, **kwargs)
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 199, in concat_variant_files
    def variantcall_sample(*args):
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/vcfutils.py", line 301, in concat_variant_files
    out_handle.write(fname + "\n")
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/project/ibilab/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command '/project/ibilab/tools/bcbio/anaconda/bin/gatk-framework org.broadinstitute.gatk.tools.CatVariants -R /project/ibilab/tools/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -V /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/720-files.list -out /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/tx/tmp_O87mm/720.vcf.gz -assumeSorted -Xms750m -Xmx2000m -XX:+UseSerialGC
INFO  10:22:45,079 HelpFormatter - ------------------------------------------------------- 
INFO  10:22:45,081 HelpFormatter - Program Name: org.broadinstitute.gatk.tools.CatVariants 
INFO  10:22:45,085 HelpFormatter - Program Args: -R /project/ibilab/tools/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -V /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/720-files.list -out /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/tx/tmp_O87mm/720.vcf.gz -assumeSorted 
INFO  10:22:45,091 HelpFormatter - Executing as taehyong@node119.hpc.local on Linux 2.6.32-504.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_91-b15. 
INFO  10:22:45,092 HelpFormatter - Date/Time: 2016/07/17 10:22:45 
INFO  10:22:45,092 HelpFormatter - ------------------------------------------------------- 
INFO  10:22:45,092 HelpFormatter - ------------------------------------------------------- 
.........10.........20.........30.........40.........50.........60.........70.........80.........90.........100.........110.........120.........130.........140.........150.........160.........170.##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file, for input source: /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/chr17/720-chr17_78458800_81195210.vcf.gz
    at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:90)
    at htsjdk.tribble.TabixFeatureReader.<init>(TabixFeatureReader.java:74)
    at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:85)
    at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:66)
    at org.broadinstitute.gatk.tools.CatVariants.getFeatureReader(CatVariants.java:188)
    at org.broadinstitute.gatk.tools.CatVariants.execute(CatVariants.java:280)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.tools.CatVariants.main(CatVariants.java:312)
Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
    at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119)
    at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:88)
    at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:41)
    at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:88)
    ... 8 more
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-21-g80158d2):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file, for input source: /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/chr17/720-chr17_78458800_81195210.vcf.gz
##### ERROR ------------------------------------------------------------------------------------------
' returned non-zero exit status 1
chapmanb commented 8 years ago

Taehyong -- sorry about the MuTect2 problems. We haven't seen this in testing so I'm not sure what the underlying issue is. From the errors, it looks like something is problematic about one of the input files:

/project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/chr17/720-chr17_78458800_81195210.vcf.gz

Would it be possible to share that file with us? Understanding what went wrong would help us either work around it or submit an issue upstream to the MuTect2 team. Thanks much for the help debugging.

hyong2000 commented 8 years ago

Thanks for looking this! 720-chr17_78458800_81195210.vcf.gz is an empty file. I don't think it is a problem of MuTect2 but it is a problem when variant result files were concatenated with an empty result file, which create an error due to no header or variants in it (an empty file). Please see the error message.

ERROR MESSAGE: Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file, for input source: /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/chr17/720-chr17_78458800_81195210.vcf.gz

If you need any other information, please let me know. Thank you!!

chapmanb commented 8 years ago

Taehyong, thanks for the quick followup with details about that file. The underlying issue is that MuTect2 is producing an empty file here -- it should produce a VCF header with no variants, not a completely empty file. The concatenate variants fails on this, but the real error is upstream. As an aside, do you have the latest development version (or stable release from this morning)? That should also skip over this and the traceback looks like it might be from an older version.

Practically, it would be helpful to see if you can reproduce the GATK MuTect2 empty file error outside of bcbio. If you do:

grep MuTect2 log/bcbio-variation-commands.log | grep 720-chr17_78458800_81195210.vcf.gz

You hopefully can find the initial command that created this. If you re-run that, can you replicate the creation of an empty file? If so, you might want to upgrade to GATK 3.6 (also supported in the latest development) and see if that also has a problem. If so, we should try to prepare and file a bug upstream to the GATK team.

Thanks again for the help debugging.

hyong2000 commented 8 years ago

Thanks for quick reply! I sort-of agree on this. But please consider what I thought to fix the problem. (Please also check the discussion above of this thread since this can be solved with the same way).

As you can see in the following command line, CatVariants tries to concatenate variants result files from the list, '/project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/720-files.list', which includes the empty file, 720-chr17_78458800_81195210.vcf.gz. So, I think that it would be better to check whether variant results file has actual variants or not (empty or not) when 720-files.list is created. This could be a better way to avoid error (concatenating an empty file) regardless of generating empty files from GATK.

/project/ibilab/tools/bcbio/anaconda/bin/gatk-framework org.broadinstitute.gatk.tools.CatVariants -R /project/ibilab/tools/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -V /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/720-files.list -out /project/ibilab/projects/Devraj_Basu_HNSCC_Devraj_2016_3/large_data/bcbio-result5/work/mutect2/tx/tmp_O87mm/720.vcf.gz -assumeSorted -Xms750m -Xmx2000m -XX:+UseSerialGC

chapmanb commented 8 years ago

Taehyong -- I definitely agree with you that it should catch this during the concatenation, and the fix I put in place earlier for the other problem should also avoid using empty files. Is it possible you don't have the latest development version with this fix in place? Hopefully upgrading to the latest development will skip over this if you re-try.

hyong2000 commented 8 years ago

Right! I upgraded to 0.9.9 and it works! Thank you so much!