bcbio / bcbio-nextgen-vm

Run bcbio-nextgen genomic sequencing analyses using isolated containers and virtual machines
MIT License

bam_clean error: Bad Input #148

Closed mortunco closed 8 years ago

mortunco commented 8 years ago

Hi,

I have two BAM files (normal and tumor) and would like to call variants across these samples. Since I don't need alignment, I used bam_clean and bam_sort, as suggested in the bcbio documentation. Looking at the error, I can't tell at which point I passed the wrong input. These BAM files come from the ICGC consortium, so I'm not sure what I can do. Should I skip the BAM sort and clean steps?

During the run I came across the following error:

[2016-04-17T12:27Z] Timing: organize samples
[2016-04-17T12:27Z] multiprocessing: organize_samples
[2016-04-17T12:27Z] Using input YAML configuration: /mnt/work/bcbio_sample-forvm.yaml
[2016-04-17T12:44Z] Checking sample YAML configuration: /mnt/work/bcbio_sample-forvm.yaml
[2016-04-17T12:44Z] Downloading GRCh37 samtools from AWS
[2016-04-17T12:46Z] Downloading GRCh37 samtools from AWS
[2016-04-17T12:46Z] Downloading GRCh37 samtools from AWS
[2016-04-17T12:46Z] Testing minimum versions of installed programs
[2016-04-17T12:46Z] Timing: alignment preparation
[2016-04-17T12:46Z] multiprocessing: prep_align_inputs
[2016-04-17T12:46Z] multiprocessing: disambiguate_split
[2016-04-17T12:46Z] Timing: alignment
[2016-04-17T12:46Z] multiprocessing: process_alignment
[2016-04-17T22:07Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command '/usr/local/share/bcbio-nextgen/anaconda/bin/gatk-framework -Xms750m -Xmx1600m -XX:+UseSerialGC -Djava.io.tmpdir=/mnt/work/tx/tmpQp9Xj2 -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -T PrintReads -R /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa -I /mnt/work/bamclean/DO51159-Normal/normal-reorder-fixrgs.bam --out /mnt/work/bamclean/DO51159-Normal/tx/tmpbrspAF/normal-reorder-fixrgs-gatkfilter.bam --filter_mismatching_base_and_quals --filter_bases_not_stored --filter_reads_with_N_cigar --fix_misencoded_quality_scores
/usr/local/share/bcbio-nextgen/anaconda/bin/gatk-framework: line 7: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
INFO  22:07:02,266 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  22:07:02,276 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-65-g2434e49, Compiled 2015/10/09 18:46:40 
INFO  22:07:02,276 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  22:07:02,276 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  22:07:02,279 HelpFormatter - Program Args: -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -T PrintReads -R /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa -I /mnt/work/bamclean/DO51159-Normal/normal-reorder-fixrgs.bam --out /mnt/work/bamclean/DO51159-Normal/tx/tmpbrspAF/normal-reorder-fixrgs-gatkfilter.bam --filter_mismatching_base_and_quals --filter_bases_not_stored --filter_reads_with_N_cigar --fix_misencoded_quality_scores 
INFO  22:07:02,349 HelpFormatter - Executing as ubuntu@frontend001 on Linux 3.13.0-85-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_95-b00. 
INFO  22:07:02,349 HelpFormatter - Date/Time: 2016/04/17 22:07:02 
INFO  22:07:02,350 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  22:07:02,350 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  22:07:02,803 GenomeAnalysisEngine - Strictness is SILENT 
INFO  22:07:02,930 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
INFO  22:07:02,937 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  22:07:02,989 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04 
INFO  22:07:03,166 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  22:07:03,169 GenomeAnalysisEngine - Done preparing for traversal 
INFO  22:07:03,169 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  22:07:03,170 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  22:07:03,170 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime 
INFO  22:07:03,177 ReadShardBalancer$1 - Loading BAM index data 
INFO  22:07:03,179 ReadShardBalancer$1 - Done loading BAM index data 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.4-65-g2434e49): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool
##### ERROR ------------------------------------------------------------------------------------------
' returned non-zero exit status 1
Uncaught exception occurred
Traceback (most recent call last):
  File "/home/ubuntu/install/bcbio-vm/data/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/home/ubuntu/install/bcbio-vm/data/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'docker attach --no-stdin 4b75acae24049a12503db425d9fe7b1ce72a9ac7bf9b143fc002cc73407c7376
INFO  22:07:03,169 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  22:07:03,170 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  22:07:03,170 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime 
INFO  22:07:03,177 ReadShardBalancer$1 - Loading BAM index data 
INFO  22:07:03,179 ReadShardBalancer$1 - Done loading BAM index data 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.4-65-g2434e49): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool
##### ERROR ------------------------------------------------------------------------------------------
' returned non-zero exit status 1
Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 226, in <module>
    main(**kwargs)
  File "/usr/local/bin/bcbio_nextgen.py", line 43, in main
    run_main(**kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 39, in run_main
    fc_dir, run_info_yaml)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 123, in variant2pipeline
    samples = run_parallel("process_alignment", samples)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 180, in __init__
    self.results = batch()
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 51, in wrapper
    return apply(f, *args, **kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 67, in process_alignment
    return sample.process_alignment(*args)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 120, in process_alignment
    data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/cleanbam.py", line 30, in picard_prep
    return _filter_bad_reads(rg_bam, ref_file, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/cleanbam.py", line 51, in _filter_bad_reads
    do.run(cmd, "Filter problem reads")
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command '/usr/local/share/bcbio-nextgen/anaconda/bin/gatk-framework -Xms750m -Xmx1600m -XX:+UseSerialGC -Djava.io.tmpdir=/mnt/work/tx/tmpQp9Xj2 -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -T PrintReads -R /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa -I /mnt/work/bamclean/DO51159-Normal/normal-reorder-fixrgs.bam --out /mnt/work/bamclean/DO51159-Normal/tx/tmpbrspAF/normal-reorder-fixrgs-gatkfilter.bam --filter_mismatching_base_and_quals --filter_bases_not_stored --filter_reads_with_N_cigar --fix_misencoded_quality_scores
/usr/local/share/bcbio-nextgen/anaconda/bin/gatk-framework: line 7: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
INFO  22:07:02,266 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  22:07:02,276 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-65-g2434e49, Compiled 2015/10/09 18:46:40 
INFO  22:07:02,276 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  22:07:02,276 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  22:07:02,279 HelpFormatter - Program Args: -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -T PrintReads -R /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa -I /mnt/work/bamclean/DO51159-Normal/normal-reorder-fixrgs.bam --out /mnt/work/bamclean/DO51159-Normal/tx/tmpbrspAF/normal-reorder-fixrgs-gatkfilter.bam --filter_mismatching_base_and_quals --filter_bases_not_stored --filter_reads_with_N_cigar --fix_misencoded_quality_scores 
INFO  22:07:02,349 HelpFormatter - Executing as ubuntu@frontend001 on Linux 3.13.0-85-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_95-b00. 
INFO  22:07:02,349 HelpFormatter - Date/Time: 2016/04/17 22:07:02 
INFO  22:07:02,350 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  22:07:02,350 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  22:07:02,803 GenomeAnalysisEngine - Strictness is SILENT 
INFO  22:07:02,930 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
INFO  22:07:02,937 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  22:07:02,989 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04 
INFO  22:07:03,166 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  22:07:03,169 GenomeAnalysisEngine - Done preparing for traversal 
INFO  22:07:03,169 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  22:07:03,170 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  22:07:03,170 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime 
INFO  22:07:03,177 ReadShardBalancer$1 - Loading BAM index data 
INFO  22:07:03,179 ReadShardBalancer$1 - Done loading BAM index data 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.4-65-g2434e49): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool
##### ERROR ------------------------------------------------------------------------------------------
' returned non-zero exit status 1
' returned non-zero exit status 1

I would also like to share my config file, in case I did something wrong:

ubuntu@frontend001:/encrypted/project9/work/config$ cat deneme7.yaml 
details:
- algorithm:
    aligner: false
    bam_clean: picard
    bam_sort: coordinate
    ensemble:
      numpass: 2
    indelcaller: scalpel
    platform: illumina
    quality_format: illumina
    realign: false
    recalibrate: false
    remove_lcr: true
    variantcaller:
    - mutect
    - vardict
    - varscan
    - freebayes
  analysis: variant2
  description: DO51159-Normal
  files:
  - s3://tuncproject/icgcrun/input/normal.bam
  genome_build: GRCh37
  metadata:
    batch: ICGC
    phenotype: normal
- algorithm:
    aligner: false
    bam_clean: picard
    bam_sort: coordinate
    ensemble:
      numpass: 2
    indelcaller: scalpel
    platform: illumina
    quality_format: illumina
    realign: false
    recalibrate: false
    remove_lcr: true
    variantcaller:
    - mutect
    - vardict
    - varscan
    - freebayes
  analysis: variant2
  description: DO51159-Tumor
  files:
  - s3://tuncproject/icgcrun/input/tumor.bam
  genome_build: GRCh37
  metadata:
    batch: ICGC
    phenotype: tumor
fc_date: '2015-04-14'
fc_name: ICGC-trials
resources:
  gatk:
    jar: s3://tuncproject/gatktools/GenomeAnalysisTK.jar
  mutect:
    jar: s3://tuncproject/gatktools/mutect-1.1.7.jar
upload:
  bucket: tuncproject
  dir: ../final
  folder: icgcrun/input/final
  method: s3
  region: us-east-1

chapmanb commented 8 years ago

Tunc; You're exactly right in your diagnosis -- there is something problematic about the ICGC BAM files that GATK does not like. My suggestion would be to align with bwa through bcbio so you get clean inputs:

aligner: bwa

If you don't want to do that you could also try cleaning them as you suggest by adding:

bam_clean: picard

If neither of these works cleanly and the original file is a mix of multiple quality encoding types, it may take some manual work on your side to clean up the quality scores in the input files, but fingers crossed one of these will work. Hope this helps.
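
To check whether an input BAM really is a mix of quality encodings, one option is to look at the ASCII range of each read's quality string. The following is a minimal sketch, not part of bcbio or GATK: the function names `classify_qual` and `summarize` are hypothetical, and the thresholds are only the common conventions (Phred+33 typically spans ASCII 33 to 74, Phred+64 spans 64 to 104, older Solexa+64 starts at 59). Reads whose characters all fall in the overlap cannot be classified from the string alone.

```python
from collections import Counter


def classify_qual(qual):
    """Guess the encoding of one SAM QUAL string from its ASCII range.

    Heuristic only: thresholds follow common conventions, not any tool's API.
    """
    if qual in ("", "*"):
        return "missing"
    codes = [ord(c) for c in qual]
    lo, hi = min(codes), max(codes)
    if lo < 59:
        # Characters this low cannot occur under a +64 offset.
        return "phred33"
    if hi > 74:
        # Characters this high are unusual for Illumina Phred+33 data.
        return "phred64"
    return "ambiguous"


def summarize(sam_lines):
    """Count guessed encodings over an iterable of SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        counts[classify_qual(fields[10])] += 1  # column 11 is QUAL
    return counts
```

As a usage idea, the text output of `samtools view normal.bam` could be fed to `summarize` (for example via `summarize(sys.stdin)` in a small script). If both `phred33` and `phred64` show up in meaningful numbers, the file is probably mixed. If nearly everything is `phred33`, it may be worth checking whether `quality_format: illumina` in the config above matches the data; that is a guess on my side, not something confirmed in this thread.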

mortunco commented 8 years ago

Dear Brad;

If you look at my configuration file, I have already set bam_clean: picard. Did you mean keeping only bam_clean and removing `bam_sort`, or trying that option together with it? I have already included both in my configuration file and still got this error.

Thank you,

Tunc

chapmanb commented 8 years ago

Tunc; My mistake, sorry. I guess the cleanup we have is not handling this input file. I don't believe bam_sort will help with this problem since it looks like it's due to misencoded quality scores. Hopefully realigning with bwa will resolve the issue.
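
If realignment is not an option, the manual cleanup mentioned earlier in the thread would come down to re-encoding quality strings read by read. The following is only a sketch under stated assumptions: `to_phred33` is a hypothetical helper, it assumes the misencoded reads use a plain Phred+64 offset (not Solexa), and it leaves any read it cannot classify untouched rather than guessing. Realigning from the original reads, as suggested above, is likely the safer route.

```python
OFFSET_SHIFT = 64 - 33  # difference between the Phred+64 and Phred+33 offsets


def to_phred33(qual):
    """Return (quality_string, status) for one SAM QUAL string.

    Hypothetical helper: converts only reads that clearly look Phred+64.
    status is one of "missing", "kept", "ambiguous", "converted".
    """
    if qual in ("", "*"):
        return qual, "missing"
    codes = [ord(c) for c in qual]
    lo, hi = min(codes), max(codes)
    if lo < 64:
        # Contains characters that cannot occur under Phred+64: leave as-is.
        return qual, "kept"
    if hi <= 74:
        # Every character is valid under both encodings: do not guess.
        return qual, "ambiguous"
    converted = "".join(chr(c - OFFSET_SHIFT) for c in codes)
    return converted, "converted"
```

Counting how many reads end up `ambiguous` gives a rough idea of how much of the file a per-read approach like this cannot resolve.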