Closed biocyberman closed 10 years ago
This is one entry of my YAML file:
- algorithm:
aligner: false
coverage_depth: high
coverage_interval: exome
mark_duplicates: picard
platform: SOLiD
quality_format: Standard
realign: gatk
recalibrate: gatk
variant_regions: /input/SeqCap_EZ_Exome_v3_primary_GRCh37.bed
variantcaller: [gatk,freebayes]
analysis: variant2
metadata:
batch: WES1_Joined
description: BIN1_1
files: ../input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam
genome_build: GRCh37wd
According to this thread on GATK forum, http://gatkforums.broadinstitute.org/discussion/3382/no-error-or-warning-for-ambiguous-bases: GATK ignores variant calling on ambiguous reference bases. In my case, using Freebayes does not seem to do the same. So Freebayes calls variant for the R
reference base. This output then is processed with GATK and hence the error.
Thanks much for the report. You're right on, this is an incompatibility between FreeBayes and GATK's view of the world. FreeBayes will call on these non-N ambiguous bases where any GATK post-processing steps won't like this.
To practically solve this issue, are you trying to get calls in the decoy regions? If not, you should be able to adjust your variant_regions
file to not have intervals in hs37d5 and then it won't try to call there. I haven't run into these style of ambiguous bases outside of the decoy inputs.
If you need calls in hs37d5, then we could try to post-filter the FreeBayes output to avoid these issues but I'd rather not add in additional sed style steps if we could avoid it.
What do you think?
Thanks for taking a look in to the issue. I actually do not have hs37d5 in the variantregions because the bed file only covers amplicon regions of canonical chromosomes in UCSC's hg19 release. It does not even have additional contigs GL0, NC__ of GRCh37 distributed via Broad's website. Your comment brought my attention to analysis_blocks.bed. This file is generated during bcbio-nextgen run. It does not follow strictly intervals in the variant_regions
files and adds bonus
chromosomes/contigs, each whole contig as one interval. For example:
GL000225.1 0 211173
GL000192.1 0 547496
NC_007605 0 171823
hs37d5 0 35477943
This behavior may be nice on other cases, but not desirable in my case.
I am thinking that it is possible to make a temporary fix like you suggested, but to the analysis_blocks.bed
file instead.
Thanks for the follow up. My apologies, I tested this on a small subset and it was excluding these extra haplotype chromosomes if they were not present in the variant_regions
file, but there was a bug when doing it on whole genomes. I pushed a fix, so if you update with bcbio_nextgen.py upgrade -u development
, remove the old analysis regions file ( rm analysis_regions.bed
) and re-run, it should now correctly exclude variant calling from these regions and avoid the issue.
Thanks again for the help debugging and let me know if you run into any other problems.
Good news is that the run could pass the point it stopped before. But I encounter one more problem still. It has something to do with GATK expectations in its new version:
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/utils.pyc in wrapper(*args=([{'analysis': 'variant2', 'callable_bam': '/work/prealign/BIN1_1/WES1.join.lsc.BIN1_1-1-1.bam', 'combine': {'work_bam': {'extras': ['/work/bampre...S1.join.lsc.BIN1_1-1-1-1_1569791_3160760-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_3297357_5876029-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_5923272_7528025-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_7700384_9305666-prep.bam', '/work/bampre...1.join.lsc.BIN1_1-1-1-1_9306961_11008911-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_11009653_12628510-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_12632745_14220310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_14506886_16091783-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_16093812_17662772-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_17664469_19235227-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_19236778_20809310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_20809548_22379276-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_22404921_24018401-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_24019021_25599232-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_25608351_27177172-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_27177569_28765018-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_28784253_30357958-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_31185983_32757878-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_32768177_34374551-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_34383646_35972774-prep.bam', ...], 'out': '/work/bamprep/BIN1_1/WES1.join.lsc.BIN1_1-1-1-prep.bam'}}, 'config': {'algorithm': {'aligner': False, 'callable_regions': '/work/analysis_blocks.bed', 'coverage_depth': 'high', 'coverage_interval': 'exome', 'mark_duplicates': 'picard', 'num_cores': 1, 'orig_variantcaller': ['gatk', 'freebayes'], 'platform': 'SOLiD', 'quality_format': 'Standard', 'realign': 'gatk', ...}, 'resources': {'AlienTrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': [...]}, 'alientrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': [...]}, 'bcbio_variation': {'dir': '/usr/local/share/bcbio-nextgen/share/java/bcbio_variation', 'jvm_opts': [...]}, 'bowtie': {'cores': 16}, 'bowtie2': {'cores': 16}, 'bwa': {'cmd': 'bwa', 'cores': 16}, 'cufflinks': {'cores': 16}, 'freebayes': {'memory': '2g'}, 'gatk': {'dir': '/usr/local/share/bcbio-nextgen/toolplus/gatk/3.1-1-g07a4bf8', 'jvm_opts': [...]}, 'gatk-haplotype': {'jvm_opts': [...]}, ...}}, 'description': 'BIN1_1', 'dirs': {'fastq': None, 'galaxy': '/usr/local/share/bcbio-nextgen/galaxy', 'work': '/work'}, 'files': ['/input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam'], 'genome_build': 'GRCh37wd', 'genome_resources': {'aliases': {'ensembl': 'human', 'human': True, 'snpeff': 'GRCh37.74'}, 'rnaseq': {'transcriptome_index': {'tophat': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/tophat/GRCh37_transcriptome.ver'}, 'transcripts': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts.gtf', 'transcripts_mask': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts-mask.gtf'}, 'variation': {'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, 'version': 7}, 'lane': '1', ...}],), **kwargs={})
42 def map_wrap(f):
43 """Wrap standard function to easily pass into 'map' processing.
44 """
45 @functools.wraps(f)
46 def wrapper(*args, **kwargs):
---> 47 return apply(f, *args, **kwargs)
48 return wrapper
49
50
51 def transform_to(ext):
...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.pyc in postprocess_variants(*args=({'analysis': 'variant2', 'callable_bam': '/work/prealign/BIN1_1/WES1.join.lsc.BIN1_1-1-1.bam', 'combine': {'work_bam': {'extras': ['/work/bampre...S1.join.lsc.BIN1_1-1-1-1_1569791_3160760-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_3297357_5876029-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_5923272_7528025-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_7700384_9305666-prep.bam', '/work/bampre...1.join.lsc.BIN1_1-1-1-1_9306961_11008911-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_11009653_12628510-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_12632745_14220310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_14506886_16091783-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_16093812_17662772-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_17664469_19235227-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_19236778_20809310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_20809548_22379276-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_22404921_24018401-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_24019021_25599232-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_25608351_27177172-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_27177569_28765018-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_28784253_30357958-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_31185983_32757878-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_32768177_34374551-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_34383646_35972774-prep.bam', ...], 'out': '/work/bamprep/BIN1_1/WES1.join.lsc.BIN1_1-1-1-prep.bam'}}, 'config': {'algorithm': {'aligner': False, 'callable_regions': '/work/analysis_blocks.bed', 'coverage_depth': 'high', 'coverage_interval': 'exome', 'mark_duplicates': 'picard', 'num_cores': 1, 'orig_variantcaller': ['gatk', 'freebayes'], 'platform': 'SOLiD', 'quality_format': 'Standard', 'realign': 'gatk', ...}, 'resources': {'AlienTrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'alientrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'bcbio_variation': {'dir': '/usr/local/share/bcbio-nextgen/share/java/bcbio_variation', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'bowtie': {'cores': 16}, 'bowtie2': {'cores': 16}, 'bwa': {'cmd': 'bwa', 'cores': 16}, 'cufflinks': {'cores': 16}, 'freebayes': {'memory': '2g'}, 'gatk': {'dir': '/usr/local/share/bcbio-nextgen/toolplus/gatk/3.1-1-g07a4bf8', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'gatk-haplotype': {'jvm_opts': ['-Xms2g', '-Xmx5500m']}, ...}}, 'description': 'BIN1_1', 'dirs': {'fastq': None, 'galaxy': '/usr/local/share/bcbio-nextgen/galaxy', 'work': '/work'}, 'files': ['/input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam'], 'genome_build': 'GRCh37wd', 'genome_resources': {'aliases': {'ensembl': 'human', 'human': True, 'snpeff': 'GRCh37.74'}, 'rnaseq': {'transcriptome_index': {'tophat': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/tophat/GRCh37_transcriptome.ver'}, 'transcripts': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts.gtf', 'transcripts_mask': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts-mask.gtf'}, 'variation': {'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, 'version': 7}, 'lane': '1', ...},))
56 def split_variants_by_sample(*args):
57 return multi.split_variants_by_sample(*args)
58
59 @utils.map_wrap
60 def postprocess_variants(*args):
---> 61 return variation.postprocess_variants(*args)
62
63 @utils.map_wrap
64 def pipeline_summary(*args):
65 return qcsummary.pipeline_summary(*args)
...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/variation.pyc in postprocess_variants(data={'analysis': 'variant2', 'callable_bam': '/work/prealign/BIN1_1/WES1.join.lsc.BIN1_1-1-1.bam', 'combine': {'work_bam': {'extras': ['/work/bampre...S1.join.lsc.BIN1_1-1-1-1_1569791_3160760-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_3297357_5876029-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_5923272_7528025-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_7700384_9305666-prep.bam', '/work/bampre...1.join.lsc.BIN1_1-1-1-1_9306961_11008911-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_11009653_12628510-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_12632745_14220310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_14506886_16091783-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_16093812_17662772-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_17664469_19235227-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_19236778_20809310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_20809548_22379276-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_22404921_24018401-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_24019021_25599232-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_25608351_27177172-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_27177569_28765018-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_28784253_30357958-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_31185983_32757878-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_32768177_34374551-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_34383646_35972774-prep.bam', ...], 'out': '/work/bamprep/BIN1_1/WES1.join.lsc.BIN1_1-1-1-prep.bam'}}, 'config': {'algorithm': {'aligner': False, 'callable_regions': '/work/analysis_blocks.bed', 'coverage_depth': 'high', 'coverage_interval': 'exome', 'mark_duplicates': 'picard', 'num_cores': 1, 'orig_variantcaller': ['gatk', 'freebayes'], 'platform': 'SOLiD', 'quality_format': 'Standard', 'realign': 'gatk', ...}, 'resources': {'AlienTrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'alientrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'bcbio_variation': {'dir': '/usr/local/share/bcbio-nextgen/share/java/bcbio_variation', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'bowtie': {'cores': 16}, 'bowtie2': {'cores': 16}, 'bwa': {'cmd': 'bwa', 'cores': 16}, 'cufflinks': {'cores': 16}, 'freebayes': {'memory': '2g'}, 'gatk': {'dir': '/usr/local/share/bcbio-nextgen/toolplus/gatk/3.1-1-g07a4bf8', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'gatk-haplotype': {'jvm_opts': ['-Xms2g', '-Xmx5500m']}, ...}}, 'description': 'BIN1_1', 'dirs': {'fastq': None, 'galaxy': '/usr/local/share/bcbio-nextgen/galaxy', 'work': '/work'}, 'files': ['/input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam'], 'genome_build': 'GRCh37wd', 'genome_resources': {'aliases': {'ensembl': 'human', 'human': True, 'snpeff': 'GRCh37.74'}, 'rnaseq': {'transcriptome_index': {'tophat': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/tophat/GRCh37_transcriptome.ver'}, 'transcripts': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts.gtf', 'transcripts_mask': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts-mask.gtf'}, 'variation': {'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, 'version': 7}, 'lane': '1', ...})
12 cur_name = "%s, %s" % (data["name"][-1], get_variantcaller(data))
13 logger.info("Finalizing variant calls: %s" % cur_name)
14 if data["work_bam"] and data.get("vrn_file"):
15 data["vrn_file"] = variant_filtration(data["vrn_file"], data["sam_ref"],
16 data["genome_resources"]["variation"],
---> 17 data)
18 logger.info("Calculating variation effects for %s" % cur_name)
19 ann_vrn_file = effects.snpeff_effects(data)
20 if ann_vrn_file:
21 data["vrn_file"] = ann_vrn_file
...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.pyc in variant_filtration(call_file='/work/gatk/WES1.join.lsc.BIN1_1-1-1-variants.vcf.gz', ref_file='/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/seq/GRCh37wd.fa', vrn_files={'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, data={'analysis': 'variant2', 'callable_bam': '/work/prealign/BIN1_1/WES1.join.lsc.BIN1_1-1-1.bam', 'combine': {'work_bam': {'extras': ['/work/bampre...S1.join.lsc.BIN1_1-1-1-1_1569791_3160760-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_3297357_5876029-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_5923272_7528025-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_7700384_9305666-prep.bam', '/work/bampre...1.join.lsc.BIN1_1-1-1-1_9306961_11008911-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_11009653_12628510-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_12632745_14220310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_14506886_16091783-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_16093812_17662772-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_17664469_19235227-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_19236778_20809310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_20809548_22379276-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_22404921_24018401-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_24019021_25599232-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_25608351_27177172-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_27177569_28765018-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_28784253_30357958-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_31185983_32757878-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_32768177_34374551-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_34383646_35972774-prep.bam', ...], 'out': '/work/bamprep/BIN1_1/WES1.join.lsc.BIN1_1-1-1-prep.bam'}}, 'config': {'algorithm': {'aligner': False, 'callable_regions': '/work/analysis_blocks.bed', 'coverage_depth': 'high', 'coverage_interval': 'exome', 'mark_duplicates': 'picard', 'num_cores': 1, 'orig_variantcaller': ['gatk', 'freebayes'], 'platform': 'SOLiD', 'quality_format': 'Standard', 'realign': 'gatk', ...}, 'resources': {'AlienTrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'alientrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'bcbio_variation': {'dir': '/usr/local/share/bcbio-nextgen/share/java/bcbio_variation', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'bowtie': {'cores': 16}, 'bowtie2': {'cores': 16}, 'bwa': {'cmd': 'bwa', 'cores': 16}, 'cufflinks': {'cores': 16}, 'freebayes': {'memory': '2g'}, 'gatk': {'dir': '/usr/local/share/bcbio-nextgen/toolplus/gatk/3.1-1-g07a4bf8', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'gatk-haplotype': {'jvm_opts': ['-Xms2g', '-Xmx5500m']}, ...}}, 'description': 'BIN1_1', 'dirs': {'fastq': None, 'galaxy': '/usr/local/share/bcbio-nextgen/galaxy', 'work': '/work'}, 'files': ['/input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam'], 'genome_build': 'GRCh37wd', 'genome_resources': {'aliases': {'ensembl': 'human', 'human': True, 'snpeff': 'GRCh37.74'}, 'rnaseq': {'transcriptome_index': {'tophat': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/tophat/GRCh37_transcriptome.ver'}, 'transcripts': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts.gtf', 'transcripts_mask': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts-mask.gtf'}, 'variation': {'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, 'version': 7}, 'lane': '1', ...})
107 """Filter variant calls using Variant Quality Score Recalibration.
108
109 Newer GATK with Haplotype calling has combined SNP/indel filtering.
110 """
111 caller = data["config"]["algorithm"].get("variantcaller")
--> 112 call_file = ploidy.filter_vcf_by_sex(call_file, data)
113 if caller in ["freebayes"]:
114 return vfilter.freebayes(call_file, ref_file, vrn_files, data)
115 # no additional filtration for callers that filter as part of call process
116 elif caller in ["samtools", "varscan", "mutect"]:
...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/ploidy.pyc in filter_vcf_by_sex(vcf_file='/work/gatk/WES1.join.lsc.BIN1_1-1-1-variants.vcf.gz', data={'analysis': 'variant2', 'callable_bam': '/work/prealign/BIN1_1/WES1.join.lsc.BIN1_1-1-1.bam', 'combine': {'work_bam': {'extras': ['/work/bampre...S1.join.lsc.BIN1_1-1-1-1_1569791_3160760-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_3297357_5876029-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_5923272_7528025-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_7700384_9305666-prep.bam', '/work/bampre...1.join.lsc.BIN1_1-1-1-1_9306961_11008911-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_11009653_12628510-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_12632745_14220310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_14506886_16091783-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_16093812_17662772-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_17664469_19235227-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_19236778_20809310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_20809548_22379276-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_22404921_24018401-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_24019021_25599232-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_25608351_27177172-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_27177569_28765018-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_28784253_30357958-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_31185983_32757878-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_32768177_34374551-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_34383646_35972774-prep.bam', ...], 'out': '/work/bamprep/BIN1_1/WES1.join.lsc.BIN1_1-1-1-prep.bam'}}, 'config': {'algorithm': {'aligner': False, 'callable_regions': '/work/analysis_blocks.bed', 'coverage_depth': 'high', 'coverage_interval': 'exome', 'mark_duplicates': 'picard', 'num_cores': 1, 'orig_variantcaller': ['gatk', 'freebayes'], 'platform': 'SOLiD', 'quality_format': 'Standard', 'realign': 'gatk', ...}, 'resources': {'AlienTrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'alientrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'bcbio_variation': {'dir': '/usr/local/share/bcbio-nextgen/share/java/bcbio_variation', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'bowtie': {'cores': 16}, 'bowtie2': {'cores': 16}, 'bwa': {'cmd': 'bwa', 'cores': 16}, 'cufflinks': {'cores': 16}, 'freebayes': {'memory': '2g'}, 'gatk': {'dir': '/usr/local/share/bcbio-nextgen/toolplus/gatk/3.1-1-g07a4bf8', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'gatk-haplotype': {'jvm_opts': ['-Xms2g', '-Xmx5500m']}, ...}}, 'description': 'BIN1_1', 'dirs': {'fastq': None, 'galaxy': '/usr/local/share/bcbio-nextgen/galaxy', 'work': '/work'}, 'files': ['/input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam'], 'genome_build': 'GRCh37wd', 'genome_resources': {'aliases': {'ensembl': 'human', 'human': True, 'snpeff': 'GRCh37.74'}, 'rnaseq': {'transcriptome_index': {'tophat': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/tophat/GRCh37_transcriptome.ver'}, 'transcripts': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts.gtf', 'transcripts_mask': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts-mask.gtf'}, 'variation': {'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, 'version': 7}, 'lane': '1', ...})
104 with utils.open_gzipsafe(vcf_file) as in_handle:
105 for line in in_handle:
106 if line.startswith("#"):
107 out_handle.write(line)
108 else:
--> 109 line = _fix_line_ploidy(line, sex)
110 if line:
111 out_handle.write(line)
112 if orig_out_file.endswith(".gz"):
113 out_file = vcfutils.bgzip_and_index(out_file, data["config"])
...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/ploidy.pyc in _fix_line_ploidy(line='Y\t8149244\trs377384336\tA\tG\t331.2\t.\tAC=1;AF=1;AN=1...2;MQ0=0;MQRankSum=0;QD=33.12;ReadPosRankSum=0;DB\n', sex='')
76 return _to_haploid(parts)
77 else:
78 return line
79 elif chrom == "Y":
80 if sex != "female":
---> 81 return _to_haploid(parts)
82 else:
83 return line
84
85 def filter_vcf_by_sex(vcf_file, data):
...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/ploidy.pyc in _to_haploid(parts=['Y', '8149244', 'rs377384336', 'A', 'G', '331.2', '.', 'AC=1;AF=1;AN=1;BaseQRankSum=1.393;DP=10;Dels=0;F...2;MQ0=0;MQRankSum=0;QD=33.12;ReadPosRankSum=0;DB\n'])
54 """Check if a variant call is homozygous variant, convert to haploid.
55 XXX Needs generalization or use of a standard VCF library.
56 """
57 finfo = dict(zip(parts[-2].split(":"), parts[-1].strip().split(":")))
58 pat = re.compile(r"\||/")
---> 59 calls = set(pat.split(finfo["GT"]))
60 if len(calls) == 1:
61 gt_index = parts[-2].split(":").index("GT")
I missed the last line:
KeyError: 'GT'
Sorry about the problem and thanks for the detailed report. It looks like this is an issue with trying to correctly handle ploidy conversions. The code was not considering the no-call case where there is no genotype present.
I pushed a fix to avoid this issue so if you update again it should hopefully work cleanly. Thanks again for the patience and please let us know if you run into any other issues.
No problem, I really appreciate your prompt responses and fixes. So far the run has finished without further errors. However there are some questions about the final output. I am not sure if it is a good idea to continue posting here. Please decide and feel free to create a new issue based on these questions yourself.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
Currently, it looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO
It seems that GATK and Snpeff's annotation process have ovewritten the information(?). More importantly, I ran several samples together and got back the results which do not have any information about samples, I can't use the output for further interpretation. The two following problems are less urgent to me because I do not use those output anytime soon. It's no problem to follow up here -- our main goal is to solve your problems and get some variant calls.
This is definitely unexpected and seems like something went very wrong. You've lost your sample variant calls at some point but I'm not sure why. The ploidy error above looks like it was indicative of the problem.
The best way to debug is to trace back and see where the calls were lost. If you look into gatk/*-sort-variants.tar.gz
and freebayes/*-sort-variants.tar.gz
, do those files have sample genotype information, or are they missing? If those are missing, you can look in the individual batch calls before combining (freebayes/chr22/*WES1_Joined.vcf.gz
) to see if those have sample calls.
My guess is that the sample names in the BAM files do not match the sample description
in your YAML file, and you're losing the samples when selecting them from the batch file. If that's the case, you can either adjust these names to match or use bam_clean: picard
(https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#alignment) to fix the names in the BAM file before doing the downstream processing.
Hope this explains things. If so, I could add a check to raise an error earlier on in the process to avoid running the whole thing and not getting useful output. Sorry again about the issue.
I looked at the output final
directory. I don't have *-sort-variants.tar.gz
files. However freebayes/chr22/*WES1_Joined.vcf.gz
does have columns for samples
. I have tried reruninng the whole run with different work
directory, but same YAML
file. The output still looks the same. I put in bam_clean: picard
and rerunning in another work
dir now. I will update when the run finishes.
I hope the bam_clean: picard
resolved the issue. I added in a check to the pre-aligned BAM that should catch this case up front instead of resulting in confusing output. Thanks again for the report and hope your new calls look good.
You are absolutely right about the differences between sample names in YAML file and in BAM file. I tooked the BAM's IDs instead of SMs to put in the YAML config. My bad. It's very nice that the pipline now provides checks on this. It's getting more and more fool-proof :-) Thanks again for the work.
Hello Brad, I tried doing variant calling with gatk and freebayes in the same analysis run using bcbio-nextgen v. 0.7.8. Content of line number
49410
mentioned at the end of the error message is this:hs37d5 31708115 . R G 282.863 . AC=8;AF=1;LEN=1;TYPE=snp GT 1|1 1|1 1|1 1|1
I still haven't figured out what to change in the run YAML file to make Freebayes output compatitble with GATK.