bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
986 stars 354 forks source link

Freebayes output incompatible with GATK #370

Closed biocyberman closed 10 years ago

biocyberman commented 10 years ago

Hello Brad, I tried doing variant calling with gatk and freebayes in the same analysis run using bcbio-nextgen v. 0.7.8. Content of line number 49410 mentioned at the end of the error message is this: hs37d5 31708115 . R G 282.863 . AC=8;AF=1;LEN=1;TYPE=snp GT 1|1 1|1 1|1 1|1 I still haven't figured out what to change in the run YAML file to make Freebayes output compatitble with GATK.

subprocess.CalledProcessError: Command 'java -Xms750m -Xmx2500m -jar GenomeAnalysisTK.jar 
-T VariantAnnotator 
-R /data/genomes/Hsapiens/GRCh37wd/seq/GRCh37wd.fa 
--variant /work/freebayes/hs37d5/freebayes-out.vcf.gz 
--dbsnp /usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf 
--out /work/freebayes/hs37d5/tx/tmplNF1GP/freebayes-out-gatkann.vcf.gz 
-L /work/freebayes/hs37d5/freebayes-out.vcf.gz 
-I /work/bamprep/BIN1_1/hs37d5/WES1.join.lsc.BIN1_1-1-1-hs37d5_0_35477943-prep.bam 
-I /work/bamprep/BIN1_2/hs37d5/WES1.join.lsc.BIN1_2-2-1-hs37d5_0_35477943-prep.bam 
-I /work/bamprep/BIN1_3/hs37d5/WES1.join.lsc.BIN1_3-3-1-hs37d5_0_35477943-prep.bam 
-I /work/bamprep/BIN1_4/hs37d5/WES1.join.lsc.BIN1_4-4-1-hs37d5_0_35477943-prep.bam 
-A BaseQualityRankSumTest -A FisherStrand -A GCContent -A HaplotypeScore -A HomopolymerRun -A MappingQualityRankSumTest 
-A MappingQualityZero -A QualByDepth -A ReadPosRankSumTest -A RMSMappingQuality -A DepthPerAlleleBySample -A Coverage 
-U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment'
INFO  10:59:38,795 HelpFormatter - Executing as user@localhost on Linux 3.5.0-45-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. 
INFO  10:59:38,795 HelpFormatter - Date/Time: 2014/03/23 10:59:38 
INFO  10:59:38,795 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:59:38,795 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:59:39,488 GenomeAnalysisEngine - Strictness is SILENT 
INFO  10:59:39,709 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 
INFO  10:59:39,725 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  10:59:39,869 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.14 
INFO  10:59:43,156 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: File associated with name /work/freebayes/hs37d5/freebayes-out.vcf.gz is malformed: Problem reading the interval file caused by The provided VCF file is malformed at approximately line number 49410: unparsable vcf record with allele R, for input source: /work/freebayes/hs37d5/freebayes-out.vcf.gz
biocyberman commented 10 years ago

This is one entry of my YAML file:

- algorithm:
    aligner: false
    coverage_depth: high
    coverage_interval: exome
    mark_duplicates: picard
    platform: SOLiD
    quality_format: Standard
    realign: gatk
    recalibrate: gatk
    variant_regions: /input/SeqCap_EZ_Exome_v3_primary_GRCh37.bed
    variantcaller: [gatk,freebayes]
  analysis: variant2
  metadata:
    batch: WES1_Joined
  description: BIN1_1
  files: ../input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam
  genome_build: GRCh37wd
biocyberman commented 10 years ago

According to this thread on GATK forum, http://gatkforums.broadinstitute.org/discussion/3382/no-error-or-warning-for-ambiguous-bases: GATK ignores variant calling on ambiguous reference bases. In my case, using Freebayes does not seem to do the same. So Freebayes calls variant for the R reference base. This output then is processed with GATK and hence the error.

chapmanb commented 10 years ago

Thanks much for the report. You're right on, this is an incompatibility between FreeBayes and GATK's view of the world. FreeBayes will call on these non-N ambiguous bases where any GATK post-processing steps won't like this.

To practically solve this issue, are you trying to get calls in the decoy regions? If not, you should be able to adjust your variant_regions file to not have intervals in hs37d5 and then it won't try to call there. I haven't run into these style of ambiguous bases outside of the decoy inputs.

If you need calls in hs37d5, then we could try to post-filter the FreeBayes output to avoid these issues but I'd rather not add in additional sed style steps if we could avoid it.

What do you think?

biocyberman commented 10 years ago

Thanks for taking a look in to the issue. I actually do not have hs37d5 in the variantregions because the bed file only covers amplicon regions of canonical chromosomes in UCSC's hg19 release. It does not even have additional contigs GL0, NC__ of GRCh37 distributed via Broad's website. Your comment brought my attention to analysis_blocks.bed. This file is generated during bcbio-nextgen run. It does not follow strictly intervals in the variant_regions files and adds bonus chromosomes/contigs, each whole contig as one interval. For example:

GL000225.1      0       211173
GL000192.1      0       547496
NC_007605       0       171823
hs37d5  0       35477943

This behavior may be nice on other cases, but not desirable in my case. I am thinking that it is possible to make a temporary fix like you suggested, but to the analysis_blocks.bed file instead.

chapmanb commented 10 years ago

Thanks for the follow up. My apologies, I tested this on a small subset and it was excluding these extra haplotype chromosomes if they were not present in the variant_regions file, but there was a bug when doing it on whole genomes. I pushed a fix, so if you update with bcbio_nextgen.py upgrade -u development, remove the old analysis regions file ( rm analysis_regions.bed) and re-run, it should now correctly exclude variant calling from these regions and avoid the issue.

Thanks again for the help debugging and let me know if you run into any other problems.

biocyberman commented 10 years ago

Good news is that the run could pass the point it stopped before. But I encounter one more problem still. It has something to do with GATK expectations in its new version:

/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/utils.pyc in wrapper(*args=([{'analysis': 'variant2', 'callable_bam': '/work/prealign/BIN1_1/WES1.join.lsc.BIN1_1-1-1.bam', 'combine': {'work_bam': {'extras': ['/work/bampre...S1.join.lsc.BIN1_1-1-1-1_1569791_3160760-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_3297357_5876029-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_5923272_7528025-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_7700384_9305666-prep.bam', '/work/bampre...1.join.lsc.BIN1_1-1-1-1_9306961_11008911-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_11009653_12628510-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_12632745_14220310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_14506886_16091783-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_16093812_17662772-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_17664469_19235227-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_19236778_20809310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_20809548_22379276-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_22404921_24018401-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_24019021_25599232-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_25608351_27177172-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_27177569_28765018-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_28784253_30357958-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_31185983_32757878-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_32768177_34374551-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_34383646_35972774-prep.bam', ...], 'out': '/work/bamprep/BIN1_1/WES1.join.lsc.BIN1_1-1-1-prep.bam'}}, 'config': {'algorithm': {'aligner': False, 'callable_regions': '/work/analysis_blocks.bed', 'coverage_depth': 'high', 'coverage_interval': 'exome', 'mark_duplicates': 'picard', 'num_cores': 1, 'orig_variantcaller': ['gatk', 'freebayes'], 'platform': 'SOLiD', 'quality_format': 'Standard', 'realign': 'gatk', ...}, 'resources': {'AlienTrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': [...]}, 'alientrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': [...]}, 'bcbio_variation': {'dir': '/usr/local/share/bcbio-nextgen/share/java/bcbio_variation', 'jvm_opts': [...]}, 'bowtie': {'cores': 16}, 'bowtie2': {'cores': 16}, 'bwa': {'cmd': 'bwa', 'cores': 16}, 'cufflinks': {'cores': 16}, 'freebayes': {'memory': '2g'}, 'gatk': {'dir': '/usr/local/share/bcbio-nextgen/toolplus/gatk/3.1-1-g07a4bf8', 'jvm_opts': [...]}, 'gatk-haplotype': {'jvm_opts': [...]}, ...}}, 'description': 'BIN1_1', 'dirs': {'fastq': None, 'galaxy': '/usr/local/share/bcbio-nextgen/galaxy', 'work': '/work'}, 'files': ['/input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam'], 'genome_build': 'GRCh37wd', 'genome_resources': {'aliases': {'ensembl': 'human', 'human': True, 'snpeff': 'GRCh37.74'}, 'rnaseq': {'transcriptome_index': {'tophat': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/tophat/GRCh37_transcriptome.ver'}, 'transcripts': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts.gtf', 'transcripts_mask': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts-mask.gtf'}, 'variation': {'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, 'version': 7}, 'lane': '1', ...}],), **kwargs={})
     42 def map_wrap(f):
     43     """Wrap standard function to easily pass into 'map' processing.
     44     """
     45     @functools.wraps(f)
     46     def wrapper(*args, **kwargs):
---> 47         return apply(f, *args, **kwargs)
     48     return wrapper
     49 
     50 
     51 def transform_to(ext):

...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.pyc in postprocess_variants(*args=({'analysis': 'variant2', 'callable_bam': '/work/prealign/BIN1_1/WES1.join.lsc.BIN1_1-1-1.bam', 'combine': {'work_bam': {'extras': ['/work/bampre...S1.join.lsc.BIN1_1-1-1-1_1569791_3160760-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_3297357_5876029-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_5923272_7528025-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_7700384_9305666-prep.bam', '/work/bampre...1.join.lsc.BIN1_1-1-1-1_9306961_11008911-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_11009653_12628510-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_12632745_14220310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_14506886_16091783-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_16093812_17662772-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_17664469_19235227-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_19236778_20809310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_20809548_22379276-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_22404921_24018401-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_24019021_25599232-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_25608351_27177172-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_27177569_28765018-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_28784253_30357958-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_31185983_32757878-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_32768177_34374551-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_34383646_35972774-prep.bam', ...], 'out': '/work/bamprep/BIN1_1/WES1.join.lsc.BIN1_1-1-1-prep.bam'}}, 'config': {'algorithm': {'aligner': False, 'callable_regions': '/work/analysis_blocks.bed', 'coverage_depth': 'high', 'coverage_interval': 'exome', 'mark_duplicates': 'picard', 'num_cores': 1, 'orig_variantcaller': ['gatk', 'freebayes'], 'platform': 'SOLiD', 'quality_format': 'Standard', 'realign': 'gatk', ...}, 'resources': {'AlienTrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'alientrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'bcbio_variation': {'dir': '/usr/local/share/bcbio-nextgen/share/java/bcbio_variation', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'bowtie': {'cores': 16}, 'bowtie2': {'cores': 16}, 'bwa': {'cmd': 'bwa', 'cores': 16}, 'cufflinks': {'cores': 16}, 'freebayes': {'memory': '2g'}, 'gatk': {'dir': '/usr/local/share/bcbio-nextgen/toolplus/gatk/3.1-1-g07a4bf8', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'gatk-haplotype': {'jvm_opts': ['-Xms2g', '-Xmx5500m']}, ...}}, 'description': 'BIN1_1', 'dirs': {'fastq': None, 'galaxy': '/usr/local/share/bcbio-nextgen/galaxy', 'work': '/work'}, 'files': ['/input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam'], 'genome_build': 'GRCh37wd', 'genome_resources': {'aliases': {'ensembl': 'human', 'human': True, 'snpeff': 'GRCh37.74'}, 'rnaseq': {'transcriptome_index': {'tophat': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/tophat/GRCh37_transcriptome.ver'}, 'transcripts': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts.gtf', 'transcripts_mask': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts-mask.gtf'}, 'variation': {'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, 'version': 7}, 'lane': '1', ...},))
     56 def split_variants_by_sample(*args):
     57     return multi.split_variants_by_sample(*args)
     58 
     59 @utils.map_wrap
     60 def postprocess_variants(*args):
---> 61     return variation.postprocess_variants(*args)
     62 
     63 @utils.map_wrap
     64 def pipeline_summary(*args):
     65     return qcsummary.pipeline_summary(*args)

...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/variation.pyc in postprocess_variants(data={'analysis': 'variant2', 'callable_bam': '/work/prealign/BIN1_1/WES1.join.lsc.BIN1_1-1-1.bam', 'combine': {'work_bam': {'extras': ['/work/bampre...S1.join.lsc.BIN1_1-1-1-1_1569791_3160760-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_3297357_5876029-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_5923272_7528025-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_7700384_9305666-prep.bam', '/work/bampre...1.join.lsc.BIN1_1-1-1-1_9306961_11008911-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_11009653_12628510-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_12632745_14220310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_14506886_16091783-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_16093812_17662772-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_17664469_19235227-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_19236778_20809310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_20809548_22379276-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_22404921_24018401-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_24019021_25599232-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_25608351_27177172-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_27177569_28765018-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_28784253_30357958-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_31185983_32757878-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_32768177_34374551-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_34383646_35972774-prep.bam', ...], 'out': '/work/bamprep/BIN1_1/WES1.join.lsc.BIN1_1-1-1-prep.bam'}}, 'config': {'algorithm': {'aligner': False, 'callable_regions': '/work/analysis_blocks.bed', 'coverage_depth': 'high', 'coverage_interval': 'exome', 'mark_duplicates': 'picard', 'num_cores': 1, 'orig_variantcaller': ['gatk', 'freebayes'], 'platform': 'SOLiD', 'quality_format': 'Standard', 'realign': 'gatk', ...}, 'resources': {'AlienTrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'alientrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'bcbio_variation': {'dir': '/usr/local/share/bcbio-nextgen/share/java/bcbio_variation', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'bowtie': {'cores': 16}, 'bowtie2': {'cores': 16}, 'bwa': {'cmd': 'bwa', 'cores': 16}, 'cufflinks': {'cores': 16}, 'freebayes': {'memory': '2g'}, 'gatk': {'dir': '/usr/local/share/bcbio-nextgen/toolplus/gatk/3.1-1-g07a4bf8', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'gatk-haplotype': {'jvm_opts': ['-Xms2g', '-Xmx5500m']}, ...}}, 'description': 'BIN1_1', 'dirs': {'fastq': None, 'galaxy': '/usr/local/share/bcbio-nextgen/galaxy', 'work': '/work'}, 'files': ['/input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam'], 'genome_build': 'GRCh37wd', 'genome_resources': {'aliases': {'ensembl': 'human', 'human': True, 'snpeff': 'GRCh37.74'}, 'rnaseq': {'transcriptome_index': {'tophat': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/tophat/GRCh37_transcriptome.ver'}, 'transcripts': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts.gtf', 'transcripts_mask': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts-mask.gtf'}, 'variation': {'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, 'version': 7}, 'lane': '1', ...})
     12     cur_name = "%s, %s" % (data["name"][-1], get_variantcaller(data))
     13     logger.info("Finalizing variant calls: %s" % cur_name)
     14     if data["work_bam"] and data.get("vrn_file"):
     15         data["vrn_file"] = variant_filtration(data["vrn_file"], data["sam_ref"],
     16                                               data["genome_resources"]["variation"],
---> 17                                               data)
     18         logger.info("Calculating variation effects for %s" % cur_name)
     19         ann_vrn_file = effects.snpeff_effects(data)
     20         if ann_vrn_file:
     21             data["vrn_file"] = ann_vrn_file

...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.pyc in variant_filtration(call_file='/work/gatk/WES1.join.lsc.BIN1_1-1-1-variants.vcf.gz', ref_file='/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/seq/GRCh37wd.fa', vrn_files={'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, data={'analysis': 'variant2', 'callable_bam': '/work/prealign/BIN1_1/WES1.join.lsc.BIN1_1-1-1.bam', 'combine': {'work_bam': {'extras': ['/work/bampre...S1.join.lsc.BIN1_1-1-1-1_1569791_3160760-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_3297357_5876029-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_5923272_7528025-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_7700384_9305666-prep.bam', '/work/bampre...1.join.lsc.BIN1_1-1-1-1_9306961_11008911-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_11009653_12628510-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_12632745_14220310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_14506886_16091783-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_16093812_17662772-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_17664469_19235227-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_19236778_20809310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_20809548_22379276-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_22404921_24018401-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_24019021_25599232-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_25608351_27177172-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_27177569_28765018-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_28784253_30357958-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_31185983_32757878-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_32768177_34374551-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_34383646_35972774-prep.bam', ...], 'out': '/work/bamprep/BIN1_1/WES1.join.lsc.BIN1_1-1-1-prep.bam'}}, 'config': {'algorithm': {'aligner': False, 'callable_regions': '/work/analysis_blocks.bed', 'coverage_depth': 'high', 'coverage_interval': 'exome', 'mark_duplicates': 'picard', 'num_cores': 1, 'orig_variantcaller': ['gatk', 'freebayes'], 'platform': 'SOLiD', 'quality_format': 'Standard', 'realign': 'gatk', ...}, 'resources': {'AlienTrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'alientrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'bcbio_variation': {'dir': '/usr/local/share/bcbio-nextgen/share/java/bcbio_variation', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'bowtie': {'cores': 16}, 'bowtie2': {'cores': 16}, 'bwa': {'cmd': 'bwa', 'cores': 16}, 'cufflinks': {'cores': 16}, 'freebayes': {'memory': '2g'}, 'gatk': {'dir': '/usr/local/share/bcbio-nextgen/toolplus/gatk/3.1-1-g07a4bf8', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'gatk-haplotype': {'jvm_opts': ['-Xms2g', '-Xmx5500m']}, ...}}, 'description': 'BIN1_1', 'dirs': {'fastq': None, 'galaxy': '/usr/local/share/bcbio-nextgen/galaxy', 'work': '/work'}, 'files': ['/input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam'], 'genome_build': 'GRCh37wd', 'genome_resources': {'aliases': {'ensembl': 'human', 'human': True, 'snpeff': 'GRCh37.74'}, 'rnaseq': {'transcriptome_index': {'tophat': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/tophat/GRCh37_transcriptome.ver'}, 'transcripts': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts.gtf', 'transcripts_mask': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts-mask.gtf'}, 'variation': {'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, 'version': 7}, 'lane': '1', ...})
    107     """Filter variant calls using Variant Quality Score Recalibration.
    108 
    109     Newer GATK with Haplotype calling has combined SNP/indel filtering.
    110     """
    111     caller = data["config"]["algorithm"].get("variantcaller")
--> 112     call_file = ploidy.filter_vcf_by_sex(call_file, data)
    113     if caller in ["freebayes"]:
    114         return vfilter.freebayes(call_file, ref_file, vrn_files, data)
    115     # no additional filtration for callers that filter as part of call process
    116     elif caller in ["samtools", "varscan", "mutect"]:

...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/ploidy.pyc in filter_vcf_by_sex(vcf_file='/work/gatk/WES1.join.lsc.BIN1_1-1-1-variants.vcf.gz', data={'analysis': 'variant2', 'callable_bam': '/work/prealign/BIN1_1/WES1.join.lsc.BIN1_1-1-1.bam', 'combine': {'work_bam': {'extras': ['/work/bampre...S1.join.lsc.BIN1_1-1-1-1_1569791_3160760-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_3297357_5876029-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_5923272_7528025-prep.bam', '/work/bampre...S1.join.lsc.BIN1_1-1-1-1_7700384_9305666-prep.bam', '/work/bampre...1.join.lsc.BIN1_1-1-1-1_9306961_11008911-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_11009653_12628510-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_12632745_14220310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_14506886_16091783-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_16093812_17662772-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_17664469_19235227-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_19236778_20809310-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_20809548_22379276-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_22404921_24018401-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_24019021_25599232-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_25608351_27177172-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_27177569_28765018-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_28784253_30357958-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_31185983_32757878-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_32768177_34374551-prep.bam', '/work/bampre....join.lsc.BIN1_1-1-1-1_34383646_35972774-prep.bam', ...], 'out': '/work/bamprep/BIN1_1/WES1.join.lsc.BIN1_1-1-1-prep.bam'}}, 'config': {'algorithm': {'aligner': False, 'callable_regions': '/work/analysis_blocks.bed', 'coverage_depth': 'high', 'coverage_interval': 'exome', 'mark_duplicates': 'picard', 'num_cores': 1, 'orig_variantcaller': ['gatk', 'freebayes'], 'platform': 'SOLiD', 'quality_format': 'Standard', 'realign': 'gatk', ...}, 'resources': {'AlienTrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'alientrimmer': {'dir': '/usr/local/share/bcbio-nextgen/share/java/AlienTrimmer', 'jvm_opts': ['-Xms750m', '-Xmx2000m']}, 'bcbio_variation': {'dir': '/usr/local/share/bcbio-nextgen/share/java/bcbio_variation', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'bowtie': {'cores': 16}, 'bowtie2': {'cores': 16}, 'bwa': {'cmd': 'bwa', 'cores': 16}, 'cufflinks': {'cores': 16}, 'freebayes': {'memory': '2g'}, 'gatk': {'dir': '/usr/local/share/bcbio-nextgen/toolplus/gatk/3.1-1-g07a4bf8', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'gatk-haplotype': {'jvm_opts': ['-Xms2g', '-Xmx5500m']}, ...}}, 'description': 'BIN1_1', 'dirs': {'fastq': None, 'galaxy': '/usr/local/share/bcbio-nextgen/galaxy', 'work': '/work'}, 'files': ['/input/EXOME_1/WES1.join.lsc.BIN1_1-1-1.bam'], 'genome_build': 'GRCh37wd', 'genome_resources': {'aliases': {'ensembl': 'human', 'human': True, 'snpeff': 'GRCh37.74'}, 'rnaseq': {'transcriptome_index': {'tophat': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/tophat/GRCh37_transcriptome.ver'}, 'transcripts': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts.gtf', 'transcripts_mask': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/rnaseq/ref-transcripts-mask.gtf'}, 'variation': {'cosmic': '../variation/cosmic-v67_20131024-GRCh37.vcf', 'dbsnp': '/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/variation/dbsnp_138.vcf', 'train_1000g_omni': '../variation/1000G_omni2.5.vcf', 'train_hapmap': '../variation/hapmap_3.3.vcf', 'train_indels': '../variation/Mills_and_1000G_gold_standard.indels.vcf'}, 'version': 7}, 'lane': '1', ...})
    104                 with utils.open_gzipsafe(vcf_file) as in_handle:
    105                     for line in in_handle:
    106                         if line.startswith("#"):
    107                             out_handle.write(line)
    108                         else:
--> 109                             line = _fix_line_ploidy(line, sex)
    110                             if line:
    111                                 out_handle.write(line)
    112         if orig_out_file.endswith(".gz"):
    113             out_file = vcfutils.bgzip_and_index(out_file, data["config"])

...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/ploidy.pyc in _fix_line_ploidy(line='Y\t8149244\trs377384336\tA\tG\t331.2\t.\tAC=1;AF=1;AN=1...2;MQ0=0;MQRankSum=0;QD=33.12;ReadPosRankSum=0;DB\n', sex='')
     76             return _to_haploid(parts)
     77         else:
     78             return line
     79     elif chrom == "Y":
     80         if sex != "female":
---> 81             return _to_haploid(parts)
     82     else:
     83         return line
     84 
     85 def filter_vcf_by_sex(vcf_file, data):

...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/ploidy.pyc in _to_haploid(parts=['Y', '8149244', 'rs377384336', 'A', 'G', '331.2', '.', 'AC=1;AF=1;AN=1;BaseQRankSum=1.393;DP=10;Dels=0;F...2;MQ0=0;MQRankSum=0;QD=33.12;ReadPosRankSum=0;DB\n'])
     54     """Check if a variant call is homozygous variant, convert to haploid.
     55     XXX Needs generalization or use of a standard VCF library.
     56     """
     57     finfo = dict(zip(parts[-2].split(":"), parts[-1].strip().split(":")))
     58     pat = re.compile(r"\||/")
---> 59     calls = set(pat.split(finfo["GT"]))
     60     if len(calls) == 1:
     61         gt_index = parts[-2].split(":").index("GT")
biocyberman commented 10 years ago

I missed the last line: KeyError: 'GT'

chapmanb commented 10 years ago

Sorry about the problem and thanks for the detailed report. It looks like this is an issue with trying to correctly handle ploidy conversions. The code was not considering the no-call case where there is no genotype present.

I pushed a fix to avoid this issue so if you update again it should hopefully work cleanly. Thanks again for the patience and please let us know if you run into any other issues.

biocyberman commented 10 years ago

No problem, I really appreciate your prompt responses and fixes. So far the run has finished without further errors. However there are some questions about the final output. I am not sure if it is a good idea to continue posting here. Please decide and feel free to create a new issue based on these questions yourself.

  1. Sample names, sample specific data are missing in final VCF files. Previously, VCF table head was like: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 Currently, it looks like this: #CHROM POS ID REF ALT QUAL FILTER INFO It seems that GATK and Snpeff's annotation process have ovewritten the information(?). More importantly, I ran several samples together and got back the results which do not have any information about samples, I can't use the output for further interpretation. The two following problems are less urgent to me because I do not use those output anytime soon.
  2. Suspect: project-summary.yaml, mettrics sections contains incorrect information. I got something like this: metrics: Transition/Transversion: '0' Variations (in dbSNP): 0 This is not correct, I believe.
  3. Empty SQLite files in final output. The databases files contains schema but contains no variant data.
chapmanb commented 10 years ago

It's no problem to follow up here -- our main goal is to solve your problems and get some variant calls.

This is definitely unexpected and seems like something went very wrong. You've lost your sample variant calls at some point but I'm not sure why. The ploidy error above looks like it was indicative of the problem.

The best way to debug is to trace back and see where the calls were lost. If you look into gatk/*-sort-variants.tar.gz and freebayes/*-sort-variants.tar.gz, do those files have sample genotype information, or are they missing? If those are missing, you can look in the individual batch calls before combining (freebayes/chr22/*WES1_Joined.vcf.gz) to see if those have sample calls.

My guess is that the sample names in the BAM files do not match the sample description in your YAML file, and you're losing the samples when selecting them from the batch file. If that's the case, you can either adjust these names to match or use bam_clean: picard (https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#alignment) to fix the names in the BAM file before doing the downstream processing.

Hope this explains things. If so, I could add a check to raise an error earlier on in the process to avoid running the whole thing and not getting useful output. Sorry again about the issue.

biocyberman commented 10 years ago

I looked at the output final directory. I don't have *-sort-variants.tar.gz files. However freebayes/chr22/*WES1_Joined.vcf.gz does have columns for samples. I have tried reruninng the whole run with different work directory, but same YAML file. The output still looks the same. I put in bam_clean: picard and rerunning in another work dir now. I will update when the run finishes.

chapmanb commented 10 years ago

I hope the bam_clean: picard resolved the issue. I added in a check to the pre-aligned BAM that should catch this case up front instead of resulting in confusing output. Thanks again for the report and hope your new calls look good.

biocyberman commented 10 years ago

You are absolutely right about the differences between sample names in YAML file and in BAM file. I tooked the BAM's IDs instead of SMs to put in the YAML config. My bad. It's very nice that the pipline now provides checks on this. It's getting more and more fool-proof :-) Thanks again for the work.