vcf2db error for *germline-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz

ahy1221 commented 6 years ago

Hi, I am using bcbio cancer germline and somatic mutation pipeline for my own exome data. The config for my pipeline is

details:
  - analysis: variant2
    genome_build: GRCh37
# In order to do paired variant calling, samples should belong to the
# same batch ("batch" under "metadata" below") and have a "phenotype"
# field stating either "normal" or "tumor". For each batch there
# should be a sample with "tumor" phenotype and a sample with "normal"
# phenotype (no more than two samples per batch)
    metadata:
        batch: your-batch-name
        phenotype: normal # or "normal"
    algorithm:
        aligner: bwa
        variantcaller: 
            somatic: [vardict, mutect2, strelka2]
            germline: [gatk-haplotype]
        ensemble:
            numpass: 2
#    for targetted projects, set the region
            variant_regions: /WPSnew/heyao/reference/exomCapture/S04380110_SureSelect_Human_All_Exon_V5_Covered.chr.bed
        svcaller: [cnvkit]
    #hlacaller: [optitype]

The bcbio pipeline work so well until the annotation step using gemini-vcf2db. I am getting an error caused by vcf2db command:

[2018-10-27T22:26Z] bix.go:221: chromosome GL000225.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/hg19.CpG.bed.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000192.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/genetic_map_HapMapII_GRCh37.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000225.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/hg19.dgv.bed.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000192.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/stam.125cells.dnaseI.hg19.bed.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000225.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/wgEncodeRegTfbsClusteredV2.cell_count.20130213.bed.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000225.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/genetic_map_HapMapII_GRCh37.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000225.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/stam.125cells.dnaseI.hg19.bed.gz
[2018-10-27T22:26Z] vcfanno.go:241: annotated 228965 variants in 458.57 seconds (499.3 / second)
[2018-10-27T22:26Z] tabix index D20180108_normal-germline-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz
[2018-10-27T22:26Z] GEMINI: create database with vcf2db
[2018-10-27T22:27Z] not in VCF: D20171215_normal,D20171109_normal,D20180116_normal,D20180110_normal
[2018-10-27T22:27Z] /WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value 'D20180108_normal-germline'. (this warning may be suppressed after 10 occurrences)
[2018-10-27T22:27Z]   (util.ellipses_string(value),))
[2018-10-27T22:27Z] /WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
[2018-10-27T22:27Z]   (util.ellipses_string(value),))
[2018-10-27T22:27Z] Traceback (most recent call last):
[2018-10-27T22:27Z]   File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 916, in <module>
[2018-10-27T22:27Z]     impacts_extras=a.impacts_field, aok=a.a_ok)
[2018-10-27T22:27Z]   File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 232, in __init__
[2018-10-27T22:27Z]     self.load()
[2018-10-27T22:27Z]   File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 317, in load
[2018-10-27T22:27Z]     i = self._load(self.cache, create=True, start=1)
[2018-10-27T22:27Z]   File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 280, in _load
[2018-10-27T22:27Z]     self.genotype_counts[0][gt_types == self.vcf.HOM_REF] += 1
[2018-10-27T22:27Z] IndexError: boolean index did not match indexed array along dimension 0; dimension is 1 but corresponding boolean dimension is 2
[2018-10-27T22:27Z] Uncaught exception occurred

Traceback (most recent call last):
File "/WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 23, in run _do_run(cmd, checks, log_stdout, env=env)
  File "/WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 103, in _do_run raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command '/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/gemini/D20180108_normal-germline-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/gemini/D20180108_normal-germline-gatk-haplotype-nomultiallelic.ped /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/bcbiotx/tmpfe1BUB/D20180108_normal-germline-gatk-haplotype.db
not in VCF: D20171215_normal,D20171109_normal,D20180116_normal,D20180110_normal
/WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value 'D20180108_normal-germline'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
/WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
Traceback (most recent call last):
  File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 916, in <module>
    impacts_extras=a.impacts_field, aok=a.a_ok)
  File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 232, in __init__
    self.load()
  File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 317, in load
    i = self._load(self.cache, create=True, start=1)
  File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 280, in _load
    self.genotype_counts[0][gt_types == self.vcf.HOM_REF] += 1
IndexError: boolean index did not match indexed array along dimension 0; dimension is 1 but corresponding boolean dimension is 2
' returned non-zero exit status 1

The gatk-haplotype vcf file and ped file are both generated by bcbio pipeline. Thus I am not sure what is wrong with my vcf file for vcf2db.

The vcf and ped file for recovering this error can be download by this link: debug files

Thanks in advance!

ahy1221 commented 6 years ago

@chapmanb, I was wondering that could you do me a favor to test my vcf files and tell me what is wrong with these files ? Since these files are totally generated by bcbio pipeline automatically, I cannot find out what is wrong in it. Thanks

ahy1221 commented 6 years ago

Hi @chapmanb , @brentp help me to find out this error is caused by D20180108 occurring twice in the ped file. After removing the repeated line of D20180108 in the ped file , the pipeline continue to work fine now. But the question now is how this wrong ped file is generated during the bcbio pipeline ? The template yaml file I feed to bcbio is shown below again:

details:
  - analysis: variant2
    genome_build: GRCh37
# In order to do paired variant calling, samples should belong to the
# same batch ("batch" under "metadata" below") and have a "phenotype"
# field stating either "normal" or "tumor". For each batch there
# should be a sample with "tumor" phenotype and a sample with "normal"
# phenotype (no more than two samples per batch)
    metadata:
        batch: your-batch-name
        phenotype: normal # or "normal"
    algorithm:
        aligner: bwa
        variantcaller: 
            somatic: [vardict, mutect2, strelka2]
            germline: [gatk-haplotype]
        ensemble:
            numpass: 2
#    for targetted projects, set the region
            variant_regions: /WPSnew/heyao/reference/exomCapture/S04380110_SureSelect_Human_All_Exon_V5_Covered.chr.bed
        svcaller: [cnvkit]
    #hlacaller: [optitype]

and the sample meta for generating the real config yaml is :

samplename,description,batch,phenotype
PE1Z_ZZM_20171109-normal-DNA,D20171109_normal,D20171109,normal
PE1Z_ZZM_20171109-tumor-DNA,D20171109_tumor,D20171109,tumor
PE1Z_ZZM_20171215-normal-DNA,D20171215_normal,D20171215,normal
PE1Z_ZZM_20171215-tumor-DNA,D20171215_tumor,D20171215,tumor
PE1Z_ZZM_20180108-normal-DNA,D20180108_normal,D20180108,normal
PE1Z_ZZM_20180108-tumor-DNA,D20180108_tumor,D20180108,tumor
PE1Z_ZZM_20180110-normal-DNA,D20180110_normal,D20180110,normal
PE1Z_ZZM_20180110-tumor-DNA,D20180110_tumor,D20180110,tumor
PE1Z_ZZM_20180116-normal-DNA,D20180116_normal,D20180116,normal
PE1Z_ZZM_20180116-tumor-DNA,D20180116_tumor,D20180116,tumor

Is it anything wrong ?

chapmanb commented 6 years ago

Yao; Thanks much for the detailed bug report and work identifying the underlying issue. I pushed a fix to bcbio to avoid double generating samples in the created ped file. They should now match the samples in the VCF file and hopefully avoid this issue. Sorry about the problem, and please let us know if you run into any other issues at all. Thanks again.

bcbio / bcbio-nextgen

vcf2db error for *germline-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz #2550