bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 354 forks source link

bcftools report error about undefined tags in header #2372

Closed phu5ion closed 5 years ago

phu5ion commented 6 years ago

Hi,

Sorry for opening another issue. It looks like there is something wrong with either bcftools command or the header of scalpel vcf files:

Traceback (most recent call last):
  File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 23, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 103, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; vcfcat <(/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/bin/bcftools filter -m '+' -O v --soft-filter 'CHI2FILTER' -e 'INFO/CHI2 > 20.0'  /mnt/projects/ngbhs/cirqseq/2017_bcbio/20180410_2FFPE-slides/2FFPE-slides/work/mutect/chr4/TGIS-18-NE-00001-chr4_54261094_152323149-somaticIndels-scalpel-work/main/somatic.indel.vcf.gz) <(/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/bin/bcftools filter -m '+' -O v --soft-filter 'REJECT' -e '%TYPE="indel"'  /mnt/projects/ngbhs/cirqseq/2017_bcbio/20180410_2FFPE-slides/2FFPE-slides/work/mutect/chr4/TGIS-18-NE-00001-chr4_54261094_152323149-somaticIndels-scalpel-work/main/common.indel.vcf.gz) |  awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDXkmryswbvhdx]/, "N", $4) } {print}' | /mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/bin/vcfstreamsort | grep -v ^##contig | bcftools annotate -h /mnt/projects/ngbhs/cirqseq/2017_bcbio/20180410_2FFPE-slides/2FFPE-slides/work/bcbiotx/tmpLiVKUu/TGIS-18-NE-00001-chr4_54261094_152323149-somaticIndels-contig_header.txt | bgzip -c > /mnt/projects/ngbhs/cirqseq/2017_bcbio/20180410_2FFPE-slides/2FFPE-slides/work/bcbiotx/tmpLiVKUu/TGIS-18-NE-00001-chr4_54261094_152323149-somaticIndels.vcf.gz
[W::vcf_parse] FILTER 'REJECT' is not defined in the header
Encountered error, cannot proceed. Please check the error output above.
' returned non-zero exit status 255

This is what the header of the problematic file looks like:

##fileformat=VCFv4.1
##fileDate=04/18/2018
##source=scalpel0.5.3 (beta), January 25 2016
##reference=/mnt/projects/dlho/tancrc/bcbio_pipeline/genomes/Hsapiens/hg38/seq/hg38.fa
##INFO=<ID=AVGCOV,Number=1,Type=Float,Description="average k-mer coverage">
##INFO=<ID=MINCOV,Number=1,Type=Integer,Description="minimum k-mer coverage of non-reference allele">
##INFO=<ID=ALTCOV,Number=1,Type=Integer,Description="k-mer coverage of reference + any other allele (different from current non-reference) at locus">
##INFO=<ID=ZYG,Number=1,Type=String,Description="zygosity">
##INFO=<ID=COVRATIO,Number=1,Type=Float,Description="coverage ratio [(MINCOV)/(ALTCOV+MINCOV)]">
##INFO=<ID=CHI2,Number=1,Type=Float,Description="chi-square score">
##INFO=<ID=FISHERPHREDSCORE,Number=1,Type=Float,Description="phred-scaled p-value from the Fisher's exact test for tumor-normal allele counts">
##INFO=<ID=INH,Number=1,Type=String,Description="inheritance">
##INFO=<ID=BESTSTATE,Number=1,Type=String,Description="state of the mutation">
##INFO=<ID=COVSTATE,Number=1,Type=String,Description="coverage state of the mutation">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation">
##INFO=<ID=DENOVO,Number=0,Type=Flag,Description="De novo mutation">
##FILTER=<ID=MS,Description="Microsatellite mutation (format: #LEN#MOTIF)">
##FILTER=<ID=LowCovNormal,Description="low coverage in the normal (<10)">
##FILTER=<ID=HighCovNormal,Description="high coverage in the normal (>1000000000)">
##FILTER=<ID=LowCovTumor,Description="low coverage in the tumor (<4)">
##FILTER=<ID=HighCovTumor,Description="high coverage in the tumor (>1000000000)">
##FILTER=<ID=LowVafTumor,Description="low variant allele frequency in the tumor (<0.1)">
##FILTER=<ID=HighVafNormal,Description="high variant allele frequency in the normal (>1.0)">
##FILTER=<ID=LowAltCntTumor,Description="low alternative allele count in the tumor (<5)">
##FILTER=<ID=HighAltCntNormal,Description="high alternative allele count in the normal (>1000000)">
##FILTER=<ID=LowFisherScore,Description="low Fisher's exact test score for tumor-normal allele counts (<10)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="k-mer Depth">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="k-mer depth supporting reference/indel at the site">

Just some thoughts, but I was wondering whether the problem is as described in the error message, and it's because the info field "REJECT" is not defined in the header. Could anyone help pls? Thanks!

chapmanb commented 6 years ago

Sorry about the issue and thanks for the report. Would you be able to share your input YAML file to understand better how you're running scalpel? Is it possible you're trying to run it as a standalone caller, like this issue:

https://github.com/bcbio/bcbio-nextgen/issues/1586

In general we haven't done a ton of work on scalpel recently and might also be able to offer alternative suggestions for callers with an understanding of your use case. Hope this helps.

phu5ion commented 6 years ago

Hi Brad,

Sure.

details:
- algorithm:
    aligner: bwa
    mark_duplicates: true
    recalibrate: gatk
    realign: gatk
    coverage: /home/dlho/references/bedfiles/Cancer216-full-targets_hg38.bed
    variantcaller: [mutect, varscan, freebayes]
    indelcaller: scalpel
    svcaller: [lumpy, manta, cnvkit]
    svprioritize: /home/dlho/references/bedfiles/Cancer216-full-targets.list
    effects: vep
    effects_transcripts: canonical
    save_diskspace: true
    tools_off: [gemini, gatk4]
  analysis: variant2
  genome_build: hg38

I'm running scalpel as an indel caller in addition to Mutect, so as to call both snvs and indels. I'm open to using Mutect2 as an alternative since it does the same by itself too, but was wondering about its performance vs Mutect. Would appreciate any advice you have about this issue, thanks!

phu5ion commented 6 years ago

Hi,

I was wondering how's the issue on the problems with scalpel described above?

As a side, I'm running Mutect2 and not too happy with the default settings. May I know how to change them? Specifically I'll like to change this argument because Mutect2 is missing out real variants: max_alt_alleles_in_normal_count=1

Thanks!

chapmanb commented 6 years ago

Thanks for the additional details. That all looks good in terms of scalpel usage so I'm not sure why it's generating an incorrect header without the REJECT filter tag. If you'd be able to supply a small example that does this we could try to reproduce and fix.

Alternatively, I'd suggest trying to substitute in vardict and strelka2 in place of using scalpel for indel calling. Both have nice indel sensitivity for somatic calling and can avoid the issue entirely.

For MuTect2, you can tweak and add other options by specifying them in the resources (https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#sample-or-run-specific-resources)

resources:
  mutect2:
    options: []

We'd have a lot of interest in hearing out your experiences tweaking MuTect2. We've also been discussing in #2357 with regards to tumor-only calling so welcome hearing what works well for your samples. Thanks again for all the work debugging and hope this helps.

roryk commented 5 years ago

Thanks, closing this out-- there is a reasonable workaround suggested which should solve it. The Mutect2 issue is being discussed in #2357 so that will probably be the issue to talk about this stuff going forward.