bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Error: dkfzbiasfilter_summarize.py - Invalid header #2931

Closed FedericoComoglio closed 5 years ago

FedericoComoglio commented 5 years ago

Hi,

I temporarily solved issue #2838 by adding BCBIO_DIR to my PATH. However, while this lets dkfzbiasfilter run successfully, the process still fails when summarizing the variants with dkfzbiasfilter_summarize.py:

Traceback (most recent call last):
  File "/mnt/tools/bin/dkfzbiasfilter_summarize.py", line 156, in <module>
    main(args[0], options)
  File "/mnt/tools/bin/dkfzbiasfilter_summarize.py", line 42, in main
    if len(set(rec.filter) & damage_filters) == len(rec.filter) or rec.info.get("DKFZBias"):
  File "pysam/libcbcf.pyx", line 2659, in pysam.libcbcf.VariantRecordInfo.get
ValueError: Invalid header

How could we fix this problem? Thanks a lot once again for your help,

Federico

roryk commented 5 years ago

Thanks-- could you pass on the variant file that is failing? You should be able to see which file it is in the log/bcbio-nextgen-commands.log file.
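
The log lookup suggested above can be sketched in a few lines of Python. This is only an illustration: it assumes the commands log is plain text with one logged command per line, and `find_commands` is a hypothetical helper, not part of bcbio.

```python
from pathlib import Path


def find_commands(log_path, needle="dkfzbiasfilter_summarize.py"):
    """Return the log lines that mention `needle` (hypothetical helper)."""
    hits = []
    # errors="replace" keeps the scan going if the log has odd bytes in it
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if needle in line:
                hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Path taken from the comment above, relative to the bcbio work directory
    log = Path("log") / "bcbio-nextgen-commands.log"
    if log.exists():
        for command in find_commands(log):
            print(command)
```

The last argument of each matching command should be the VCF that the summarize script was reading when it failed.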

FedericoComoglio commented 5 years ago

Hi @roryk,

Sure. This is the call to dkfzbiasfilter_summarize.py:

<BCBIO>/dkfzbiasfilter_summarize.py --sample=<SAMPLE> --outfile=<PROJECT_DIR>/<SAMPLE>/work/tmp/tmpjspn971l/<SAMPLE>-damage.yaml <PROJECT_DIR>/<SAMPLE>/work/ensemble/<SAMPLE>/<SAMPLE>-ensemble-effects.vcf.gz

roryk commented 5 years ago

Thanks, and sorry, I meant: can you pass on the actual variant file? Or at least a piece of it that triggers the error?

FedericoComoglio commented 5 years ago

Thanks @roryk, in this case it's sensitive data and I can't share the VCF, unfortunately.

Based on the similar earlier issues #1965 and #1963, a likely explanation seems to be an invalid INFO field value. I can say that this occurs systematically for all the samples I processed in parallel.
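
When the VCF itself cannot be shared, its header usually can be inspected locally without exposing any variant data. Below is a minimal sketch, using only the standard library, that lists which INFO, FILTER and FORMAT IDs a VCF header declares. `declared_ids` is a hypothetical helper; that a missing or mismatched `DKFZBias` declaration causes the pysam error is an assumption to check, not a confirmed diagnosis.

```python
import gzip
import re

# Matches header meta lines such as: ##INFO=<ID=DP,Number=1,...>
HEADER_ID = re.compile(r"^##(INFO|FILTER|FORMAT)=<ID=([^,>]+)")


def declared_ids(vcf_path):
    """Map each header line type to the set of IDs it declares (hypothetical helper)."""
    # bgzip-compressed files are valid multi-member gzip, so gzip.open can read them
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    found = {"INFO": set(), "FILTER": set(), "FORMAT": set()}
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta lines end at the #CHROM column header
            match = HEADER_ID.match(line)
            if match:
                found[match.group(1)].add(match.group(2))
    return found
```

For example, `"DKFZBias" in declared_ids(path)["INFO"]` reports whether the key used at line 42 of the script is declared as an INFO field in that file's header.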

roryk commented 5 years ago

Understood. Could you share the YAML configuration you are using so I can try to reproduce it locally? It doesn't have to be the one with the samples; the basic template would work.

FedericoComoglio commented 5 years ago

Sure, here it is:

details:
  - analysis: variant2
    genome_build: hg38
    metadata:
       batch: #your-batch-name
       phenotype: #tumor # or "normal"
    algorithm:
      aligner: bwa
      mark_duplicates: true
      remove_lcr: true
      variant_regions: <Regions.bed> # this is a WES panel
      exclude_regions: [lcr, polyx, highdepth, altcontigs]
      min_allele_fraction: 10
      variantcaller:
        somatic: [vardict, mutect2, varscan, freebayes, strelka2]
      ensemble:
        numpass: 2
      svcaller: [manta, cnvkit]
      effects: snpeff 
      tools_on: [damage_filter]
    resources:
      tmp:
        dir: ./tmp

Thanks a lot!

chapmanb commented 5 years ago

Thanks much for the detailed report and apologies about the issue. The problem was using the ensemble output for preparing QC reports on the DNA damage estimates: we should instead be using one of the somatic callsets (VarDict, in this case). The latest development version has a fix that should avoid this and hopefully finish cleanly for you. Thank you again for the help debugging and hope this gets your analysis finished.

FedericoComoglio commented 5 years ago

Brad,

thanks a lot for looking into this. I will apply the bug fix to my current version and test it before upgrading to the latest devel.

FedericoComoglio commented 5 years ago

Hi @chapmanb and @roryk,

this works for me now, thanks! However, I am a little confused about the output.

While the VCF files returned by the individual callers in the work directory now contain flagged variants (e.g. <sample>-effects-annotated-damage-ann.vcf.gz) and these have DKFZBiasFilter fields in the header such as ##FILTER=<ID=bPcr, the VCF files in the final directory for the same run do not contain such annotations.

In addition, I can see a summary YAML file in final/<sample>/qc/damage, but no plot is returned.

How exactly is the output of the DKFZ bias filter used by bcbio? Thank you once again!

Federico
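
One way to make the work-versus-final comparison above concrete is to tally the FILTER column of each VCF. The sketch below uses only the standard library and assumes tab-delimited VCF records with FILTER in the seventh column, as in the VCF specification; `filter_counts` is a hypothetical helper, not a bcbio function.

```python
import gzip
from collections import Counter


def filter_counts(vcf_path):
    """Count how often each FILTER value appears in a VCF (hypothetical helper)."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta lines and the #CHROM column header
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # not a complete VCF record
            # Several filters on one record are separated by semicolons
            counts.update(fields[6].split(";"))
    return counts
```

Running this on a per-caller file from the work directory and on the matching file in final shows directly whether flags such as bPcr were carried over.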

chapmanb commented 5 years ago

Federico; Glad that fixed the run for you and thanks for the questions.

Thank you again and hope this helps.

FedericoComoglio commented 5 years ago

Brad,

thank you for looking into this! I systematically checked the final output files for that INFO and I couldn't find any hits. I need to dig deeper into this. Will update you soon.