bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

fgbio and manta incompatibilities with duplex UMIs #3662

Open kmendler opened 2 years ago

kmendler commented 2 years ago

Version info

To Reproduce

Exact bcbio command you have used:

bcbio_nextgen.py bcbio.yaml -t ipython -s slurm -n 16 -q core -p "Dev_4031" --timeout 2000 -r t=5-00:00 -r conmem=64

Your yaml configuration file (template):

---
details:
  - analysis: variant2
    genome_build: hg38
    algorithm:
      aligner: bwa
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller:
        somatic: [vardict]
        germline: [gatk-haplotype]
      svcaller: [manta, seq2c, cnvkit]
      svprioritize: cancer/az-cancer-panel
      platform: Illumina
      quality_format: Standard
      # Panel:MMNFKB
      variant_regions: /projects/ngs/reference/seqauto_bed/hg38/Panel/MMNFKB/variants.bed
      sv_regions: /projects/ngs/reference/seqauto_bed/hg38/Panel/MMNFKB/coverage.bed
      coverage: /projects/ngs/reference/seqauto_bed/hg38/Panel/MMNFKB/coverage.bed
      min_allele_fraction: 0.01
      tools_off: [gemini, contamination]
      tools_on: [qualimap_full, damage_filter, gatk4]
      effects_transcripts: canonical_cancer
      umi_type: fastq_name  # UMI:duplex_twinstrand
      trim_ends: [2, 0, 2, 0]
      use_lowfreq_filter: false
      correct_umis: /swiftcache/ngs/oncology/datasets/us_novaseq6000_2/220406_A00203_0268_AHWFMTDSX2/umi_whitelist.txt
      align_split_size: false
      coverage_interval: regional
    resources:
      fgbio:
        options: [--min-reads, 3]

Log files (could be found in work/log): bcbio-nextgen-commands.log, bcbio-nextgen-debug.log

Manta raises an error indicating that the base quality score for a read is too high. According to the fgbio docs, the quality score of a duplex consensus base is roughly twice that of a regular read when the bases on both strands agree (https://github.com/fulcrumgenomics/fgbio/wiki/Calling-Duplex-Consensus-Reads#calling-double-stranded-consensus-reads), which pushes it above manta's threshold (70). For example, in a standard 'clean' library we would see many bases called with a quality (Phred+33 encoded) of Q >= 35. The final consensus read would then have roughly twice that quality (>= 70), thus breaking manta. I'd expect this error to occur in any project whose libraries contain clean reads with duplex UMIs.
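To make the arithmetic concrete, here is a minimal Python sketch of how a Phred+33 quality string could be checked against the limit and capped below it. This is an illustration only, not something bcbio, fgbio, or manta provides: the helper names (`decode_phred33`, `encode_phred33`, `cap_qualities`) are hypothetical, the limit of 70 is simply the value quoted above, and capping at 69 is an assumption chosen to stay safely under it.

```python
# Illustrative sketch only; helper names are hypothetical, not part of any tool.
PHRED_OFFSET = 33      # standard Phred+33 ASCII offset for quality strings
MANTA_MAX_QUAL = 70    # limit quoted in this report (assumption, not read from manta)


def decode_phred33(qual):
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(c) - PHRED_OFFSET for c in qual]


def encode_phred33(scores):
    """Convert integer quality scores back into a Phred+33 string."""
    return "".join(chr(q + PHRED_OFFSET) for q in scores)


def cap_qualities(qual, max_q=MANTA_MAX_QUAL - 1):
    """Return the quality string with every score clamped to at most max_q."""
    return encode_phred33(min(q, max_q) for q in decode_phred33(qual))


# A single-strand Q35 base, and two consensus bases at Q70 and Q76.
consensus_qual = encode_phred33([35, 70, 76])
capped = cap_qualities(consensus_qual)
print(decode_phred33(consensus_qual), "->", decode_phred33(capped))
```

Applying a cap like this to the quality string of each consensus read before it reaches manta would be one conceivable workaround. It has not been tested with bcbio, and it discards the extra confidence that the duplex consensus qualities carry.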

There is no urgency in this request, as we're looking into using BAMs from the Illumina DRAGEN aligner with manta outside bcbio, but I thought I should flag it here for awareness.

gudeqing commented 1 year ago

Hi, I also encountered this issue. Is there a solution available now?

Thanks!