bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
991 stars 354 forks source link

mosdepth behaves abnormal. #3581

Closed jxshi closed 2 years ago

jxshi commented 2 years ago

Version info

To Reproduce Exact bcbio command you have used:

bcbio_nextgen.py ../config/hu.yaml -n 96

Your yaml configuration file:

details:
  - analysis: variant2
    genome_build: GRCh37
    # In order to do paired variant calling, samples should belong to the
    # same batch ("batch" under "metadata" below") and have a "phenotype"
    # field stating either "normal" or "tumor". For each batch there
    # should be a sample with "tumor" phenotype and a sample with "normal"
    # phenotype (no more than two samples per batch)
    metadata:
       batch: your-batch-name
       phenotype: tumor # or "normal"
    algorithm:
      #trim_reads: [fastp]
      #align_split_size: false
      aligner: minimap2
      platform: Illumina
      save_diskspace: true
      mark_duplicates: true
      remove_lcr: true
      recalibrate: false
      realign: false
      variantcaller: [vardict,varscan]
      #variantcaller: [vardict]
      tools_off:
      - gatk4
      - picard
      - mosdepth
      min_allele_fraction: 0.5
      #use_lowfreq_filter: false
      #svcaller: [lumpy, manta]
      #svprioritize: cancer/civic
      #hlacaller: optitype
      exclude_regions: [lcr]
      ensemble:
        numpass: 2
        use_filtered: true
      # for targetted projects, set the region
      # variant_regions: /path/to/your.bed

Log files (could be found in work/log) Please attach (10MB max): bcbio-nextgen-commands.log, and bcbio-nextgen-debug.log.

bcbio-nextgen-commands log

[2021-12-10T02:35Z] /home/zzusah/local/share/bcbio/anaconda/envs/htslib1.10/bin/mosdepth -t 47 /mnt/ONCOBOX/zzusah/data/results/hu/work/qc/huqiusheng/viral/huqiusheng-gdc-viral /mnt/ONCOBOX/zzusah/data/results/hu/work/qc/huqiusheng/viral/huqiusheng-gdc-viral.bam -n --thresholds 1,5,25 --by <(awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS $2}' /home/zzusah/local/share/bcbio/genomes/Hsapiens/GRCh37/viral/gdc-viral.fa.fai) && echo '## Viral sequences (from https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files) found in unmapped reads' > /mnt/ONCOBOX/zzusah/data/results/hu/work/bcbiotx/tmpy08bxy2a/huqiusheng-gdc-viral-completeness.txt &&echo '## Sample: huqiusheng' >> /mnt/ONCOBOX/zzusah/data/results/hu/work/bcbiotx/tmpy08bxy2a/huqiusheng-gdc-viral-completeness.txt && echo '#virus    size    depth   1x      5x      25x     reads   reads_pct' >> /mnt/ONCOBOX/zzusah/data/results/hu/work/bcbiotx/tmpy08bxy2a/huqiusheng-gdc-viral-completeness.txt && paste <(zcat /mnt/ONCOBOX/zzusah/data/results/hu/work/qc/huqiusheng/viral/huqiusheng-gdc-viral.regions.bed.gz) <(zgrep -v ^# /mnt/ONCOBOX/zzusah/data/results/hu/work/qc/huqiusheng/viral/huqiusheng-gdc-viral.thresholds.bed.gz) <(samtools idxstats /mnt/ONCOBOX/zzusah/data/results/hu/work/qc/huqiusheng/viral/huqiusheng-gdc-viral.bam | grep -v '*') | awk 'BEGIN {FS="\t"} { print $1 FS $3 FS $4 FS $10/$3 FS $11/$3 FS $12/$3 FS $15 FS $15/85589025}' | sort -n -r -k 5,5 >> /mnt/ONCOBOX/zzusah/data/results/hu/work/bcbiotx/tmpy08bxy2a/huqiusheng-gdc-viral-completeness.txt

bcbio-nextgen-debug.log

[2021-12-10T02:35Z] Analyse coverage of viral genomes
[2021-12-10T02:35Z] bam.nim(381)             open
[2021-12-10T02:35Z] Error: unhandled exception: invalid bgzf file [ValueError]
[2021-12-10T02:35Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/home/zzusah/local/share/bcbio/anaconda/lib/python3.7/site-packages/bcbio/provenance/do.py", line 26, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/home/zzusah/local/share/bcbio/anaconda/lib/python3.7/site-packages/bcbio/provenance/do.py", line 106, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; /home/zzusah/local/share/bcbio/anaconda/envs/htslib1.10/bin/mosdepth -t 47 /mnt/ONCOBOX/zzusah/data/results/hu/work/qc/huqiusheng/viral/huqiusheng-gdc-viral /mnt/ONCOBOX/zzusah/data/results/hu/work/qc/huqiusheng/viral/huqiusheng-gdc-viral.bam -n --thresholds 1,5,25 --by <(awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS $2}' /home/zzusah/local/share/bcbio/genomes/Hsapiens/GRCh37/viral/gdc-viral.fa.fai) && echo '## Viral sequences (from https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files) found in unmapped reads' > /mnt/ONCOBOX/zzusah/data/results/hu/work/bcbiotx/tmpy08bxy2a/huqiusheng-gdc-viral-completeness.txt &&echo '## Sample: huqiusheng' >> /mnt/ONCOBOX/zzusah/data/results/hu/work/bcbiotx/tmpy08bxy2a/huqiusheng-gdc-viral-completeness.txt && echo '#virus       size    depth   1x      5x      25x     reads   reads_pct' >> /mnt/ONCOBOX/zzusah/data/results/hu/work/bcbiotx/tmpy08bxy2a/huqiusheng-gdc-viral-completeness.txt && paste <(zcat /mnt/ONCOBOX/zzusah/data/results/hu/work/qc/huqiusheng/viral/huqiusheng-gdc-viral.regions.bed.gz) <(zgrep -v ^# /mnt/ONCOBOX/zzusah/data/results/hu/work/qc/huqiusheng/viral/huqiusheng-gdc-viral.thresholds.bed.gz) <(samtools idxstats /mnt/ONCOBOX/zzusah/data/results/hu/work/qc/huqiusheng/viral/huqiusheng-gdc-viral.bam | grep -v '*') | awk 'BEGIN {FS="\t"} { print $1 FS $3 FS $4 FS $10/$3 FS $11/$3 FS $12/$3 FS $15 FS $15/85589025}' | sort -n -r -k 5,5 >> /mnt/ONCOBOX/zzusah/data/results/hu/work/bcbiotx/tmpy08bxy2a/huqiusheng-gdc-viral-completeness.txt
bam.nim(381)             open
Error: unhandled exception: invalid bgzf file [ValueError]
' returned non-zero exit status 1.
naumenko-sa commented 2 years ago

yes, this one was giving me a hard time as well.

The very issue is well known for many years in the old samtools code: https://github.com/lh3/samtools/pull/7

But why we are hitting it here, I am not sure. In the debugger if you delete the viral.bam and re-creates just well.

I've made the samtools call safer here: https://github.com/bcbio/bcbio-nextgen/pull/3582

And it seems to be solving the issue in my bcbio run.

Let me know how it works for you (you'd need to upgrade to the latest code).

Sergey

jxshi commented 2 years ago

Hi @naumenko-sa ,

I circumvented this issue by turning off viral in the template yaml file with the following option.

tools_off:
- viral

And finally, it went through. Cheers!

Best, Jianxiang

naumenko-sa commented 2 years ago

Thanks for confirming! In my project, it finished with multicore bcbio (one-node, as opposed to ipython).

naumenko-sa commented 2 years ago

see also this fix: https://github.com/bcbio/bcbio-nextgen/pull/3587