bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 354 forks source link

disambiguate + variants.bed crashes when trying to overlap mm10 aligned bam and hg38.bed #3081

Closed naumenko-sa closed 4 years ago

naumenko-sa commented 4 years ago

Hi!

when running variant calling on hg38 with disambiguate on mm10:

details:
- algorithm:
    aligner: bwa
    coverage: /path/coverage.bed
    coverage_interval: regional
    disambiguate:
    - mm10
    effects_transcripts: canonical_cancer
    hlacaller: optitype
    mark_duplicates: true
    min_allele_fraction: 1
    platform: Illumina
    quality_format: Standard
    realign: false
    recalibrate: false
    sv_regions: /path/coverage.bed
    svcaller:
    - manta
    - seq2c
    - cnvkit
    svprioritize: cancer/az-cancer-panel
    tools_off:
    - gemini
    tools_on:
    - qualimap_full
    - damage_filter
    - gatk4
    umi_type: fastq_name
    variant_regions: /path/variants.bed
    variantcaller:
      germline:
      - gatk-haplotype
      somatic:
      - vardict
    vcfanno: /path/vcfanno_hg38.toml
  analysis: variant2
  description: project
  files:
  - /path/project_R1.fq.gz
  - /path/project_R2.fq.gz
  genome_build: hg38
  metadata:
    batch: project-batch
    phenotype: tumor
  resources:
    fgbio:
      options:
      - --min-reads
      - 1
    qualimap:
      memory: 16g
fc_name: bcbio
upload:
  dir: ../final

It could happen that variants.bed is not overlapping with mm10 genome.

# bwa alignment to mm10
mosdepth -t 48 -F 1804  \
--no-per-base \
--by /path/variants.bed  \
./project-rawumi \
/path/bcbio/work/align/project/mm10/project-sort.bam 

Fails with:

fatal.nim(39)            sysFatal
Error: unhandled exception: index 195776982 not in 0 .. 195471971 [IndexError]
' returned non-zero exit status 1.

We should not overlap hg38 bed with mm10 bam.

S

naumenko-sa commented 4 years ago

it happens before generating consensus for mm10:

https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/ngsalign/postalign.py#L223

get_average_coverage does not know about the two bam files for hg38 and mm10 and is trying to call and rewrite coverage based on the current work bam, which is mm10 in work/align/project/project-coverage-rawumi-stats.yaml https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/coverage.py#L154

the solution would be to skip checking coverage in the disambiguation case.

naumenko-sa commented 4 years ago

more precisely: it tried to get coverage from the file (cache): https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/coverage.py#L157

but there is a check that the cache should be newer than the bam and the bed files. in the case of disambiguation the cache is build with hg38.bam and is older than mm10.bam which has just been aligned, so it tried to call coverage again and fails.