bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
992 stars 354 forks source link

Question: callable regions generation #3701

Closed matthdsm closed 1 year ago

matthdsm commented 1 year ago

Hi,

Could you explain how the callable regions bed is made precisely? I've tried figuring it out from the code, but I've got the feeling I'm missing something.

Thanks Matthias

naumenko-sa commented 1 year ago

Hi @matthdsm !

My understanding of callable regions from the test run and the code:

  1. Cleaning of the bed file (variant_regions: kit.bed) - remove headers and sort: https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/bedutils.py#L86

  2. merge the overlapping regions

  3. bgzip & tabix

  4. Calculate coverage with mosdepth: https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/coverage.py#L74 the output for callable regions is sample_variant_regions.quantized.bed.gz https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/coverage.py#L108

The mosdepth command:

export MOSDEPTH_Q0=NO_COVERAGE && \
export MOSDEPTH_Q1=LOW_COVERAGE && \
export MOSDEPTH_Q2=CALLABLE && \
/path/to/anaconda/envs/htslib1.10/bin/mosdepth -t 16 -F 1804 -Q 1 --no-per-base \
--by /path/to/NA12878-exome-eval/work/bedprep/cleaned-Exome-NGv3-merged.bed \
--quantize 0:1:4: /path/to/NA12878-variant_regions \
/path/to/work/align/NA12878/NA12878-prealign-dedup.bam

The thresholds for CALLABLE are being passed through quantize argument: https://github.com/brentp/mosdepth#callable-regions-example

Meaning that in bcbio:

  1. Then bcbio calculates an intersection between coverage-based mosdepth-calculated callable and variant_regions.bed: https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/coverage.py#L111 I.e. the subset of the callable.bed is returned which overlaps with variant_regions.bed. The result is quantized-vrubset.bed, which contains CALLABLE, LOW_COVERAGE, NO_COVERAGE

  2. preparing for parallelization stage: bcbio determines the blocks of the genome which could be safely called by different threads, the breakpoints are inserted in non-callable regions: https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/bam/callable.py#L71 and see more logic there on how parallelization block are built

  3. In the actual calling HaplotypeCaller is run in parallel against a small BED file, for example a block for chr:114,137,346-143,402,081 which contains many callable regions.

How the regions are created: https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/joint.py#L94

  1. Concatenation of variants https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/vcfutils.py#L390

  2. Filtration.

SN

matthdsm commented 1 year ago

Awesome, thanks for clarifying this @naumenko-sa! Magnificent breakdown.