bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 353 forks source link

Feature request: allow pre-defined & pre-calculated callable regions in input yaml file #3380

Open WimSpee opened 4 years ago

WimSpee commented 4 years ago

Version info

To Reproduce Exact bcbio command you have used:

bcbio_nextgen.py analysis.yaml -t ipython  -n 101 -s sge  -q main.regular -r vf=5G,h_rss=50G

Your sample configuration file: Repeated for many samples

details:
- algorithm:
    aligner: false
    coverage_interval: genome
    effects: false
    jointcaller: gatk-haplotype-joint
    mark_duplicates: false   
    ploidy:
      default: 2
    realign: false
    recalibrate: false
    tools_off:
    - gemini
    variantcaller: false
  analysis: variant2
  description: sample_X 
  callable_regions: {REFERENCE_GENOME_CALLABLE_BED}
  genome_build: {REFERENCE_GENOME}
  lane: sample_X
  metadata:
    batch: DA_1234
  vrn_file: sample_X_input-gatk-haplotype.vcf.gz

Observed behavior Error message or bcbio output:

ValueError: Unexpected configuration keyword in 'algorithm' section: ['callable_regions']

Expected behavior The callable bed file to be picked up and used for joint analysis / GVCF square off.

Adding callable_regions to ALGORITHM_KEYS (line 543) results in the expected behavior https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/pipeline/run_info.py

Additional context It is computational expensive to re-calculate callable_regions for recurrent incremental joint analysis of many GVCF files. Allowing to specify a pre-calculated callable_regions bed file to be defined removes the need to specify input bam files for joint analysis, and also removes the computational cost and time. While at the same time allowing more callable areas to be analyzed in parallel than the current per chromosome default for this scenario in bcbio.

The pre-calculated callable_regions bed file can be re-used from a previous bcbio (joint) analysis. Or calculated from the reference genome via for for example via picard ScatterIntervalsByN:

picard ScatterIntervalsByN R={REFERENCE_GENOME}  O={REFERENCE_GENOME_INTERVALS} N=10000 OT=ACGT

An easy conversion from Picard interval format to bed file is still needed when using picard.

roryk commented 4 years ago

Hi @WimSpee, sorry for the delay in getting back to you, I've been trying to fix our tests and docker builds before making a bunch of changes.

Thanks, this is a good idea. Does this actually use your file correctly with your tweak? Looking at the code if it works it will only work for joint calling but it would be good to know if it at least works there.

WimSpee commented 4 years ago

Hi Rory, no worries. Thank you for having a look.

In my test the pre-defined callable_regions bed file was picked up and used by bcbio for running the joint variant calling in parallel.

We only set this pre-defined callable_regions bed file for joint analysis. For analyzing new samples from fastq to bam&gvcf we use the default nomap_split_targets functionality in bcbio.

I don't think there is much risk in adding callable_regions to ALGORITHM_KEYS. Except for people defining a callable_regions bed file that does not makes sense, but this could be pointed out in the documentation.

We will run gatk/picard GenotypeConcordance to confirm that the results produced via this way are identical for samples that were already in a previous joint variant calling.