bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
992 stars 354 forks source link

One sample's bed is used for bcftools filter of joint-called variants? #2401

Closed hliang closed 6 years ago

hliang commented 6 years ago

I'm running joint variant calling for a batch of 80 exome samples, using gatk-haplotypecaller. The yaml file looks like below, as you see, no target region is specified.

- algorithm:
    align_split_size: false
    aligner: bwa
    effects: snpeff
    mark_duplicates: true
    realign: false
    recalibrate: false
    tools_off:
    - gemini
    variantcaller: gatk-haplotype
  analysis: variant2
  description: Exom0001
  files:
  - /path/to/input/Exom0001_R1.fq.gz
  - /path/to/input/Exom0001_R2.fq.gz
  genome_build: hg19
  lane: Exom0001
  metadata:
    batch: batchX
    sex: unknown

At post-processing stage, some messages in bcbio-nextgen.log catch my eye:

...
[2018-05-18T01:40Z] cn7012: Timing: variant post-processing
[2018-05-18T01:40Z] cn7012: ipython: postprocess_variants
[2018-05-18T01:40Z] cn7017: Finalizing variant calls: Exom1234, gatk-haplotype
[2018-05-18T01:40Z] cn7017: Calculating variation effects for Exom1234, gatk-haplotype
[2018-05-18T02:13Z] cn7017: Annotate VCF file: Exom1234, gatk-haplotype
[2018-05-18T02:20Z] cn7017: Filtering for Exom1234, gatk-haplotype
[2018-05-18T02:27Z] cn7017: Prioritization for Exom1234, gatk-haplotype
[2018-05-18T02:26Z] cn7012: ipython: split_variants_by_sample
...

The commands.log shows that bcftools filter is run with -T option using a bed file from one sample (not the first or last sample, more like a random one).

[2018-05-18T01:40Z] cn7017: unset JAVA_HOME && export PATH=/path/to/bcbio/anaconda/bin:$PATH &&  /path/to/bcbio/anaconda/bin/snpEff -Xms750m -Xmx33g -Djava.io.tmpdir=/dev/shm/tmp1psc6i/tmp eff -dataDir /path/to/bcbio/genomes/Hsapiens/hg19/snpeff -hgvs -noLog -i vcf -o vcf -csvStats /path/to/prj/work/gatk-haplotype/batchX-effects-stats.csv -s /path/to/prj/work/gatk-haplotype/batchX-effects-stats.html GRCh37.75 /path/to/prj/work/gatk-haplotype/batchX.vcf.gz | /path/to/bcbio/anaconda/bin/bgzip --threads 12 -c > /dev/shm/tmp1psc6i/batchX-effects.vcf.gz
[2018-05-18T02:12Z] cn7017: /path/to/bcbio/anaconda/bin/tabix -f -p vcf /dev/shm/tmpfbvYAU/batchX-effects.vcf.gz
[2018-05-18T02:13Z] cn7017: vcfanno -p 12 /path/to/prj/work/gatk-haplotype/dbsnp.conf /path/to/prj/work/gatk-haplotype/batchX-effects.vcf.gz |  bgzip -c > /dev/shm/tmpo1P_Zz/batchX-effects-annotated.vcf.gz
[2018-05-18T02:18Z] cn7017: /path/to/bcbio/anaconda/bin/tabix -f -p vcf /dev/shm/tmp4lp9uP/batchX-effects-annotated.vcf.gz
[2018-05-18T02:25Z] cn7017: /path/to/bcbio/anaconda/bin/bcftools filter -O v -T /path/to/prj/work/bedprep/cleaned-Exom1234-sort-callable_sample.bed.gz --soft-filter 'GATKCutoffSNP' -e 'TYPE="snp" && (MQRankSum < -12.5 || ReadPosRankSum < -8.0 || QD < 2.0 || FS > 60.0 || (QD < 10.0 && AD[1] / (AD[1] + AD[0]) < 0.25 && ReadPosRankSum < 0.0) || MQ < 30.0)' -m '+' /path/to/prj/work/gatk-haplotype/batchX-effects-annotated.vcf.gz | sed 's/\\"//g' | bgzip -c > /dev/shm/tmpZgS1U5/batchX-effects-annotated-filterSNP.vcf.gz
[2018-05-18T02:27Z] cn7017: /path/to/bcbio/anaconda/bin/tabix -f -p vcf /dev/shm/tmpRZZrrF/batchX-effects-annotated-filterSNP.vcf.gz
[2018-05-18T02:27Z] cn7017: /path/to/bcbio/anaconda/bin/bcftools filter -O v -T /path/to/prj/work/bedprep/cleaned-Exom1234-sort-callable_sample.bed.gz --soft-filter 'GATKCutoffIndel' -e 'TYPE="indel" && (ReadPosRankSum < -20.0 || QD < 2.0 || FS > 200.0 || SOR > 10.0 || (QD < 10.0 && AD[1] / (AD[1] + AD[0]) < 0.25 && ReadPosRankSum < 0.0))' -m '+' /path/to/prj/work/gatk-haplotype/batchX-effects-annotated-filterSNP.vcf.gz | sed 's/\\"//g' | bgzip -c > /dev/shm/tmpxfWbyw/batchX-effects-annotated-filterSNP-filterINDEL.vcf.gz
[2018-05-18T02:27Z] cn7017: /path/to/bcbio/anaconda/bin/tabix -f -p vcf /dev/shm/tmpk0Y9t2/batchX-effects-annotated-filterSNP-filterINDEL.vcf.gz

This behavior would result in some variants being absent in the result. When a poorly covered sample is selected for target region, variants in well covered samples might be filtered and excluded (if these sites are not covered in the poorly covered sample).

chapmanb commented 6 years ago

Thanks much for the report and apologies about the issue. You're right that the proper behavior should be to avoid an additional regional filter here if variant_regions was not specified in the configuration. The logic bcbio uses is to joint/batch call over the regions with the most coverage:

https://github.com/bcbio/bcbio-nextgen/blob/b76158ebd055c22e2be9a074555a2426713acb79/bcbio/variation/bedutils.py#L171

but there is no need to apply extra selection afterwards. If you update to the latest development version:

bcbio_nextgen.py upgrade -u development

it should now do the right thing here.

More generally, if this is a targeted experiment you should specify the target regions with variant_regions in the input. This avoids calling in off-target regions, which is generally noise, and also avoids the necessarily imperfect logic of bcbio trying to figure out what set of regions over multiple samples you really want to call on.

Thanks again for the report and hope this helps.

hliang commented 6 years ago

Great. It works. Thank you for the fix.