Closed — WimSpee closed this issue 4 years ago
Maybe it's still possible for bcbio to start from the BAM level, but detect when GVCF files are already in place in some directory and then skip the computation needed to create them?
Thanks much for the suggestion and for starting this discussion. We've gotten feedback that this would be useful (#1513 #2068) but haven't yet implemented it, so we definitely know it would be great to have. Our target would be a little less aggressive than what you suggest: take in a set of gVCFs and joint call them together with the newly calculated samples as inputs. We wouldn't plan to target recalculating callable regions.
If you had the ability and time to tackle this, the starting point would be to supply the gVCFs as individual samples with a vrn_file specification instead of files (https://github.com/bcbio/bcbio-nextgen/blob/master/tests/data/automated/run_info-bam.yaml#L7). Then, by batching these with the samples of interest to joint call, we'd need the joint calling machinery to pick these up and include them during the GenomicsDBImport and GenotypeGVCFs stages.
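For reference, those two GATK4 stages boil down to command lines along these lines. This is just a hedged Python sketch with illustrative paths, sample names, and workspace location, not bcbio's actual internals:

```python
# Sketch of the two GATK4 joint-calling steps the machinery would need to
# feed pre-existing gVCFs into. All file paths and the workspace name are
# illustrative assumptions, not taken from bcbio's implementation.

def genomicsdb_import_cmd(gvcfs, workspace, region):
    """Build a GenomicsDBImport command consolidating per-sample gVCFs."""
    cmd = ["gatk", "GenomicsDBImport",
           "--genomicsdb-workspace-path", workspace,
           "-L", region]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    return cmd

def genotype_gvcfs_cmd(workspace, reference, out_vcf):
    """Build a GenotypeGVCFs command joint-calling from the workspace."""
    return ["gatk", "GenotypeGVCFs",
            "-R", reference,
            "-V", "gendb://" + workspace,
            "-O", out_vcf]

# Existing (historical) gVCFs and a newly calculated one go in together.
gvcfs = ["existing_gvcf/sample_old.g.vcf.gz", "new_gvcf/sample_new.g.vcf.gz"]
print(" ".join(genomicsdb_import_cmd(gvcfs, "genomicsdb_ws", "chr1")))
print(" ".join(genotype_gvcfs_cmd("genomicsdb_ws", "ref.fa", "joint.vcf.gz")))
```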
I'm happy to provide more specific pointers and help to drive this along if it sounds worthwhile. Thanks again.
Hi @chapmanb, Thank you very much for your answer. Specifying the existing GVCF files using the vrn_file field works. The only thing missing is that soft filtering is not run on the squared-off multi-sample VCF file. This makes the genotype concordance look low compared to the same samples variant called from the FASTQ or BAM level. Could soft filtering also be run by default (or optionally) for variant calling analyses that start exclusively from GVCF files?
I also started the analysis with the existing BAM files, both to get the mapping and variant calling MultiQC report at the end and to get the regions on which to parallelize the GVCF merge. I did not try to run a mixed analysis with some samples starting from FASTQ and some from GVCF. We prefer to keep those separate: a batch analysis from FASTQ for new incoming samples, and subsequently a distinct new analysis from GVCF for the new total set of samples in the expanded population.
Below is exactly what I did.
Sample CSV table
samplename,description,batch,phenotype,sex,variant_regions,lane,coverage_interval,quality_format
DA_1052_01,VSDA_1052_01,DA_1052,,,,DA_1052_01,genome,
DA_1052_02,VSDA_1052_02,DA_1052,,,,DA_1052_02,genome,
DA_1052_03,VSDA_1052_03,DA_1052,,,,DA_1052_03,genome,
Template YAML file; note that I set aligner: false, variantcaller: false and jointcaller: gatk-haplotype-joint.
# Template for whole genome Illumina variant calling with FreeBayes
# This is a GATK-free pipeline without post-alignment BAM pre-processing
# (recalibration and realignment)
---
details:
  - analysis: variant2
    genome_build: my_reference
    description:
    # to do multi-sample variant calling, assign samples the same metadata / batch
    metadata:
      batch: DA-1052
    algorithm:
      aligner: false
      mark_duplicates: false
      recalibrate: false
      realign: false
      variantcaller: false
      jointcaller: gatk-haplotype-joint
      ploidy:
        default: 2
      nomap_split_targets: 1000
      effects: false
      tools_off:
        - gemini
      # for targeted projects, set the region
      # variant_regions: /path/to/your.bed
I manually added the vrn_file field with the path to the existing GVCF file for every sample. I did not yet try to put a vrn_file column with the GVCF paths in the sample CSV table. I hope that also works, so I can generate the full YAML below from the combination of the sample CSV file and the template YAML.
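If the vrn_file column does work, generating the per-sample entries programmatically should be straightforward. Below is a minimal Python sketch with an illustrative two-sample table; the field names follow the configs in this thread, but this is my own sketch, not an official bcbio template script:

```python
# Sketch: derive the per-sample "details" entries of the full bcbio YAML
# from a sample CSV (assuming a vrn_file column is supported) plus a shared
# algorithm template. The CSV content here is illustrative.
import csv
import io

ALGORITHM = {
    "aligner": False, "mark_duplicates": False, "recalibrate": False,
    "realign": False, "variantcaller": False,
    "jointcaller": "gatk-haplotype-joint",
    "ploidy": {"default": 2}, "nomap_split_targets": 1000,
    "effects": False, "tools_off": ["gemini"],
}

SAMPLE_CSV = """samplename,description,batch,lane,coverage_interval,vrn_file
DA_1052_01,DA_1052_01,DA_1052,DA_1052_01,genome,input_gvcf/DA_1052_01.vcf.gz
DA_1052_02,DA_1052_02,DA_1052,DA_1052_02,genome,input_gvcf/DA_1052_02.vcf.gz
"""

def build_config(csv_text):
    """Build the full run config dict from the sample table."""
    details = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        details.append({
            "analysis": "variant2",
            "genome_build": "my_reference",
            "description": row["description"],
            "lane": row["lane"],
            "files": ["input_bam/%s-ready.bam" % row["samplename"]],
            "vrn_file": row["vrn_file"],
            "metadata": {"batch": row["batch"]},
            "algorithm": dict(ALGORITHM,
                              coverage_interval=row["coverage_interval"]),
        })
    return {"details": details,
            "fc_name": "DA_1052_bam_table",
            "upload": {"dir": "../final"}}

config = build_config(SAMPLE_CSV)
print(config["details"][0]["vrn_file"])  # -> input_gvcf/DA_1052_01.vcf.gz
```

Dumping the resulting dict with a YAML library would then give the full YAML shown below.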
details:
  - algorithm:
      aligner: false
      coverage_interval: genome
      effects: false
      jointcaller: gatk-haplotype-joint
      mark_duplicates: false
      nomap_split_targets: 1000
      ploidy:
        default: 2
      realign: false
      recalibrate: false
      tools_off:
        - gemini
      variantcaller: false
    analysis: variant2
    description: DA_1052_01
    files:
      - input_bam/DA_1052_01-ready.bam
    genome_build: my_reference
    lane: DA_1052_01
    vrn_file: input_gvcf/DA_1052_01.vcf.gz
    metadata:
      batch: DA_1052
  - algorithm:
      aligner: false
      coverage_interval: genome
      effects: false
      jointcaller: gatk-haplotype-joint
      mark_duplicates: false
      nomap_split_targets: 1000
      ploidy:
        default: 2
      realign: false
      recalibrate: false
      tools_off:
        - gemini
      variantcaller: false
    analysis: variant2
    description: DA_1052_02
    files:
      - input_bam/DA_1052_02-ready.bam
    genome_build: my_reference
    lane: DA_1052_02
    vrn_file: input_gvcf/DA_1052_02.vcf.gz
    metadata:
      batch: DA_1052
  - algorithm:
      aligner: false
      coverage_interval: genome
      effects: false
      jointcaller: gatk-haplotype-joint
      mark_duplicates: false
      nomap_split_targets: 1000
      ploidy:
        default: 2
      realign: false
      recalibrate: false
      tools_off:
        - gemini
      variantcaller: false
    analysis: variant2
    description: DA_1052_03
    files:
      - input_bam/DA_1052_03-ready.bam
    genome_build: my_reference
    lane: DA_1052_03
    vrn_file: input_gvcf/DA_1052_03.vcf.gz
    metadata:
      batch: DA_1052
fc_name: DA_1052_bam_table
upload:
  dir: ../final
This analysis indeed skipped the variant calling step and just squared off the existing GVCF files, with the callable regions recalculated from the existing BAM files. I also got the mapping and variant calling (BAM and VCF based) MultiQC report. This is what I wanted to have.
So the only thing missing is that soft filtering did not run. This is visible in the genotype concordance results created with Picard, where I compared against the same samples run from the FASTQ level with bcbio.
As you can see, the sensitivity of the samples processed from GVCF is 1.00, but the PPV is only 0.92. This is because the soft filtering did not run.
VARIANT_TYPE TRUTH_SAMPLE CALL_SAMPLE HET_SENSITIVITY HET_PPV HET_SPECIFICITY HOMVAR_SENSITIVITY HOMVAR_PPV HOMVAR_SPECIFICITY VAR_SENSITIVITY VAR_PPV VAR_SPECIFICITY GENOTYPE_CONCORDANCE NON_REF_GENOTYPE_CONCORDANCE
SNP DA_1052_01 DA_1052_01 1.00 0.92 ? 1.00 0.99 ? 1.00 0.97 0.91 0.81 0.81
SNP DA_1052_02 DA_1052_02 1.00 0.92 ? 1.00 0.99 ? 1.00 0.96 0.92 0.75 0.75
SNP DA_1052_03 DA_1052_03 1.00 0.90 ? 1.00 0.99 ? 1.00 0.97 0.89 0.87 0.87
See also the detailed genotype concordance stats for the first sample. These show that there are a lot of VC_FILTERED HET_REF_VAR1 and MISSING HET_REF_VAR1 discordant genotypes.
VARIANT_TYPE TRUTH_SAMPLE CALL_SAMPLE TRUTH_STATE CALL_STATE COUNT CONTINGENCY_VALUES
SNP DA_1052_01 DA_1052_01 HOM_VAR1 HOM_VAR1 2076047 TP
SNP DA_1052_01 DA_1052_01 HET_REF_VAR1 HET_REF_VAR1 846153 TP,TN
SNP DA_1052_01 DA_1052_01 VC_FILTERED HET_REF_VAR1 664864 EMPTY
SNP DA_1052_01 DA_1052_01 MISSING HET_REF_VAR1 66254 FP,TN
SNP DA_1052_01 DA_1052_01 VC_FILTERED HOM_VAR1 37459 EMPTY
SNP DA_1052_01 DA_1052_01 MISSING HOM_VAR1 17872 FP
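For reproducibility, the Picard comparison can be assembled like this; the column headers above match Picard's GenotypeConcordance output, and the file names in this sketch are illustrative, not my actual paths:

```python
# Sketch of a Picard GenotypeConcordance invocation comparing the
# GVCF-level run (call) against the FASTQ-level run (truth) for one
# sample. File paths are illustrative placeholders.

def genotype_concordance_cmd(truth_vcf, call_vcf, sample, out_prefix):
    """Build a Picard GenotypeConcordance command for one sample pair."""
    return ["picard", "GenotypeConcordance",
            "TRUTH_VCF=" + truth_vcf,
            "CALL_VCF=" + call_vcf,
            "TRUTH_SAMPLE=" + sample,
            "CALL_SAMPLE=" + sample,
            "O=" + out_prefix]

print(" ".join(genotype_concordance_cmd(
    "from_fastq/DA_1052_01.vcf.gz", "from_gvcf/DA_1052_01.vcf.gz",
    "DA_1052_01", "concordance/DA_1052_01")))
```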
Thank you.
Never mind my earlier request to update the script that creates the full YAML. We can create the full YAML ourselves for starting the analysis from GVCF.
It's just enabling the soft filtering to run that I would need help with.
Thank you.
Thanks much for testing this out and for the feedback. This made better progress than I'd expected, and I'm happy that it's almost doing what you need. I'll put adding in soft filtering, and improving template support for these (with less urgency), on the to-do list. I appreciate the feedback and suggestions and will update when we have a chance to work on these.
Hi Brad, Thank you for putting soft filtering for analyses starting from GVCF on the to-do list. I look forward to that functionality being in bcbio.
I am already very happy that analysis from GVCF mostly works via bcbio. :)
I can wait for the soft filtering, or if needed just run it as an extra manual step after bcbio for now.
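As an interim manual step, one option is GATK's standard hard filters applied as soft (annotate-only) filters with VariantFiltration. The thresholds below are the generic GATK SNP suggestions, which may differ from what bcbio applies, so treat this as a sketch only:

```python
# Sketch of a manual soft-filtering step on the squared-off VCF using
# GATK VariantFiltration. Thresholds are the generic GATK SNP filter
# suggestions, assumed here rather than taken from bcbio's vfilter.py.

def variant_filtration_cmd(in_vcf, out_vcf, reference):
    """Build a VariantFiltration command that annotates (not removes)
    failing SNP records in the FILTER column."""
    filters = [("QD2", "QD < 2.0"),
               ("FS60", "FS > 60.0"),
               ("MQ40", "MQ < 40.0"),
               ("MQRankSum-12.5", "MQRankSum < -12.5"),
               ("ReadPosRankSum-8", "ReadPosRankSum < -8.0")]
    cmd = ["gatk", "VariantFiltration",
           "-R", reference, "-V", in_vcf, "-O", out_vcf]
    for name, expr in filters:
        cmd += ["--filter-name", name, "--filter-expression", expr]
    return cmd

print(" ".join(variant_filtration_cmd(
    "joint/DA_1052-joint.vcf.gz", "joint/DA_1052-joint-filtered.vcf.gz",
    "my_reference.fa")))
```

Records failing an expression get the filter name in the FILTER column but stay in the file, which is what bcbio's soft filtering also does.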
Thank you again also for the bcbio software in general.
Hi @chapmanb .
In the past we typically ran analyses with bcbio from the FASTQ or BAM input level. We are happy with the results and with how scalable and reliable bcbio is.
With the availability of GATK4, it is now much more efficient to get a population-level analysis result by reprocessing all existing (historical + new) samples from the GVCF level rather than from the BAM level.
Starting an analysis from the GATK4 GVCF level is currently not possible in bcbio as far as I know. It is also not obvious how the BAM-level QC files would in that case be aggregated for inclusion in the MultiQC report.
Is reprocessing from GVCF level for expanding populations of samples functionality that you think fits with bcbio and would like to cover with bcbio?
Otherwise I'll have to (re)create this functionality myself outside of bcbio (which I would rather not do), so that we get the same results for our expanding population as for our new batches.
Most of this looks straightforward enough to me, if I would need to do this. I was just wondering whether any values are extracted from the squared-off VCF to fill in variables in the soft filter functions for SNPs and INDELs. This does not look to be the case, but I remember that in the past the full squared-off VCF was scanned for depth values to fill in some variables in the soft filter functions. https://github.com/bcbio/bcbio-nextgen/blob/584bdbb8809faedf134ed2f97d66a1208559ad32/bcbio/variation/vfilter.py#L215
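To illustrate the kind of depth scan I mean, here is a small self-contained Python sketch that collects DP values from a VCF's INFO field and derives a cutoff. It mirrors the idea only, not vfilter.py's actual logic, and the inline VCF is made up:

```python
# Sketch: scan a (squared-off) VCF for INFO DP values and derive a
# depth-based filter cutoff from the median. The VCF content and the
# low_frac factor are illustrative assumptions.
import statistics

VCF_LINES = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t100\t.\tA\tG\t50\t.\tDP=30;QD=12.0
chr1\t200\t.\tC\tT\t40\t.\tDP=28;QD=3.1
chr1\t300\t.\tG\tA\t60\t.\tDP=35;QD=20.5
""".splitlines()

def depth_cutoffs(lines, low_frac=0.25):
    """Collect INFO DP values and return (median depth, low-depth cutoff)."""
    depths = []
    for line in lines:
        if line.startswith("#"):
            continue
        info = line.split("\t")[7]
        for field in info.split(";"):
            if field.startswith("DP="):
                depths.append(int(field.split("=")[1]))
    med = statistics.median(depths)
    return med, med * low_frac

print(depth_cutoffs(VCF_LINES))  # depths 30, 28, 35 -> (30, 7.5)
```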
Please let me know your thoughts on this.
Thank you.