bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Running analysis from GVCF level with bcbio to get updated results for expanding population of samples #2336

Closed WimSpee closed 4 years ago

WimSpee commented 6 years ago

Hi @chapmanb .

In the past we typically ran analyses with bcbio from the input level of FASTQ or BAM files. We are happy with the results and with how scalable and reliable bcbio is.

With the availability of GATK4, it is now much more efficient to get a population-level analysis result by reprocessing all existing (historical + new) samples from the GVCF level rather than from the BAM level.

Starting an analysis from the GATK4 GVCF level is currently not possible in bcbio as far as I know. It is also not obvious how the BAM-level QC files would in this case be aggregated for inclusion in the MultiQC report.

Is reprocessing from the GVCF level for an expanding population of samples functionality that you think fits bcbio and would like to cover?

Otherwise I'll have to (re)create this functionality myself outside of bcbio (which I would rather not do), so that we get the same results for our expanding population as for our new batches:

Most of this looks straightforward enough to me, if I needed to do it. I was just wondering whether any values are extracted from the squared-off VCF to fill in variables in the soft-filter function for SNPs and INDELs. This does not look to be the case, but I remember that in the past the full squared-off VCF was scanned for depth values to fill in some variables in the soft-filter functions. https://github.com/bcbio/bcbio-nextgen/blob/584bdbb8809faedf134ed2f97d66a1208559ad32/bcbio/variation/vfilter.py#L215
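To illustrate the distinction being discussed: "soft" filtering annotates failing records in the FILTER column rather than removing them. The sketch below shows the general shape of such a filter; the thresholds are hypothetical examples in the spirit of GATK hard-filter recommendations, not the exact expressions bcbio's vfilter.py applies.

```python
# Illustrative soft filter: return a FILTER value instead of dropping records.
# Thresholds are hypothetical examples, not bcbio's actual filter expressions.

def soft_filter_record(info, variant_type):
    """Return a FILTER string ("PASS" or ;-joined failure names) for a
    parsed INFO dict, using annotation-dependent thresholds."""
    fails = []
    # Missing annotations default to passing values rather than failing.
    if info.get("QD", 100.0) < 2.0:
        fails.append("lowQD")
    fs_cutoff = 60.0 if variant_type == "SNP" else 200.0
    if info.get("FS", 0.0) > fs_cutoff:
        fails.append("highFS")
    return ";".join(fails) if fails else "PASS"

print(soft_filter_record({"QD": 1.5, "FS": 10.0}, "SNP"))   # lowQD
print(soft_filter_record({"QD": 20.0, "FS": 5.0}, "SNP"))   # PASS
```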

Please let me know your thoughts on this.

Thank you.

WimSpee commented 6 years ago

Maybe it's still possible to start from the BAM level in bcbio, but have it detect whether GVCF files are already in place in some directory, and then skip the computation needed to create them?
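A hypothetical helper sketching that idea: before launching per-sample GVCF generation, check a directory for a matching `<sample>.vcf.gz` and reuse it instead of recomputing. The function names and the path convention here are assumptions for illustration, not existing bcbio APIs.

```python
# Sketch of a "skip if GVCF already exists" check (hypothetical helper,
# not part of bcbio's actual API).
import os

def find_existing_gvcf(sample, gvcf_dir):
    """Return the path to an existing per-sample GVCF, or None."""
    candidate = os.path.join(gvcf_dir, "%s.vcf.gz" % sample)
    return candidate if os.path.exists(candidate) else None

def needs_gvcf_call(sample, gvcf_dir):
    """True when no pre-computed GVCF is available for this sample."""
    return find_existing_gvcf(sample, gvcf_dir) is None
```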

chapmanb commented 6 years ago

Thanks much for the suggestion and for starting this discussion. We've gotten feedback that this would be useful (#1513 #2068) but haven't yet implemented it, so we definitely know it would be great to have. Our target would be a little less aggressive than what you suggest: take in a set of gVCFs and joint call them together with newly calculated samples as inputs. We wouldn't plan to recalculate callable regions.

If you had the ability and time to tackle this, the starting point would be to supply the gVCFs as individual samples with a vrn_file specification instead of files (https://github.com/bcbio/bcbio-nextgen/blob/master/tests/data/automated/run_info-bam.yaml#L7). Then, by batching these with the samples of interest to joint call, we'd need the joint-calling machinery to pick them up and include them during the GenomicsDBImport and GenotypeGVCFs stages.

I'm happy to provide more specific pointers and help to drive this along if it sounds worthwhile. Thanks again.

WimSpee commented 6 years ago

Hi @chapmanb, thank you very much for your answer. Specifying the existing GVCF files using the vrn_file field works. The only thing missing is that soft filtering is not run on the squared-off multi-sample VCF file.

This makes the genotype concordance look low compared to the same samples variant-called from the FASTQ or BAM level. Could soft filtering also be run by default (or optionally) for variant calling analyses that start exclusively from GVCF files?

I did also start the analysis with the existing BAM files, to get the mapping and variant calling MultiQC report at the end, and to get the regions to parallelize the GVCF merge on.

I did not try to run a mixed analysis with some samples starting from FASTQ and some from GVCF. We like to keep those separated, i.e. a batch analysis from FASTQ for new incoming samples, followed by a distinct new analysis from GVCF for the new total set of samples in the expanded population.

Below is exactly what I did.

Sample CSV table

samplename,description,batch,phenotype,sex,variant_regions,lane,coverage_interval,quality_format
DA_1052_01,VSDA_1052_01,DA_1052,,,,DA_1052_01,genome,
DA_1052_02,VSDA_1052_02,DA_1052,,,,DA_1052_02,genome,
DA_1052_03,VSDA_1052_03,DA_1052,,,,DA_1052_03,genome,

Template YAML file; note that I set aligner: false and variantcaller: false and jointcaller: gatk-haplotype-joint

# Template for whole genome Illumina variant calling with GATK4 joint
# genotyping, without post-alignment BAM pre-processing
# (recalibration and realignment)
---
details:
  - analysis: variant2
    genome_build: my_reference
    description:
    # to do multi-sample variant calling, assign samples the same metadata / batch
    metadata:
      batch: DA-1052
    algorithm:
      aligner: false
      mark_duplicates: false
      recalibrate: false
      realign: false
      variantcaller: false
      jointcaller: gatk-haplotype-joint
      ploidy:
          default: 2
      nomap_split_targets: 1000
      effects: false
      tools_off:
      - gemini
      # for targeted projects, set the region
      # variant_regions: /path/to/your.bed

I manually added the vrn_file field with the path to the existing GVCF file for every sample. I did not yet try to put a vrn_file column with the GVCF paths in the sample CSV table.

I hope that also works, so I can generate the full YAML below from the combination of the sample CSV file and the template YAML.

details:
- algorithm:
    aligner: false
    coverage_interval: genome
    effects: false
    jointcaller: gatk-haplotype-joint
    mark_duplicates: false
    nomap_split_targets: 1000
    ploidy:
      default: 2
    realign: false
    recalibrate: false
    tools_off:
    - gemini
    variantcaller: false
  analysis: variant2
  description: DA_1052_01
  files:
  - input_bam/DA_1052_01-ready.bam
  genome_build: my_reference
  lane: DA_1052_01
  vrn_file: input_gvcf/DA_1052_01.vcf.gz
  metadata:
    batch: DA_1052
- algorithm:
    aligner: false
    coverage_interval: genome
    effects: false
    jointcaller: gatk-haplotype-joint
    mark_duplicates: false
    nomap_split_targets: 1000
    ploidy:
      default: 2
    realign: false
    recalibrate: false
    tools_off:
    - gemini
    variantcaller: false
  analysis: variant2
  description: DA_1052_02
  files:
  - input_bam/DA_1052_02-ready.bam
  genome_build: my_reference
  lane: DA_1052_02
  vrn_file: input_gvcf/DA_1052_02.vcf.gz
  metadata:
    batch: DA_1052
- algorithm:
    aligner: false
    coverage_interval: genome
    effects: false
    jointcaller: gatk-haplotype-joint
    mark_duplicates: false
    nomap_split_targets: 1000
    ploidy:
      default: 2
    realign: false
    recalibrate: false
    tools_off:
    - gemini
    variantcaller: false
  analysis: variant2
  description: DA_1052_03
  files:
  - input_bam/DA_1052_03-ready.bam
  genome_build: my_reference
  lane: DA_1052_03
  vrn_file: input_gvcf/DA_1052_03.vcf.gz
  metadata:
    batch: DA_1052
fc_name: DA_1052_bam_table
upload:
  dir: ../final

This analysis indeed skipped the variant calling step and just squared off the existing GVCF files, with the callable regions recalculated from the existing BAM files. I also got the mapping and variant calling (BAM- and VCF-based) MultiQC report. This is what I wanted to have.

So the only thing missing is that soft filtering did not run. I noticed this from the genotype concordance results created with Picard, comparing against the same samples run from the FASTQ level with bcbio.

As you can see, the sensitivity of the samples processed from GVCF is 1, but the PPV is only 0.92. This is because soft filtering did not run.

VARIANT_TYPE    TRUTH_SAMPLE    CALL_SAMPLE HET_SENSITIVITY HET_PPV HET_SPECIFICITY HOMVAR_SENSITIVITY  HOMVAR_PPV  HOMVAR_SPECIFICITY  VAR_SENSITIVITY VAR_PPV VAR_SPECIFICITY GENOTYPE_CONCORDANCE    NON_REF_GENOTYPE_CONCORDANCE
SNP DA_1052_01  DA_1052_01  1.00    0.92    ?   1.00    0.99    ?   1.00    0.97    0.91    0.81    0.81
SNP DA_1052_02  DA_1052_02  1.00    0.92    ?   1.00    0.99    ?   1.00    0.96    0.92    0.75    0.75
SNP DA_1052_03  DA_1052_03  1.00    0.90    ?   1.00    0.99    ?   1.00    0.97    0.89    0.87    0.87

See also the detailed genotype concordance stats for the first sample. These show that there are a lot of VC_FILTERED HET_REF_VAR1 and MISSING HET_REF_VAR1 discordant genotypes.

VARIANT_TYPE    TRUTH_SAMPLE    CALL_SAMPLE TRUTH_STATE CALL_STATE  COUNT   CONTINGENCY_VALUES
SNP DA_1052_01  DA_1052_01  HOM_VAR1    HOM_VAR1    2076047 TP
SNP DA_1052_01  DA_1052_01  HET_REF_VAR1    HET_REF_VAR1    846153  TP,TN
SNP DA_1052_01  DA_1052_01  VC_FILTERED HET_REF_VAR1    664864  EMPTY
SNP DA_1052_01  DA_1052_01  MISSING HET_REF_VAR1    66254   FP,TN
SNP DA_1052_01  DA_1052_01  VC_FILTERED HOM_VAR1    37459   EMPTY
SNP DA_1052_01  DA_1052_01  MISSING HOM_VAR1    17872   FP
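As a sanity check, the reported VAR_PPV for the first sample can be roughly reproduced from the contingency counts above as PPV = TP / (TP + FP), counting the rows tagged TP as true positives and those tagged FP as false positives (rows tagged EMPTY do not enter the calculation). This is my reading of Picard's contingency labels, so treat it as an approximation.

```python
# Rough reproduction of VAR_PPV for DA_1052_01 from the contingency table:
# PPV = TP / (TP + FP); EMPTY (VC_FILTERED) rows are left out.
tp = 2076047 + 846153   # HOM_VAR1 and HET_REF_VAR1 rows tagged TP
fp = 66254 + 17872      # MISSING rows tagged FP
ppv = tp / float(tp + fp)
print(round(ppv, 2))    # ~0.97, close to the reported VAR_PPV
```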

Thank you.

WimSpee commented 6 years ago

Never mind my earlier request here to update the script to create the full YAML. We can create the full YAML ourselves for starting the analysis from GVCF.

It's just enabling the soft filtering to run that I would need help with.

Thank you.

chapmanb commented 6 years ago

Thanks much for testing this out and for the feedback. This made better progress than I'd expected, and I'm happy that it's almost doing what you need. I'll put adding soft filtering, and improving template support for these (with less urgency), on the to-do list. I appreciate the feedback and suggestions and will update when we have a chance to work on these.

WimSpee commented 6 years ago

Hi Brad, thank you for putting soft filtering for analyses starting from GVCF on the to-do list. I look forward to that functionality being in bcbio.

I am already very happy that analysis from GVCF mostly works via bcbio. :)

I can wait for the soft filtering, or if needed just run it as an extra manual step after bcbio for now.

Thank you again also for the bcbio software in general.