dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

Using GLNexus to merge GATK gVCFs #216

Open edg1983 opened 4 years ago

edg1983 commented 4 years ago

Hi, I'm trying to use GLnexus v.1.2.6 to merge g.vcf files generated by the GATK HaplotypeCaller, using the gatk configuration pre-set. For my project, I'm merging WGS calls from ~300 individuals sequenced at 30-50X mean coverage. I have a couple of questions about the merging process:

  1. What exactly is the difference between the gatk and gatk_unfiltered configuration pre-sets, and when do you advise using one or the other?
  2. The VCF files produced by GLNexus using gatk preset only contain AF and AQ variant annotations, with AQ equal to QUAL value. Is there a way to output part of the statistics usually contained in the VCF file merged using the standard GATK workflow?
  3. Doing QC, I've noticed that the GLnexus merged VCF and the one produced by GATK GenotypeGVCFs are quite different. Specifically, the GLnexus one seems to contain more variants, but with a fairly low ts/tv ratio (~1.8) and many more multi-allelic variants and indels. So I suspect that there are more false-positive calls in the GLnexus merged VCF. Any advice on post-merging filters for the GLnexus cohort VCF?

Thanks!

mlin commented 4 years ago

What exactly is the difference between the gatk and gatk_unfiltered configuration pre-sets, and when do you advise using one or the other?

gatk_unfiltered retains every GVCF variant without regard to quality, whereas gatk applies quality filters in the merging process. gatk_unfiltered is usually not suitable for very large studies because it leads to impractical growth of the runtime and output file size (N=300 would be fine though). Some users understandably prefer the merging step to be "lossless" wrt the GVCFs, if that's practical.

The VCF files produced by GLNexus using gatk preset only contain AF and AQ variant annotations, with AQ equal to QUAL value. Is there a way to output part of the statistics usually contained in the VCF file merged using the standard GATK workflow?

GLnexus doesn't compute them because of the way it's meant to process subsets of the cohort across compute nodes for genotyping. In a complete workflow, the variant-level aggregates are easier to compute in a downstream analytics environment like Apache Spark. This answer doesn't much help users of the standalone open-source version, I know. Some more of the aggregates can be added by bcftools. If there are a selected few that'd be especially helpful to build in, we'd welcome that feedback.
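As a rough illustration of what such a variant-level aggregate involves, here is a minimal Python sketch that derives AN, AC and AF for one site from the per-sample GT fields. It is not how GLnexus or bcftools compute them. The function name `allele_counts` and the example record are hypothetical, and it assumes a tab-separated VCF data line with a GT entry in FORMAT.

```python
def allele_counts(vcf_line):
    """Return (AN, AC per ALT allele, AF per ALT allele) for one VCF data line.

    Assumption: the line is tab-separated and FORMAT contains GT.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    n_alt = len(fields[4].split(","))
    gt_idx = fields[8].split(":").index("GT")

    ac = [0] * n_alt  # one counter per ALT allele
    an = 0            # number of called (non-missing) alleles
    for sample in fields[9:]:
        gt = sample.split(":")[gt_idx]
        # Treat phased (|) and unphased (/) genotypes the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing allele: not counted in AN
            an += 1
            if int(allele) > 0:
                ac[int(allele) - 1] += 1

    af = [count / an if an else 0.0 for count in ac]
    return an, ac, af


# Hypothetical multi-allelic record with four samples, one of them uncalled.
example_line = "chr1\t1000\t.\tA\tG,T\t50\t.\t.\tGT:DP\t0/1:30\t1/2:28\t./.:0\t0/0:35"
an, ac, af = allele_counts(example_line)
print(an, ac, af)
```

Because AN and AC are plain sums over samples, per-subset counts can be added together afterwards, which is why this kind of aggregate is straightforward to compute downstream.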

Doing QC, I've noticed that the GLnexus merged VCF and the one produced by GATK GenotypeGVCFs are quite different. Specifically, the GLnexus one seems to contain more variants, but with a fairly low ts/tv ratio (~1.8) and many more multi-allelic variants and indels. So I suspect that there are more false-positive calls in the GLnexus merged VCF. Any advice on post-merging filters for the GLnexus cohort VCF?

I'd think of both GLnexus and GenotypeGVCFs as applying first-pass filters mainly meant to prevent the aforementioned blowup of runtime and file size. IIRC we calibrated the gatk preset filters to yield similar sensitivity to GenotypeGVCFs, but that was several years ago and it's possible one or both have drifted from that point. Both generally need further filtering before handing off to the downstream analysts, especially for WGS studies where lots of difficult regions are covered (in contrast to targeted WES). Heng Li's "Toward better understanding of artifacts in variant calling from high-coverage samples" (2014) suggests some simple and still-effective filtering dimensions.
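For checking how a candidate filter moves the ts/tv ratio mentioned in the question, a minimal Python sketch follows. It counts only biallelic SNVs, which is one common convention and an assumption here; other tools may treat multi-allelic sites differently, so the numbers are not guaranteed to match theirs. The function name `ts_tv_ratio` and the example records are hypothetical.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(records):
    """Count transitions and transversions over (ref, alt_list) pairs.

    Assumption: only biallelic single-nucleotide variants are counted;
    indels, multi-allelic sites and symbolic alleles are skipped.
    Returns (ts, tv, ratio), with ratio None when there are no transversions.
    """
    ts = tv = 0
    for ref, alts in records:
        if len(alts) != 1:
            continue  # multi-allelic site
        ref, alt = ref.upper(), alts[0].upper()
        if len(ref) != 1 or len(alt) != 1:
            continue  # indel or other non-SNV allele
        if ref not in "ACGT" or alt not in "ACGT" or ref == alt:
            continue  # symbolic, ambiguous or non-variant allele
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    ratio = ts / tv if tv else None
    return ts, tv, ratio


# Hypothetical records: two transitions, one transversion, one deletion,
# and one multi-allelic site (the last two are skipped).
example_records = [
    ("A", ["G"]),
    ("C", ["T"]),
    ("A", ["C"]),
    ("AT", ["A"]),
    ("G", ["A", "T"]),
]
print(ts_tv_ratio(example_records))
```

Running this before and after a filter, on the same set of sites, gives a quick sense of whether the filter is preferentially removing likely artifacts.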

Re multiallelic and indel sites, there is a raft of gnarly issues with overlapping variants in VCF, discussed further in "Reading GLnexus pVCFs" and #210. Those account for some of the difference. Also notable: GenotypeGVCFs has a default --max-alternate-alleles setting much lower than the GLnexus equivalent.

edg1983 commented 4 years ago

Hi. Thanks for the advice. I think that having SB p-value, AB fraction and total DP at the locus already computed would be useful for subsequent filtering, as suggested also in the paper you cited. I will make a more extensive comparison between joint calls obtained from genotypeGVCFs and GLNexus from the same g.vcf cohort to better understand the differences. I would prefer to stay with GLnexus if possible since it is much faster. Great work!
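Two of these statistics can be sketched from the per-sample fields alone. The minimal Python example below sums DP across samples and computes an allele balance for each heterozygous call from AD. It assumes the pVCF carries GT, DP and AD in FORMAT, which should be checked against the actual file. The allele-balance definition used (reads for the higher-numbered called allele divided by reads for both called alleles) is one of several in use, and the function name `site_depth_and_balance` is hypothetical. A strand-bias p-value is not covered, since it needs per-strand read counts.

```python
def site_depth_and_balance(vcf_line):
    """Return (total DP, list of allele balances for heterozygous calls).

    Assumptions: tab-separated VCF data line; FORMAT may contain GT, DP, AD.
    Allele balance here = AD[second allele] / (AD[first] + AD[second]).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")

    total_dp = 0
    balances = []
    for sample in fields[9:]:
        values = dict(zip(keys, sample.split(":")))

        dp = values.get("DP", ".")
        if dp != ".":
            total_dp += int(dp)

        gt = values.get("GT", ".").replace("|", "/").split("/")
        ad = values.get("AD", ".").split(",")
        if len(gt) != 2 or "." in gt or "." in ad:
            continue  # non-diploid or missing genotype / depths
        first, second = sorted(int(a) for a in gt)
        if first == second or second >= len(ad):
            continue  # homozygous call, or AD shorter than expected
        depth = int(ad[first]) + int(ad[second])
        if depth > 0:
            balances.append(int(ad[second]) / depth)
    return total_dp, balances


# Hypothetical record: two heterozygous calls, one hom-ref, one uncalled.
example_site = (
    "chr1\t1000\t.\tA\tG\t50\t.\t.\tGT:DP:AD"
    "\t0/1:30:18,12\t0/0:25:25,0\t0/1:20:5,15\t./.:0:0,0"
)
total_dp, balances = site_depth_and_balance(example_site)
print(total_dp, balances)
```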

jiyy0216 commented 2 years ago

Hi, I'm an undergraduate student at Korea University. I'd like to know whether it is possible to use GATK gVCF files directly as input for GLnexus without any processing. I'd also like to know what the difference is between joint calls obtained from GenotypeGVCFs and from GLnexus for the same g.vcf files. It would be very helpful if you could answer my questions. Thank you.