dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

Using GLnexus to merge gVCFs produced by isaac/starling #153

Open Alexander-Stuckey opened 5 years ago

Alexander-Stuckey commented 5 years ago

I'm trialing GLnexus to merge gVCFs produced by Illumina's isaac/starling workflow, as an alternative to gvcfgenotyper (which is extremely memory-heavy).

I know it's not a supported configuration, but I have had some success using the yaml for Strelka (with some minor modifications, changing the allele_dp_format field from min_dp to dp).

This does work and produces a merged gVCF, but I've noticed the following issues:

1. Output format fields for AD and GQ can have . as an entry instead of an int value.
2. If the value for min_GQ is set higher than min_AQ1 and min_AQ2, then every variant is filtered out.

The second issue is easy to work around, but the first is presenting problems when trying to use the produced gVCF downstream. For example, importing it into Hail throws an error, since Hail expects ints.
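To pin down where the . entries occur before an import fails, a minimal sketch along these lines could scan an uncompressed VCF text stream and report every AD or GQ value that is missing. The function name `find_missing_format_values` and the two-sample demo record are hypothetical, made up for illustration; they are not GLnexus output, and the merged BCF would first need converting to text (for example with bcftools view).

```python
def find_missing_format_values(vcf_lines, fields=("AD", "GQ")):
    """Yield (chrom, pos, sample, field, raw_value) for each missing ('.') entry."""
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information headers
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample names follow the FORMAT column
            continue
        if len(cols) < 10:
            continue  # record carries no genotype columns
        keys = cols[8].split(":")
        for sample, entry in zip(samples, cols[9:]):
            values = entry.split(":")
            for field in fields:
                if field not in keys:
                    continue
                idx = keys.index(field)
                # VCF allows trailing fields to be dropped; treat those as missing
                raw = values[idx] if idx < len(values) else "."
                if "." in raw.split(","):
                    yield cols[0], int(cols[1]), sample, field, raw


# Hypothetical demo data, not taken from any real callset.
demo_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:AD:GQ\t0/1:10,5:30\t./.:.,.:.",
]
hits = list(find_missing_format_values(demo_lines))
for hit in hits:
    print(hit)
```

On the demo record this flags both AD and GQ for sample S2 and nothing for S1, which gives a quick list of sites and samples to inspect.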

Is this something that you would be able to help / advise with?

Below is the full config file that I am using:

unifier_config:
  drop_filtered: false
  min_allele_copy_number: 1
  min_AQ1: 0
  min_AQ2: 0
  min_GQ: 0
  max_alleles_per_site: 0
  monoallelic_sites_for_lost_alleles: false
  preference: common
genotyper_config:
  revise_genotypes: false
  min_assumed_allele_frequency: 0.0001
  required_dp: 0
  allow_partial_data: false
  allele_dp_format: AD
  ref_dp_format: DP
  output_residuals: false
  squeeze: false
  output_format: BCF
  liftover_fields:
    - {orig_names: [MIN_DP, DP, DPI], name: DP, description: "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read depth (reads with MQ=255 or with bad mates are filtered)\">", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
    - {orig_names: [AD], name: AD, description: "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths for the ref and alt alleles in the order listed\">", type: int, number: alleles, default_type: zero, count: 0, combi_method: min, ignore_non_variants: true}
    - {orig_names: [GQ], name: GQ, description: "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">", type: float, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
    - {orig_names: [FILTER], name: FT, description: "##FORMAT=<ID=FT,Number=1,Type=String,Description=\"FILTER field from sample gVCF\">", type: string, number: basic, default_type: missing, count: 1, combi_method: missing, ignore_non_variants: true}
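As a sanity check on the liftover_fields entries, here is a rough sketch that compares each field's `type:` value with the `Type=` declared in its header description. The field list is transcribed from the config above with the Description text shortened to "..."; the mapping from config types to VCF header types (`int` to `Integer`, and so on) is an assumption made for this check, not something taken from the GLnexus documentation.

```python
import re

# Transcribed from the config above; Description text shortened to "...".
liftover_fields = [
    {"name": "DP", "type": "int",
     "description": '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="...">'},
    {"name": "AD", "type": "int",
     "description": '##FORMAT=<ID=AD,Number=.,Type=Integer,Description="...">'},
    {"name": "GQ", "type": "float",
     "description": '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="...">'},
    {"name": "FT", "type": "string",
     "description": '##FORMAT=<ID=FT,Number=1,Type=String,Description="...">'},
]

# Assumed correspondence between config types and VCF header types.
EXPECTED_VCF_TYPE = {"int": "Integer", "float": "Float", "string": "String"}


def type_mismatches(fields):
    """Return (name, config_type, header_type) for fields whose types disagree."""
    mismatches = []
    for field in fields:
        match = re.search(r"Type=(\w+)", field["description"])
        header_type = match.group(1) if match else None
        if EXPECTED_VCF_TYPE.get(field["type"]) != header_type:
            mismatches.append((field["name"], field["type"], header_type))
    return mismatches


mismatches = type_mismatches(liftover_fields)
print(mismatches)
```

With the config as posted, the only entry flagged is GQ, which is declared `type: float` while its header says `Type=Integer`. Whether that mismatch has anything to do with the . entries is not established in this thread.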

mlin commented 5 years ago

Hi, I'd like to help with this -- it's usually a little bit of a project to get all the details right and to calibrate the quantitative thresholds. (The Strelka2 config itself is super rough.) Do you have any less-sensitive test article gVCF files that could be shared for that purpose (publicly or privately)? You're welcome to email me (mlin at dnanexus.com) if you'd like to set up a call to discuss further!

Alexander-Stuckey commented 5 years ago

I'll see if I can wrangle some test gVCFs that I can share. Unfortunately, I can't share the ones that we have internally.