dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0
150 stars 38 forks source link

AF calculation from DeepVariant #161

Open bopohdr opened 5 years ago

bopohdr commented 5 years ago

I am trying to use GLnexus on DNAnexus platform for gvcfs produced by DeepVariant. For positions that was not covered (or by some reads only), the genotype is 0/0 - which assumes it is homozygous and is used for AF calculation. Example: `

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | ZL556| ZL553

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- chr1 | 1020217 | chr1_1020217_G_T | G | T | 0 | . | AF=0.25 | GT:DP:AD:GQ:PL:RNC | 0/0:0:0,0:1:.:.. | 0/1:3:1,2:3:0,0,26:.. chr1 | 1041134 | chr1_1041134_C_A | C | A | 33 | . | AF=0.5;AQ=33 | GT:DP:AD:GQ:PL:RNC | 0/1:4:0,4:28:33,0,29:.. | 0/1:8:1,7:16:17,0,21:.. ` First case has AF=0.25 while sample ZL556 do not have any reads at this positions.

For sentieon it was at least ./. or similar.

Is it possible to add/change/improve something in DNAnexus version to take low DP or 0 DP in account, when calculating AF and assigning genotypes ?

xunjieli commented 5 years ago

Doesn't GLnexus already support this? You can do a DP filtering by specifying the genotyper_config::required_dp (https://github.com/dnanexus-rnd/GLnexus/blob/master/include/types.h#L702)

bopohdr commented 5 years ago

Doesn't GLnexus already support this? You can do a DP filtering by specifying the genotyper_config::required_dp (https://github.com/dnanexus-rnd/GLnexus/blob/master/include/types.h#L702)

I want to use online version deposited at DNAnexus and to my knowledge there is no such option to change the DP available

mlin commented 5 years ago

My apologies for missing this and the earlier question. Yes, currently the DNAnexus applet lets you choose from among the preset configurations, but lacks the capability to take in a custom YML configuration file as glnexus_cli can from the command-line. This is easy to add and I'll do so shortly.

mlin commented 5 years ago

@bopohdr I have added the necessary dxapplet input in v1.1.5 which you can find in the GLnexus_Getting_Started project.

Here's an example yml file with a version of the DeepVariant configuration plus genotyper_config.required_dp = 1. This will cause GLnexus to no-call cells with zero supporting reads even if the input gVCF says 0/0.

Upload that yml to your project and then you can run it like so

dx run GLnexus -i config_yml=DeepVariant_DP1.yml -i bed_ranges_to_genotype=hg19_chr21.bed -i output_name=dv_platinum6_chr21 -i gvcf=dv_platinum6_chr21_gvcf/NA12877.chr21.gvcf.gz -i gvcf=dv_platinum6_chr21_gvcf/NA12878.chr21.gvcf.gz -i gvcf=dv_platinum6_chr21_gvcf/NA12891.chr21.gvcf.gz -i gvcf=dv_platinum6_chr21_gvcf/NA12890.chr21.gvcf.gz -i gvcf=dv_platinum6_chr21_gvcf/NA12889.chr21.gvcf.gz -i gvcf=dv_platinum6_chr21_gvcf/NA12892.chr21.gvcf.gz         -y --watch

GLnexus echoes the active configuration YAML into the job log so you can double-check the required_dp setting.

HTH & I'm sorry again for missing the inquiry before -- too many gmail accounts these days :(

bopohdr commented 5 years ago

@bopohdr I have added the necessary dxapplet input in v1.1.5 which you can find in the GLnexus_Getting_Started project.

Here's an example yml file with a version of the DeepVariant configuration plus genotyper_config.required_dp = 1. This will cause GLnexus to no-call cells with zero supporting reads even if the input gVCF says 0/0.

Upload that yml to your project and then you can run it like so

dx run GLnexus -i config_yml=DeepVariant_DP1.yml -i bed_ranges_to_genotype=hg19_chr21.bed -i output_name=dv_platinum6_chr21 -i gvcf=dv_platinum6_chr21_gvcf/NA12877.chr21.gvcf.gz -i gvcf=dv_platinum6_chr21_gvcf/NA12878.chr21.gvcf.gz -i gvcf=dv_platinum6_chr21_gvcf/NA12891.chr21.gvcf.gz -i gvcf=dv_platinum6_chr21_gvcf/NA12890.chr21.gvcf.gz -i gvcf=dv_platinum6_chr21_gvcf/NA12889.chr21.gvcf.gz -i gvcf=dv_platinum6_chr21_gvcf/NA12892.chr21.gvcf.gz         -y --watch

GLnexus echoes the active configuration YAML into the job log so you can double-check the required_dp setting.

HTH & I'm sorry again for missing the inquiry before -- too many gmail accounts these days :(

Thank You very much ! On applet at DNAnexus.com it worked good. in the .yml file You provided I changed only DP settings. However, I think GLnexus still counts those with DP below threshold for AF calculation:

chr1 1708855 chr1_1708855_TTCCTCCTCC_T TTCCTCCTCC T 41 . AF=0.25;AQ=41 GT:DP:AD:GQ:PL:RNC ./.:3:3,0:9:.:.. 0/0:14:3,11:8:0,30,6:.. 0/0:73:73,0:50:.:.. 1/1:18:0,18:40:41,48,0:..
chr1 1709369 chr1_1709369_C_T C T 26 . AF=0.125;AQ=26 GT:DP:AD:GQ:PL:RNC ./.:.:.:.:.:MM ./.:.:.:.:.:MM 0/1:14:6,8:26:26,0,36:.. ./.:0:0,0:1:.:.. ``

e.g. the second variant is called only in one sample as a heterozygous genotype, although AF is 0.125.

Is this intended usage or it could be fixed somehow?

mlin commented 5 years ago

@bopohdr Yea the AF INFO field is just copy number divided by 2N.

https://github.com/dnanexus-rnd/GLnexus/blob/e25a6be613fbd1879477d62c7258a2dcfe8dd896/src/unifier.cc#L530

Sorry if this isn't what you want...it has come up in discussions before and I'm open to feedback on whether the denominator should subtract non-calls. My argument is that both of these are flawed estimators with undesirable corner cases, so I prefer the simpler one. If one is really concerned with the best possible estimator then it should use the likelihood function on the het/hom genotype counts and a prior distribution on the allele frequency.

bopohdr commented 5 years ago

@mlin Thank You. I understand that there are flaws in both approaches of frequency calculations. As I understand it is not possible to set option for calculating frequency without non-calls in ".yml" file, which is provided for online app ? About feedback: maybe it also would be useful to have output option similar to mutation frequency in gnomad (1543 out of 245490 (that were covered at the position)).