@ldgauthier commented on Wed Oct 21 2015
@vruano and I both looked at this and we can't figure out why this globalPrior (by default 0.001 = SNP heterozygosity) is being added to ALL alleles (alt and ref). If no other AC priors are used, this seems to apply a prior that skews the genotypes in favor of homozygotes.
I'll look at this while I'm doing the analysis for CGP in production recommendations.
@vdauwera commented on Tue May 17 2016
Is this still a thing?
@ldgauthier commented on Tue May 17 2016
It is still a thing in that it has been untouched and uninvestigated. I might get some time this week.
@ldgauthier commented on Tue May 17 2016
Actually @davidbenjamin since you've been thinking a lot about genotypes and priors recently, do you think you'll have time to take a look? Chris Hartl did the model, so I don't know all the details, but it shouldn't be too hard to untangle.
@davidbenjamin commented on Tue May 17 2016
@ldgauthier Sure, I can take a look.
@davidbenjamin commented on Tue May 17 2016
I agree with @ldgauthier and @vruano. The same prior pseudocount is being added to all alleles, alt and ref, and this is incorrect. Since the default pseudocount is 0.001 this is a symmetric sparse prior which, as @ldgauthier said, favors a small number of alleles and hence biases in favor of homozygous genotypes.
A more correct thing to do -- not necessarily the most correct, but not manifestly wrong -- would be to give 1000 prior pseudocounts to the ref allele and 1 prior pseudocount to each alt allele (for SNPs, at least).
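To make the skew concrete, here is a small sketch (not GATK code; an illustrative biallelic diploid model in which a genotype is two draws from allele frequencies f ~ Dirichlet(alpha), i.e. a Dirichlet-multinomial over the pseudocounts):

```python
def genotype_priors(alpha_ref, alpha_alt):
    """Prior P(hom-ref), P(het), P(hom-alt) for a diploid genotype under a
    Dirichlet-multinomial with allele pseudocounts (alpha_ref, alpha_alt)."""
    a = alpha_ref + alpha_alt
    hom_ref = alpha_ref * (alpha_ref + 1) / (a * (a + 1))
    het = 2 * alpha_ref * alpha_alt / (a * (a + 1))
    hom_alt = alpha_alt * (alpha_alt + 1) / (a * (a + 1))
    return hom_ref, het, hom_alt

# symmetric sparse prior (the current behavior, 0.001 per allele):
# the het prior collapses to ~0.1% while each homozygote gets ~50%
print(genotype_priors(0.001, 0.001))

# asymmetric prior (1000 ref / 1 alt): het prior ~0.2%, hom-alt tiny,
# hom-ref ~99.8% -- a sane-looking SNP heterozygosity prior
print(genotype_priors(1000.0, 1.0))
```

Under the symmetric prior a het call has to overcome roughly 500:1 prior odds against it relative to either homozygote, which is the bias toward homozygous genotypes described above.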
A couple more things I noticed in this code:
PosteriorLikelihoodsUtils has a nomenclature problem. Likelihoods, priors, and posteriors are getting conflated and combined in various amusing permutations, chief among them being that a posterior likelihood is an oxymoron.
It seems like there may be an overcounting problem in PosteriorLikelihoodsUtils::calculatePosteriorGLs. If I follow correctly, lines 82-100 get the total counts for each allele in the resources and in the input, via the MLEAC or AC fields, and add the prior pseudocounts. Lines 111-144 grab the genotype likelihoods from the input. In line 149 we pass the allele counts and likelihoods to calculatePosteriorGLs. This method uses the allele counts (prior + resources + input AC) to define a Dirichlet distribution which then serves as the prior on genotypes. Finally, on line 203 we multiply (add in log space) this prior by the genotype likelihoods to get the posterior probabilities on genotypes. It appears to me that we have double-counted the input data, once to get its AC field and once to get its GLs. I believe the correct thing to do is use only the resources to define a prior which is then combined with the GLs to get the posterior.
@ldgauthier commented on Tue May 17 2016
I will take the blame (both figuratively and the literal git blame) for PosteriorLikelihoodsUtils nomenclature problems. I either initiated them or didn't fix them when I refactored.
I also intuitively prefer resources only without using the input AC, but that being said we've seen better results using both, specifically for a Finnish cohort with 100 founders. In the DSDE/ATGU meetings the use of the input AC was discussed as being analogous to a single step of EM. Would the true EM apply a different update for each sample in the callset?
@davidbenjamin commented on Tue May 17 2016
The better results using the double-counting might have something to do with the incorrect prior -- if the prior is skewing toward homozygosity, then double-counting your variant data might counteract this and rescue some variant genotypes, which will be mainly hets.
The EM model that people implicitly seem to have in mind is alternating E steps on each sample to get genotype posteriors with M steps to learn the allele frequencies. So let's work out what happens if you do just one iteration:
0) Initialize allele frequencies to the mean of the Dirichlet heterozygosity prior; i.e. ~1 for ref, ~1/1000 for each alt, plus any allele counts from the resources. Genotype priors come from the multinomial distribution (one genotype is a draw of 2 alleles) of these allele frequencies.
1) (E step) genotype posteriors are the product of genotype likelihoods with the priors from step 0). Pseudocounts are the sum of expected posterior allele counts.
2) (M step) MLE allele frequencies are the mode of the Dirichlet parameterized by the sum of the original step 0) prior+resources pseudocounts with the E step pseudocounts from step 1).
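The single iteration above can be sketched as a toy (not the actual code: biallelic, diploid, genotypes ordered hom-ref, het, hom-alt; I use the Dirichlet mean rather than the mode in the update, since the mode (alpha-1)/(sum-K) misbehaves for pseudocounts below 1):

```python
import numpy as np

def one_em_iteration(prior_counts, log10_likelihoods):
    """prior_counts: (ref, alt) pseudocounts from heterozygosity prior + resources.
    log10_likelihoods: per-sample log10 GLs, shape (num_samples, 3)."""
    # step 0: genotype priors from the mean allele frequencies
    f = prior_counts / prior_counts.sum()
    genotype_priors = np.array([f[0] ** 2, 2 * f[0] * f[1], f[1] ** 2])
    # E step: per-sample genotype posteriors (normalized prior * likelihood)
    posteriors = genotype_priors * 10.0 ** np.asarray(log10_likelihoods)
    posteriors /= posteriors.sum(axis=1, keepdims=True)
    # expected allele counts: hom-ref gives 2 ref, het 1 of each, hom-alt 2 alt
    expected = np.array([(2 * posteriors[:, 0] + posteriors[:, 1]).sum(),
                         (posteriors[:, 1] + 2 * posteriors[:, 2]).sum()])
    # M step: updated frequencies from prior + expected posterior counts
    new_counts = prior_counts + expected
    return new_counts / new_counts.sum()

# one confident het sample shifts the alt frequency up from ~1/1000 to ~2/1000
prior = np.array([1000.0, 1.0])
print(one_em_iteration(prior, [[-10.0, 0.0, -10.0]]))
```

Note that the updated frequencies fold the input sample's expected counts back into the prior-plus-resources counts, which is exactly the prior + resources + input AC combination described above.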
Hmmm that does sound a lot like what the code is doing now. I suppose it's defensible after all.
@ldgauthier commented on Thu May 19 2016
So it sounds to me like the action item here is to fix the Dirichlet heterozygosity prior. I like the idea of adding one count for the ref and 1/1000 for each alt (rather than, for example, 1000 for ref and one for alt) so the heterozygosity prior does something in the absence of external resource counts, but doesn't overwhelm them if they are present. @davidbenjamin Can you think of a more rigorous justification for the scaling of counts between sample genotype allele counts and the heterozygosity?
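A quick illustration of why the scale matters (the resource numbers here are hypothetical): 1 ref + 1/1000 alt and 1000 ref + 1 alt encode the same mean allele frequency, but they defer very differently to external counts:

```python
import numpy as np

# hypothetical resource panel contributing 200 ref / 2 alt allele counts
resource = np.array([200.0, 2.0])

for label, prior in [("weak (1, 0.001)", np.array([1.0, 0.001])),
                     ("strong (1000, 1)", np.array([1000.0, 1.0]))]:
    total = prior + resource
    print(label, "-> posterior alt frequency", total[1] / total.sum())
```

The weak prior lets the resource's ~1% alt frequency dominate (combined alt frequency ~0.0099), while the strong prior drags it most of the way back toward 1/1000 (~0.0025), i.e. it overwhelms the resources.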
@vdauwera commented on Mon Nov 14 2016
Is this still alive? To be continued in GATK3 or 4?
@ldgauthier commented on Tue Nov 15 2016
You can move to GATK4.