AlphaGenes / AlphaPeel

AlphaPeel: calling, phasing, and imputing genotype and sequence data in pedigrees
MIT License

Implement alt_allele_prob option #142

Open AprilYUZhang opened 6 months ago

AprilYUZhang commented 6 months ago

So we can change allele freq in the founders (not setting to 0.5 and not estimating from the data)

RosCraddock commented 6 months ago

@XingerTang - assigned to this issue.

XingerTang commented 6 months ago

@RosCraddock @gregorgorjanc

Coding tasks to do:

gregorgorjanc commented 5 months ago

I spoke with @AprilYUZhang today and she pointed out some bits that I want to clarify here.

Current state in AlphaPeel is the following:

  1. Estimate alternative allele probability based on observed genotyped data (using Newton method on genotype probabilities from observed genotypes (something like this), so accounting for genotyping error)
  2. Take 1. and set it for the rest of the program execution
  3. Use 1. to set anterior term for founders (since alternative allele probability is fixed, so are these anterior terms)
  4. Peel down and up a couple of times to propagate observed genotype information across pedigree
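
For illustration, steps 1 to 3 can be sketched as below. This is a minimal sketch under stated assumptions, not AlphaPeel's code: it swaps the Newton method for a plain average of expected allele dosage, assumes the four phased genotype states are ordered aa, aA, Aa, AA, assumes Hardy-Weinberg proportions for the founder anterior term, and uses hypothetical function names throughout.

```python
from typing import List, Optional, Sequence

# Assumed state order for the four phased genotypes: aa, aA, Aa, AA.


def estimate_alt_allele_prob(genotype_probs: Sequence[Sequence[float]]) -> float:
    """Average expected alternative-allele dosage (divided by 2) over individuals.

    A simple stand-in for the Newton-method estimate mentioned in step 1.
    """
    if not genotype_probs:
        return 0.5  # assumption: uninformative fallback when nothing is observed
    total = 0.0
    for aa, aA, Aa, AA in genotype_probs:
        norm = aa + aA + Aa + AA
        total += (0.5 * (aA + Aa) + AA) / norm
    return total / len(genotype_probs)


def resolve_alt_allele_prob(
    user_value: Optional[float],
    genotype_probs: Sequence[Sequence[float]],
) -> float:
    """Use a user-supplied value if given, otherwise estimate from the data."""
    if user_value is not None:
        if not 0.0 <= user_value <= 1.0:
            raise ValueError("alternative allele probability must be in [0, 1]")
        return user_value
    return estimate_alt_allele_prob(genotype_probs)


def founder_anterior(p: float) -> List[float]:
    """Founder anterior term under Hardy-Weinberg proportions (an assumption)."""
    q = 1.0 - p
    return [q * q, q * p, p * q, p * p]
```

Passing a value for `user_value` corresponds to the option requested in this issue: the founder allele probability is then neither 0.5 nor estimated from the data.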

There are three issues with the above:

  a) In 1. we estimate the alternative allele probability for an "undefined" population (we take any observed genotypes in the pedigree), while we really need the base population alternative allele probability. The estimate based on the "undefined" population is not the base population estimate, but it probably isn't miles off; see also c).
  b) As we discussed in person, a) will not do what we need for metafounders; see also c).
  c) Once we get the estimate, keeping it fixed might not be what we want. Even if the estimate from 1. is slightly off, we could use it as a starting value and then update the base population alternative allele probability by estimating it from the inferred individual genotype probabilities of just the founders, which could converge to a better solution. This might make the running time of AlphaPeel longer, or we might need more peeling runs. At the moment we effectively use a simple estimate and fix it, and given that estimate we then estimate individual genotype probabilities. This starting value and convergence approach could well work for more than one metafounder too, so there is hope for b) too.

The above suggests that we would like to end up in this "correct" state:

  1. Estimate alternative allele probability based on observed genotyped data (using Newton method on genotype probabilities from observed genotypes (something like this), so accounting for genotyping error) --> test how the linear model method with genetic groups could serve us better, but note that even a starting value and updates in the founders could work well, so I suggest we do this linear model method last
  2. Take 1. and set it for the rest of the program execution --> I would like us to explore updating base population allele probability with every round of peeling (we start going down and then up, so when we come up, we have genotype probs for founders and we can estimate allele prob there, even separated by multiple metafounders)
  3. Use 1. to set anterior term for founders (since alternative allele probability is fixed, so are these anterior terms) --> implementing the change in 2. means we would update the anterior term for founders every iteration too
  4. Peel down and up a couple of times to propagate observed genotype information across pedigree --> hopefully the above changes would not make the algorithm/runtime much slower (as in, that we would need more iterations)
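
The per-metafounder estimate mentioned in 2. could look roughly like the sketch below. It is only an illustration: the group labels, the function name, and the aa, aA, Aa, AA state order are assumptions, and the estimate is a plain average of expected allele dosage over the founders assigned to each group.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

# Assumed state order for the four phased genotypes: aa, aA, Aa, AA.


def allele_prob_by_group(
    founders: Sequence[Tuple[str, Sequence[float]]],
) -> Dict[str, float]:
    """Estimate the alternative allele probability separately per group.

    `founders` holds (group_label, genotype_probs) pairs, where the label
    is a hypothetical metafounder identifier for that founder.
    """
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for group, (aa, aA, Aa, AA) in founders:
        norm = aa + aA + Aa + AA
        sums[group] += (0.5 * (aA + Aa) + AA) / norm
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Toy input: two founders under one metafounder, one under another.
example = [
    ("MF_1", [1.0, 0.0, 0.0, 0.0]),
    ("MF_1", [0.0, 0.5, 0.5, 0.0]),
    ("MF_2", [0.0, 0.0, 0.0, 1.0]),
]
estimates = allele_prob_by_group(example)
```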

XingerTang commented 5 months ago

@gregorgorjanc Thank you for summarizing this! There is just one point I would like to clarify. In steps 2 and 3 of the "correct" state, you mention that we would update the estimate of the alternative allele probability every peeling cycle and use the updated allele probability to re-estimate the anterior terms. However, as we discussed, the information contained in the updated alternative allele probability is the same as the information contained in the anterior terms after each peeling cycle. If we re-estimate the anterior terms from the updated alternative allele probability, they would be the same as before the re-estimation. So we would probably only do the estimation at the very beginning of the whole peeling process, for peeling accuracy, and the re-estimation at the very end, for a more accurate alternative allele probability output.

gregorgorjanc commented 5 months ago

@XingerTang right, I keep forgetting that with the addition of metafounders the founders of the new internal pedigree are the metafounders, which are “parents” of all our actual founding individuals! Let’s see … so, these metafounders will have anterior, penetrance, and “posterior” terms. When we have a starting allele prob (passed by the user or estimated from the data) we should use that for the anterior term of the metafounder(s). Then we peel down and up the pedigree. Once we come up, we will have estimated individual genotype probabilities for the metafounder(s) by combining the anterior and “posterior” terms (the “posterior” term will collect all the information from all descendants of each metafounder), while penetrance will always be unknown for metafounders (unless we have some prior information). These estimated individual genotype probabilities for the metafounder(s) are in fact estimated base population genotype probabilities, and we can simply convert these to an estimate of the base population allele frequency (possibly for more than one metafounder). Having this estimate, we can update the anterior term of the metafounder(s) and repeat peeling down and up. There will be a cycle/loop of information flow, so we will have to test how it works in terms of accuracy and runtime until convergence (we might need to add an actual convergence metric!). How does this sound?
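
A rough sketch of that loop for a single metafounder and a single locus, only to make the information flow concrete. Everything here is an assumption rather than AlphaPeel code: `peel_down_and_up` is a hypothetical callable standing in for a full peeling cycle that returns the metafounder's “posterior” term, the states are ordered aa, aA, Aa, AA, the anterior term uses Hardy-Weinberg proportions, and the convergence metric is simply the absolute change in allele probability between cycles.

```python
from typing import Callable, List, Sequence, Tuple


def hwe_anterior(p: float) -> List[float]:
    """Anterior term for a metafounder given allele probability p."""
    q = 1.0 - p
    return [q * q, q * p, p * q, p * p]


def combine(anterior: Sequence[float], posterior: Sequence[float]) -> List[float]:
    """Multiply anterior and posterior terms and normalise to sum to 1."""
    product = [a * b for a, b in zip(anterior, posterior)]
    total = sum(product)
    if total <= 0.0:
        raise ValueError("anterior and posterior terms are incompatible")
    return [x / total for x in product]


def allele_prob(genotype_probs: Sequence[float]) -> float:
    """Convert genotype probabilities to an alternative allele probability."""
    aa, aA, Aa, AA = genotype_probs
    return 0.5 * (aA + Aa) + AA


def update_until_converged(
    p_start: float,
    peel_down_and_up: Callable[[List[float]], Sequence[float]],
    tol: float = 1e-6,
    max_cycles: int = 50,
) -> Tuple[float, int]:
    """Repeat peel / re-estimate cycles until the allele probability settles."""
    p = p_start
    for cycle in range(1, max_cycles + 1):
        anterior = hwe_anterior(p)
        posterior = peel_down_and_up(anterior)  # hypothetical peeling cycle
        p_new = allele_prob(combine(anterior, posterior))
        if abs(p_new - p) < tol:  # simple convergence metric (an assumption)
            return p_new, cycle
        p = p_new
    return p, max_cycles
```

In this sketch, a flat “posterior” term (no information from descendants) leaves the starting value unchanged, so the loop stops after one cycle. Whether the feedback behaves well with real peeling output, and how many cycles it needs, is what would have to be tested.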

XingerTang commented 5 months ago

@gregorgorjanc Sure, it sounds doable.