AprilYUZhang opened 6 months ago
@XingerTang - assigned to this issue.
@RosCraddock @gregorgorjanc
Coding tasks to do:

- Modify `tinyhouse.pedigree` to store the information of the metafounders while reading in the pedigree file:
  - Add a `MetaFounder` flag/attribute to the `Individual` class; while reading in the pedigree data, set the flag to `True` for each individual whose name starts with `MF_` (can also check whether metafounders are actually founders, and raise errors if not).
  - Add a default `MF_1` individual to the individual list of the `Pedigree` object after the input pedigree data is read in `Pedigree.readInPedigree`, so that for each individual in the pedigree:
    `if both parents == None and not MetaFounder: set MF_1 as parents`
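The flag-and-default-parent rule above could look something like the sketch below. The class and method names (`Individual`, `Pedigree`, `read_in_pedigree`) are illustrative stand-ins, not the actual `tinyhouse` API:

```python
class Individual:
    def __init__(self, idx, sire=None, dam=None):
        self.idx = idx
        self.sire = sire
        self.dam = dam
        # Flag individuals whose name starts with "MF_" as metafounders
        self.meta_founder = idx.startswith("MF_")


class Pedigree:
    def __init__(self):
        self.individuals = {}

    def read_in_pedigree(self, rows):
        # rows: iterable of (id, sire_id, dam_id), with "0" meaning unknown
        for idx, sire, dam in rows:
            ind = Individual(idx,
                             None if sire == "0" else sire,
                             None if dam == "0" else dam)
            # Metafounders must themselves be founders
            if ind.meta_founder and (ind.sire or ind.dam):
                raise ValueError(f"Metafounder {idx} has parents")
            self.individuals[idx] = ind
        # Add the default metafounder and attach it to parentless individuals
        self.individuals["MF_1"] = Individual("MF_1")
        for ind in self.individuals.values():
            if ind.sire is None and ind.dam is None and not ind.meta_founder:
                ind.sire = ind.dam = "MF_1"
```

With this, every non-metafounder founder ends up with `MF_1` as both parents, while explicitly named metafounders (e.g. `MF_2`) stay parentless.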
- Modify `alphapeel.peelinginfo` to store the corresponding alternative allele frequency for each of the metafounders:
  - `jit_peelingInformation.nMF` to store the number of metafounders in the population
  - `jit_peelingInformation.MFList` to store the list of the metafounder individuals (or their ids)
  - `jit_peelingInformation.maf` as an `nMF` $\times$ `nLoci` numpy matrix
- Modify `alphapeel.peelinginfo` to allow a user-defined alternative allele frequency to be used in the calculation
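The proposed fields might be laid out roughly as follows. The `PeelingInfo` class here is a plain-Python stand-in for `jit_peelingInformation`; the attribute names mirror the list above, but everything else (constructor signature, default of 0.5) is an assumption:

```python
import numpy as np

class PeelingInfo:
    def __init__(self, mf_ids, n_loci, alt_allele_prob=None):
        self.mf_list = list(mf_ids)      # metafounder ids (MFList)
        self.n_mf = len(self.mf_list)    # number of metafounders (nMF)
        if alt_allele_prob is not None:
            # user-defined alternative allele frequencies, one row per MF
            self.maf = np.asarray(alt_allele_prob, dtype=np.float64)
        else:
            # default: start every metafounder at 0.5 for every locus
            self.maf = np.full((self.n_mf, n_loci), 0.5)
        assert self.maf.shape == (self.n_mf, n_loci)
```

Keeping `maf` as an `nMF` x `nLoci` matrix means a pedigree with a single metafounder is just the special case `nMF == 1`.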
- Modify `alphapeel.tinypeel` to add the option `alt_allele_prob`
- Handle the case when both `alt_allele_prob` and `est_alt_prob` are used (GG: in that case we use `alt_allele_prob` as a starting value for `est_alt_prob`)
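The option handling could be wired up along these lines. The flag names come from the task list above, but the parser setup and the idea of passing the frequencies as a file are assumptions:

```python
import argparse

# Hypothetical sketch of the new tinypeel options, not the actual CLI wiring
parser = argparse.ArgumentParser()
parser.add_argument("-alt_allele_prob", type=str, default=None,
                    help="File with user-defined alternative allele "
                         "frequencies, one row per metafounder.")
parser.add_argument("-est_alt_prob", action="store_true",
                    help="Estimate the alternative allele frequencies.")

args = parser.parse_args(["-alt_allele_prob", "freqs.txt", "-est_alt_prob"])

# If both options are given, use the user-supplied frequencies only as the
# starting value for the estimation (GG's suggestion above)
start_from_user_value = args.alt_allele_prob is not None and args.est_alt_prob
```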
I spoke with @AprilYUZhang today and she pointed out some bits that I want to clarify here.
Current state in AlphaPeel is the following:
1. Estimate alternative allele probability based on observed genotype data (using the Newton method on genotype probabilities from observed genotypes (something like this), so accounting for genotyping error)
2. Take 1. and set it for the rest of the program execution
3. Use 1. to set the anterior term for founders (since the alternative allele probability is fixed, so are these anterior terms)
4. Peel down and up a couple of times to propagate observed genotype information across the pedigree
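Step 1 can be illustrated with the sketch below. AlphaPeel's actual code uses a Newton step on this likelihood; for readability this sketch maximises the same objective with an EM-style fixed-point update instead, which is an assumption made for clarity, not the real implementation:

```python
import numpy as np

def estimate_alt_freq(geno_probs, n_iter=20):
    """geno_probs: (n_individuals, 3) likelihoods of carrying 0/1/2 copies
    of the alternative allele; genotyping error is already folded into
    these probabilities, so the estimate accounts for it."""
    p = 0.5
    alt_dose = np.array([0.0, 1.0, 2.0])
    for _ in range(n_iter):
        # Hardy-Weinberg genotype prior at the current frequency p
        hwe = np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
        post = geno_probs * hwe                # unnormalised posteriors
        post /= post.sum(axis=1, keepdims=True)
        # update: expected alternative allele count over 2N sampled alleles
        p = (post @ alt_dose).sum() / (2 * len(geno_probs))
        p = float(np.clip(p, 1e-6, 1 - 1e-6))  # keep away from boundaries
    return p
```

For individuals genotyped with certainty this reduces to simple allele counting; with uncertain genotype probabilities it weights each genotype by its posterior.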
There are a few issues with the above:

a) In 1. we are estimating the alternative allele probability for an "undefined" population (we take any observed genotypes in the pedigree), while we really need the base population alternative allele probability. While the estimate based on the "undefined" population is not the base population estimate, it probably isn't miles off, but see also c).

b) As we discussed in person, a) will not do what we need for metafounders, but see also c).

c) Once we get the estimate, keeping it fixed might not be what we want. Even if we have a slightly off estimate from 1., if we use it as a starting value and then update the base population alternative allele probability by estimating it from inferred individual genotype probabilities for just the founders, then we could converge to a better solution. This might make the running time of AlphaPeel longer / we might need more peeling runs; at the moment we effectively use a simple estimate and fix it, so given that estimate we then estimate individual genotype probs. This starting-value-and-convergence approach could actually work well for more than one metafounder too, so there is hope for b) too.
The above suggests that we would like to end up in this "correct" state:
1. Estimate alternative allele probability based on observed genotype data (using the Newton method on genotype probabilities from observed genotypes (something like this), so accounting for genotyping error) --> test how the linear model method with genetic groups could serve us better, but note that even a starting value and updates in the founders could work well, so I suggest we do this linear model method last
2. Take 1. and set it for the rest of the program execution --> I would like us to explore updating the base population allele probability with every round of peeling (we start going down and then up, so when we come up, we have genotype probs for founders and we can estimate the allele prob there, even separated by multiple metafounders)
3. Use 1. to set the anterior term for founders (since the alternative allele probability is fixed, so are these anterior terms) --> implementing the change in 2. means we would update the anterior term for founders every iteration too
4. Peel down and up a couple of times to propagate observed genotype information across the pedigree --> hopefully the above changes would not make the algorithm/runtime much slower (as in, that we would need more iterations)
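The per-round update proposed in 2. could be sketched as follows: after peeling up, take the inferred genotype probabilities of the founders, grouped by metafounder, and convert them into an updated base-population allele frequency per metafounder. Function and variable names here are illustrative, not AlphaPeel internals:

```python
import numpy as np

def update_base_allele_freq(founder_geno_probs, founder_mf, n_mf):
    """founder_geno_probs: (n_founders, n_loci, 3) inferred genotype probs
    founder_mf: (n_founders,) metafounder index of each founder
    returns: (n_mf, n_loci) alternative allele frequencies"""
    # expected alternative allele count (dosage) per founder and locus
    dose = founder_geno_probs @ np.array([0.0, 1.0, 2.0])
    maf = np.empty((n_mf, dose.shape[1]))
    for m in range(n_mf):
        group = dose[founder_mf == m]      # founders under metafounder m
        maf[m] = group.mean(axis=0) / 2.0  # allele frequency per locus
    return maf
```

This is the piece that would run at the end of every peel-up, feeding the updated `maf` matrix back into the founders' anterior terms for the next round.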
@gregorgorjanc Thank you for summarizing this! There is just one point I would like to clarify. In steps 2 and 3 of the "correct" state, you mentioned that we would update the estimate of the alternative allele probability every peeling cycle and use the updated allele probability to re-estimate the anterior terms. But, as we discussed, the information contained in the updated alternative allele probability is the same as the information contained in the anterior terms after each peeling cycle. If we re-estimate the anterior terms based on the updated alternative allele probability, they would be the same as before the re-estimation. So we would probably only do the estimation at the very beginning of the whole peeling process (for peeling accuracy) and the re-estimation at the very end of the peeling process (for a more accurate alternative allele probability output).
@XingerTang right, I keep forgetting that with the addition of metafounders the founders of the new internal pedigree are the metafounders, which are “parents” of all our actual founding individuals! Let’s see … so, these metafounders will have anterior, penetrance, and “posterior” terms. When we have a starting allele prob (passed by the user or estimated from the data) we should use that for the anterior term of the metafounder(s). Then we peel down and up the pedigree. Once we come up, we will have estimated individual genotype probabilities for the metafounder(s) by combining the anterior and “posterior” terms (the “posterior” term will collect all the information from all descendants of each metafounder), while penetrance will always be unknown for metafounders (unless we have some prior information). These estimated individual genotype probabilities for the metafounder(s) are in fact estimated base population genotype probabilities and we can simply convert these to estimate the base population allele frequency (possibly for more than one metafounder). Having this estimate, we can update the anterior term of the metafounder(s) and repeat peeling down and up. There will be a cycle/loop of information flow, so we will have to test how it works in terms of accuracy and runtime till convergence (we might need to add an actual convergence metric!). How does this sound?
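The loop described above, with a simple convergence metric, might look roughly like this; all names are illustrative, and the stopping rule is one possible choice for the convergence metric mentioned, not something AlphaPeel currently has:

```python
import numpy as np

def mf_update(anterior, posterior):
    """One metafounder update: combine the anterior and "posterior" terms
    (penetrance is unknown, i.e. flat, so it drops out), convert the
    resulting genotype probabilities into an allele frequency, and rebuild
    the anterior term from it.
    anterior, posterior: (n_loci, 3) arrays."""
    g = anterior * posterior                       # combine the two terms
    g /= g.sum(axis=1, keepdims=True)              # normalise per locus
    p = g @ np.array([0.0, 0.5, 1.0])              # allele freq per locus
    # new anterior: Hardy-Weinberg genotype probabilities at frequency p
    new_anterior = np.stack([(1 - p) ** 2, 2 * p * (1 - p), p ** 2], axis=1)
    return p, new_anterior

def converged(p_old, p_new, tol=1e-6):
    """Stop when the base allele frequency no longer moves between rounds."""
    return bool(np.max(np.abs(p_new - p_old)) < tol)
```

Each peel-down/peel-up round would call `mf_update` once per metafounder and stop when `converged` holds, which makes the extra runtime cost directly measurable as a round count.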
@gregorgorjanc Sure, it sounds doable.
So we can change allele freq in the founders (not setting to 0.5 and not estimating from the data)