AlphaGenes / AlphaPeel

AlphaPeel: calling, phasing, and imputing genotype and sequence data in pedigrees
MIT License

Formula modification for the estimation of recombination rate #162

Open · AprilYUZhang opened 2 months ago

AprilYUZhang commented 2 months ago

In the paper, the original formula is as follows: Screenshot 2024-04-05 at 15 58 23

r: the recombination rate; i: the i-th individual; j: the j-th locus; seg_i,j: the segregation status of the i-th individual at the j-th locus.

Two points are ambiguous.

  1. How many loops does it involve? If we regard looking for recombination events as comparing the difference in segregation between two adjacent loci, there should be two loops; otherwise, the two loci are not adjacent. However, if we accept that there should be two loops, we cannot account for the loss of p(seg_j,j+1). That is the contradiction.
  2. What is the segregation probability? Should it be a single float or a 1x4 vector? According to the results, it should be a 1x4 vector, so I think the indicator function is more likely a 4x4 matrix (which is why I wrote indicator function A; a sketch of this matrix is given at the end of this comment). We can also regard the segregation probability as its maximum value, so that seg_i,j is the segregation status corresponding to the maximum segregation probability. Based on this explanation, the issue seems to be resolved.

Further, is the conditional probability necessary? Can we simplify it? In my view, the recombination rate is similar to the genotype error rate: when the peeling process is done, these parameters are updated from the output and then fed into the next cycle. So they can be calculated afterwards, not inside the peeling itself. In other words, once the peeling process has completed, the calculation of the recombination rate should focus on adjacent loci.

Another interesting interpretation is that this recombination rate is not the total recombination rate but rather the recombination rate per locus. The segregation statuses of two adjacent loci have 16 possible combinations, and this formula sums over those 16 possibilities. Screenshot 2024-04-05 at 16 20 36 But it still loses the p(seg_j,j+1). From this perspective, aside from the problem of "p(seg_j,j+1)" and the conditional probability, I came up with an alternative.

Here, segregation status means one of "pp", "mp", "pm", "mm".
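
For concreteness, this is the 4x4 matrix that the code later in this thread uses as the indicator function A, with rows and columns ordered pp, pm, mp, mm (whether it matches the screenshot exactly is an assumption). Each entry counts the recombination events needed to move between the two statuses; the anti-diagonal entries are 2 because both the paternal and the maternal origin switch:

$$
A = \begin{bmatrix} 0 & 1 & 1 & 2 \\ 1 & 0 & 2 & 1 \\ 1 & 2 & 0 & 1 \\ 2 & 1 & 1 & 0 \end{bmatrix}
$$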

AprilYUZhang commented 2 months ago

The formula I modified is as follows: Screenshot 2024-04-05 at 16 26 59

It is easy to see that I keep the basic frame of the second formula, remove the conditional probability, and multiply by p(seg_i,j+1). This formula can be converted to the following one: Screenshot 2024-04-05 at 16 59 46 Screenshot 2024-04-05 at 17 49 29

AprilYUZhang commented 2 months ago

I used the example in the branch "feat_recomb_new" to test whether the function works. Screenshot 2024-04-05 at 17 20 20

The true recombination events are: Screenshot 2024-04-05 at 17 25 43 The columns are the gaps between adjacent loci, and the rows are individuals.

My code: Screenshot 2024-04-05 at 17 34 41

The output matrix is equal to the "true recombination event" matrix.

Note that the default segregation probability is [0.25, 0.25, 0.25, 0.25], so for two adjacent loci that both hold the default, the basic formula would output 1. In theory that is correct, but in practice we prefer to regard them as having the same segregation status.

Loci 1 and 2 of individual M2 have a similar problem.

Screenshot 2024-04-05 at 17 42 41

I think we should focus on the change between loci, so I set the recombination probability to 0 for all locus pairs with completely consistent segregation probabilities.

XingerTang commented 2 months ago

@AprilYUZhang Thank you for the implementation! The result looks good!

I want to share a few thoughts on the first equation given in the paper that might answer some of your questions. The equation is as follows:

$$
\gamma = \frac{1}{n} \sum_i \sum_{seg_{i,j}} \sum_{seg_{i,j+1}} I(seg_{i,j} \neq seg_{i,j+1}) \, p(seg_{i,j} \mid seg_{i,j-1}) \, p(seg_{i,j}, g_i, g_f, g_m) \, p(seg_{i,j+1} \mid seg_{i,j+2}).
$$

Here, every term of the sum, including $p(seg_{i,j} \mid seg_{i,j-1})$, is a scalar. The equation gives the average recombination rate between loci $j$ and $j+1$ across all the individuals. When the calculation of the recombination rate is implemented, all of the segregation probabilities at the current step of the iterative algorithm are known. For each individual $i$, the variables $seg_{i,j}$ and $seg_{i,j+1}$ can each be one of $\{pp, pm, mp, mm\}$, so in total there are $16$ terms to sum. However, $I(seg_{i,j} \neq seg_{i,j+1})$ means that only the $14$ terms in which the segregation statuses at the two loci differ are taken into account. The two conditional probabilities $p(seg_{i,j} \mid seg_{i,j-1})$ and $p(seg_{i,j+1} \mid seg_{i,j+2})$ can be calculated from the segregation information at loci $j-1$, $j+1$, and $j+2$ at the current iteration, together with the recombination rate between loci $j-1$ and $j$ and the rate between $j+1$ and $j+2$ calculated at the previous iteration. The term $p(seg_{i,j}, g_i, g_f, g_m)$ is the segregation probability for the specific $seg_{i,j}$.
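
As a concrete illustration of the 16-term sum for one individual at one locus pair, here is a minimal sketch; the uniform toy probabilities and the dictionary layout are illustrative assumptions, not taken from the AlphaPeel code:

```python
import itertools

statuses = ["pp", "pm", "mp", "mm"]

# Toy inputs for one individual i at loci j and j+1; in the real algorithm
# these scalars come from the current iteration of peeling.
p_seg = {s: 0.25 for s in statuses}        # p(seg_{i,j}, g_i, g_f, g_m)
p_cond_prev = {s: 0.25 for s in statuses}  # p(seg_{i,j} | seg_{i,j-1})
p_cond_next = {s: 0.25 for s in statuses}  # p(seg_{i,j+1} | seg_{i,j+2})

# 4 x 4 = 16 (seg_j, seg_{j+1}) combinations; the indicator
# I(seg_{i,j} != seg_{i,j+1}) keeps only the 14 unequal ones.
contribution = sum(
    p_cond_prev[a] * p_seg[a] * p_cond_next[b]
    for a, b in itertools.product(statuses, repeat=2)
    if a != b
)
print(contribution)  # this individual's contribution before averaging over n
```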

The reason why $p(seg_{i,j} \mid seg_{i,j+1})$ does not appear is that it involves the recombination rate, which is exactly what we are calculating. This sophisticated equation allows the propagation of information across iterations, so we can reuse the results found before, and also across all loci, so that we can determine the recombination rate with knowledge of all the neighboring loci.

Your algorithm for the recombination rate is correct, but it assumes that we have full knowledge of the genotypes, which is impossible in practice. It is a good interpretation of the data we have, but it doesn't infer anything more than what we currently have. Depending on whether our goal is another statistic of the current data or a better recombination rate estimate to improve the algorithm, we may stop here or explore other ways to take into account the uncertainties that lie in the data.

AprilYUZhang commented 2 months ago

@XingerTang I agree that the true value cannot be estimated in practice. Following your explanation, although some issues still exist, whether we need to simplify it comes down to whether it must be part of the iteration. At the same time, we should add this estimation in "Peeling.py" or "PeelingUpdates.py". I prefer to estimate directly from the current segregation probability; in other words, r should be updated in "PeelingUpdates.py", like the genotype error rate, etc. But that does not mean the calculation of recombination ignores the information of adjacent loci and the pedigree: when we calculate the segregation probability, that information is already taken into account. In theory, the more accurate the segregation probability, the more accurate the recombination rate.

Or this new formula could look like this:

Screenshot 2024-04-08 at 17 53 17

where k ranges over pp, pm, mp, mm.

Due to

Screenshot 2024-04-08 at 18 00 52

So

Screenshot 2024-04-08 at 18 14 16

And due to

Screenshot 2024-04-08 at 18 25 33

So the formula would have a more and more complex expression if we continued expanding it.

In total, I think the segregation probability is a "directly estimated" parameter in the iteration, while the recombination rate should be an "indirectly estimated" parameter that can be estimated from the segregation probability.

AprilYUZhang commented 2 months ago

I tested again using the "accuracy_test". It supports your view. I read the true segregation probabilities from "true-seg_prob.txt" and the estimated segregation probabilities from "multi_seq_file/.seg_prob.txt". I also designed two new recombination calculations. One is

Screenshot 2024-04-08 at 16 37 04

excluding the default value. The other is the difference between adjacent loci.


```python
import numpy as np

# Method one: expected number of recombination events via the indicator matrix
def recom_f1(seg_prob_array, nInd, nLociAll):
    default_array = np.array([0.25, 0.25, 0.25, 0.25])
    # Indicator matrix: entry (a, b) is the number of recombination events
    # needed to move from segregation status a to segregation status b
    indicator = np.array([[0, 1, 1, 2],
                          [1, 0, 2, 1],
                          [1, 2, 0, 1],
                          [2, 1, 1, 0]])
    # initialize the recombination event array
    recom = np.empty((nInd, nLociAll - 1))
    for k in range(nLociAll - 1):
        for j in range(nInd):
            # avoid the default value distorting the recombination events
            if np.all(seg_prob_array[j, k] == default_array) and np.all(seg_prob_array[j, k + 1] == default_array):
                recom[j, k] = 0
            else:
                recom[j, k] = np.dot(np.dot(seg_prob_array[j, k], indicator), seg_prob_array[j, k + 1])
    return recom

# Method two: half the L1 difference between adjacent segregation probabilities
def recom_f2(seg_prob_array, nInd, nLociAll):
    # initialize the recombination event array
    recom = np.empty((nInd, nLociAll - 1))
    for k in range(nLociAll - 1):
        for j in range(nInd):
            recom[j, k] = np.sum(np.abs(seg_prob_array[j, k] - seg_prob_array[j, k + 1])) / 2
    return recom
```
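
For reference, here is a minimal usage sketch of the two functions on a toy array; the shape convention `(nInd, nLoci, 4)` and the values are assumptions for illustration only:

```python
import numpy as np

# 2 individuals, 3 loci, 4 segregation statuses per locus
seg_prob_array = np.array([
    [[1.00, 0.00, 0.00, 0.00], [1.00, 0.00, 0.00, 0.00], [0.00, 1.00, 0.00, 0.00]],
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25], [0.50, 0.50, 0.00, 0.00]],
])

print(recom_f1(seg_prob_array, nInd=2, nLociAll=3))  # indicator-matrix method
print(recom_f2(seg_prob_array, nInd=2, nLociAll=3))  # adjacent-difference method
```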
AprilYUZhang commented 2 months ago

| Data | Method | Recombination events | Total run time | Recombination rate |
| --- | --- | --- | --- | --- |
| True seg | One | 1636 | 6.2666 s | 0.000409 |
| Estimated seg | One | 939,659 (rec_prob>0) | 6.3989 s | 0.0153 |
| Estimated seg | Second | 496,372 (rec_prob>0) | 4.6847 s | 0.000610 |

It looks bad, so I looked through the details. The 201st individual has a recombination at the 1717th locus, and the results are as follows:

| | loc-1 | loc | loc+1 | loc+2 |
| --- | --- | --- | --- | --- |
| seg_prob_array_true | [1. 0. 0. 0.] | [1. 0. 0. 0.] | [0. 1. 0. 0.] | [0. 1. 0. 0.] |
| seg_prob_array | [2.875e-01 7.112e-01 4.000e-04 9.000e-04] | [3.330e-01 6.657e-01 5.000e-04 8.000e-04] | [3.796e-01 6.192e-01 5.000e-04 7.000e-04] | [3.796e-01 6.192e-01 5.000e-04 7.000e-04] |

For this point, the true recombination probability is 1.0, the recombination probability from Method 1 is 0.46257018, and the recombination probability from Method 2 is 0.04659999999999998.
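
These two numbers can be reproduced directly from the printed estimated vectors at loc and loc+1; a quick check using the indicator matrix from the code above:

```python
import numpy as np

indicator = np.array([[0, 1, 1, 2],
                      [1, 0, 2, 1],
                      [1, 2, 0, 1],
                      [2, 1, 1, 0]])
loc = np.array([3.330e-01, 6.657e-01, 5.000e-04, 8.000e-04])
loc_next = np.array([3.796e-01, 6.192e-01, 5.000e-04, 7.000e-04])

print(loc @ indicator @ loc_next)          # ~0.4626, Method 1
print(np.sum(np.abs(loc - loc_next)) / 2)  # ~0.0466, Method 2
```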

Screenshot 2024-04-08 at 17 10 40

This figure shows the recombination probability for one individual (the 201st). The patterns of the peaks fit the true recombination events.

Screenshot 2024-04-08 at 17 16 44

This figure shows the recombination probability for another individual (the 211th). Although it is not a perfect match, there are fluctuations in the predicted values near each real recombination event. That is expected: the accuracy of the segregation estimates for the first- and second-generation groups is only around 0.1.

But when we look at the results of the 1000th individual, the signal is clear.

Screenshot 2024-04-08 at 17 24 33

From the above results, I am more inclined to think that we should not simply average all recombination probabilities to get the recombination rate, but rather predict the recombination events using the recombination probabilities.

XingerTang commented 2 months ago

> @XingerTang I agree that the true value cannot be estimated in practice. Following your explanation, although some issues still exist, whether we need to simplify it comes down to whether it must be part of the iteration. At the same time, we should add this estimation in "Peeling.py" or "PeelingUpdates.py". I prefer to estimate directly from the current segregation probability; in other words, r should be updated in "PeelingUpdates.py", like the genotype error rate, etc. But that does not mean the calculation of recombination ignores the information of adjacent loci and the pedigree: when we calculate the segregation probability, that information is already taken into account. In theory, the more accurate the segregation probability, the more accurate the recombination rate.
>
> Or this new formula could look like this:
>
> Screenshot 2024-04-08 at 17 53 17 where k ranges over pp, pm, mp, mm. Due to Screenshot 2024-04-08 at 18 00 52 So Screenshot 2024-04-08 at 18 14 16
>
> And due to Screenshot 2024-04-08 at 18 25 33
>
> So the formula would have a more and more complex expression if we continued expanding it.
>
> In total, I think the segregation probability is a "directly estimated" parameter in the iteration, while the recombination rate should be an "indirectly estimated" parameter that can be estimated from the segregation probability.

@AprilYUZhang In my understanding, in the implementation of the equation for the recombination rate proposed in the paper, there shouldn't be any matrix involved. I don't quite understand what $r_{i,j}$ here stands for. Also, the terms you expanded won't be used as part of the calculation. For each term of the sum, $p(seg_{i,j}, g_i, g_f, g_m)$ is the segregation probability at the current iteration, and $p(seg_{i,j+1} \mid seg_{i,j+2})$ can be calculated by

$$
\begin{aligned}
p(seg_{i,j+1} \mid seg_{i,j+2}) &= \frac{p(seg_{i,j+1} \cap seg_{i,j+2})}{p(seg_{i,j+2})} \\
&= \frac{\gamma \, p(seg_{i,j+1} = seg_{i,j+2}) + (1 - \gamma) \, p(seg_{i,j+1} \neq seg_{i,j+2})}{p(seg_{i,j+2})}
\end{aligned}
$$

for which there are several ways to calculate the terms, but none needs further information from the adjacent loci. A similar calculation can also be applied to the other conditional probability.

From your implementation, it seems that the simple algorithm may overestimate the occurrence of recombination events. The equation in the paper involves three probability terms multiplied together, which by intuition would yield a much smaller recombination rate. So I think this algorithm is still worth implementing.

XingerTang commented 2 months ago

> I tested again using the "accuracy_test". It supports your view. I read the true segregation probabilities from "true-seg_prob.txt" and the estimated segregation probabilities from "multi_seq_file/.seg_prob.txt". I also designed two new recombination calculations. One is Screenshot 2024-04-08 at 16 37 04, excluding the default value. The other is the difference between adjacent loci.


I am a bit confused here, as your equation for $r$ doesn't match either of your algorithms. Your first algorithm is your old simple algorithm, and your second algorithm is a modified version of the first, with the indicator replaced by the numerical difference between the two segregation probabilities. Could you explain what the equation for $r$ is for?

AprilYUZhang commented 2 months ago


> @AprilYUZhang In my understanding, in the implementation of the equation for the recombination rate proposed in the paper, there shouldn't be any matrix involved. I don't quite understand what $r_{i,j}$ here stands for. Also, the terms you expanded won't be used as part of the calculation. For each term of the sum, $p(seg_{i,j}, g_i, g_f, g_m)$ is the segregation probability at the current iteration, and $p(seg_{i,j+1} \mid seg_{i,j+2})$ can be calculated by
>
> $$
> \begin{aligned}
> p(seg_{i,j+1} \mid seg_{i,j+2}) &= \frac{p(seg_{i,j+1} \cap seg_{i,j+2})}{p(seg_{i,j+2})} \\
> &= \frac{\gamma \, p(seg_{i,j+1} = seg_{i,j+2}) + (1 - \gamma) \, p(seg_{i,j+1} \neq seg_{i,j+2})}{p(seg_{i,j+2})}
> \end{aligned}
> $$
>
> for which there are several ways to calculate the terms, but none needs further information from the adjacent loci. A similar calculation can also be applied to the other conditional probability.

It sounds good. My formula is not exactly a new formula; it is just a new way of expressing it, using the formula predefined in the paper to extend my original one. I just want to show more clearly that, for every individual and locus, deciding whether the segregation probabilities of two loci are the same is not as simple as 0 if they are the same and 1 otherwise. For example, for the $p(seg_{i,j+1} = seg_{i,j+2})$ you mentioned, what is your plan to calculate it? These are not single observed values but vectors of probabilities. So I guess you mean $\sum_k p(seg_{i,j+1,k} = seg_{i,j+2,k})$, where $k$ ranges over pp, pm, mp, mm. That raises another question: for one individual and two adjacent loci with $p(seg_{i,j+1}) = [0.1, 0.2, 0.3, 0.4]$ and $p(seg_{i,j+2}) = [0.3, 0.1, 0.3, 0.3]$, how can we compare the difference? From a mathematical perspective it is simple: we just use $p(seg_{i,j+1}) - p(seg_{i,j+2})$, which is the second method I explored, but its total recombination rate is still higher than the real value.

> From your implementation, it seems that the simple algorithm may overestimate the occurrence of recombination events. The equation in the paper involves three probability terms multiplied together, which by intuition would yield a much smaller recombination rate. So I think this algorithm is still worth implementing.

I agree that three terms will lead to a much smaller recombination rate; however, I don't think the recombination events are overestimated. As my figures show, the probabilities can indicate that recombination events may happen. For example, generations 0 and 1 easily get wrong peaks, but that is due to too little information from their parents rather than to the recombination calculation formula; even if we improved the formula to perfection, this issue would still exist. The recombination rate, however, must be overestimated, because the probability repeats itself many times around each real recombination event.

AprilYUZhang commented 2 months ago

```python
@calculate_time  # timing helper, assumed to be defined elsewhere in the test script
# the method from Xinger
def recom_f3(seg_prob_array, nInd, nLociAll):
    r = 0.0008  # recombination rate from the previous iteration
    default_array = np.array([0.25, 0.25, 0.25, 0.25])
    # Indicator matrix: number of recombination events between statuses
    indicator = np.array([[0, 1, 1, 2],
                          [1, 0, 2, 1],
                          [1, 2, 0, 1],
                          [2, 1, 1, 0]])
    # initialize the recombination event array
    recom = np.empty((nInd, nLociAll - 1))
    for k in range(nLociAll - 1):
        for j in range(nInd):
            # avoid the default value distorting the recombination events
            if np.all(seg_prob_array[j, k] == default_array) and np.all(seg_prob_array[j, k + 1] == default_array):
                recom[j, k] = 0
            else:
                recom[j, k] = np.dot(np.dot(seg_prob_array[j, k], indicator), seg_prob_array[j, k + 1])
    # weight each locus pair by the conditional probabilities of the flanking pairs
    recom_added = np.empty((nInd, nLociAll - 1))
    for j in range(recom.shape[0]):
        for k in range(recom.shape[1]):
            if k == 0:
                # seg_i_jplus2 = (r*recom[j,k+1] + (1-r)*(1-recom[j,k+1])) / seg_prob_array[j,k+2]
                # the division produces inf and nan when there are zeros, so it is temporarily removed
                seg_i_jplus2 = r * recom[j, k + 1] + (1 - r) * (1 - recom[j, k + 1])
                seg_i_j_1 = 1
            elif k == recom.shape[1] - 1:
                # seg_i_j_1 = (r*recom[j,k-1] + (1-r)*(1-recom[j,k-1])) / seg_prob_array[j,k-1]
                seg_i_j_1 = r * recom[j, k - 1] + (1 - r) * (1 - recom[j, k - 1])
                seg_i_jplus2 = 1
            else:
                # seg_i_j_1 = (r*recom[j,k-1] + (1-r)*(1-recom[j,k-1])) / seg_prob_array[j,k-1]
                # seg_i_jplus2 = (r*recom[j,k+1] + (1-r)*(1-recom[j,k+1])) / seg_prob_array[j,k+2]
                seg_i_j_1 = r * recom[j, k - 1] + (1 - r) * (1 - recom[j, k - 1])
                seg_i_jplus2 = r * recom[j, k + 1] + (1 - r) * (1 - recom[j, k + 1])
            recom_added[j, k] = np.nansum(seg_i_jplus2) * recom[j, k] * np.nansum(seg_i_j_1)
    return recom_added
```
I tried to realise your formula; is it correct? If not, let me know. Thank you. The results are as follows:

| Data | Method | Recombination events | Total run time | Recombination rate |
| --- | --- | --- | --- | --- |
| True seg | original | 1636 | 5.0966 s | 0.000409 |
| True seg | based on difference | 1636 | 3.6935 s | 0.000409 |
| True seg | based on Xinger's formula | 1636 | 17.8770 s | 0.000408 |
| Estimated seg | original | 939,659 (rec_prob>0) | 5.0581 s | 0.0153 |
| Estimated seg | based on difference | 496,372 (rec_prob>0) | 3.7187 s | 0.00061 |
| Estimated seg | based on Xinger's formula | 939,635 (rec_prob>0) | 17.9241 s | 0.00656 |

The results do not look good; did I run something wrong?

XingerTang commented 2 months ago

> For example, for the $p(seg_{i,j+1} = seg_{i,j+2})$ you mentioned, what is your plan to calculate it? These are not single observed values but vectors of probabilities.
>
> I agree that three terms will lead to a much smaller recombination rate; however, I don't think the recombination events are overestimated.

One possible way to calculate $p(seg_{i,j+1} = seg_{i,j+2})$ is by noting that both $seg_{i,j+1}$ and $seg_{i,j+2}$ are $4$-dimensional vectors; the angle $\theta$ between the two vectors can be used to quantify the similarity of the two segregation probability vectors and hence serve as an estimate of the probability $p(seg_{i,j+1} = seg_{i,j+2})$.
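
A minimal sketch of this idea, assuming a linear mapping from the angle $\theta$ to a probability (the function name and the exact mapping are assumptions; any monotone map from $[0, \pi/2]$ to $[0, 1]$ could be substituted):

```python
import numpy as np

def p_same_status(seg_a, seg_b):
    # Angle between the two 4-dimensional segregation probability vectors
    cos_theta = seg_a @ seg_b / (np.linalg.norm(seg_a) * np.linalg.norm(seg_b))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    # theta = 0 (same direction) -> 1; theta = pi/2 (orthogonal) -> 0
    return 1.0 - theta / (np.pi / 2)

print(p_same_status(np.array([1.0, 0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0, 0.0])))  # 1.0
print(p_same_status(np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0, 0.0])))  # 0.0
```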

I understand your points, but if the probabilities around the locus where a real recombination event happened sum up to more than $1$, then we need to normalize them to prevent overestimation; the difference in scale between the algorithm's output and the real data is too large to be ignored.

XingerTang commented 2 months ago
> I tried to realise your formula; is it correct? If not, let me know. Thank you. The results are as follows:
>
> | Data | Method | Recombination events | Total run time | Recombination rate |
> | --- | --- | --- | --- | --- |
> | True seg | original | 1636 | 5.0966 s | 0.000818 |
> | True seg | based on difference | 1636 | 3.6935 s | 0.000818 |
> | True seg | based on Xinger's formula | 1636 | 17.8770 s | 0.00081569532128 |
> | Estimated seg | original | 939,659 (rec_prob>0) | 5.0581 s | 0.0307 |
> | Estimated seg | based on difference | 496,372 (rec_prob>0) | 3.7187 s | 0.00121 |
> | Estimated seg | based on Xinger's formula | 939,635 (rec_prob>0) | 17.9241 s | 0.01313 |
>
> The results do not look good; did I run something wrong?

@AprilYUZhang The main difference between your implementation and my understanding is the way the conditional probabilities are calculated; specifically, it is the following line of your algorithm: `recom[j,k]=np.dot(np.dot(seg_prob_array[j,k],indicator),seg_prob_array[j,k+1])`

The problems of the indicator matrix are:

AprilYUZhang commented 2 months ago

Both problems are related to the algorithm for calculating recombinations from segregation probabilities. Biologically, the number of recombination events between two adjacent loci can be up to 2 (one on the paternal and one on the maternal haplotype), so when we calculate the recombination rate, the total number of recombination events is divided by 2 x nIndividual x (nLoci - 1). (Your question reminds me that I made a mistake in the denominator: I used nIndividual x nLoci. I will update the table later.) Take the example you mentioned of p(seg_locus 1) = [0.25, 0.25, 0.25, 0.25] and p(seg_locus 2) = [0.25, 0.25, 0.25, 0.25]: they look the same, but they still imply a high probability of recombination. If the segregation status of locus 1 is "mm" with probability 0.25 and the segregation status of locus 2 is "pp" with probability 0.25, they do have a recombination event. So identical vectors do not mean there is no recombination. As for replacing the indicator matrix with the angle, I will try.
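
With the corrected denominator, the overall rate would be computed roughly as follows (a sketch; `recom` is the per-pair event matrix returned by the functions above):

```python
import numpy as np

def total_recombination_rate(recom, nInd, nLociAll):
    # Up to 2 recombination events per individual per adjacent locus pair,
    # hence the factor of 2 in the denominator
    return np.nansum(recom) / (2 * nInd * (nLociAll - 1))
```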

XingerTang commented 2 months ago

> Take the example you mentioned of p(seg_locus 1) = [0.25, 0.25, 0.25, 0.25] and p(seg_locus 2) = [0.25, 0.25, 0.25, 0.25]: they look the same, but they still imply a high probability of recombination. If the segregation status of locus 1 is "mm" with probability 0.25 and the segregation status of locus 2 is "pp" with probability 0.25, they do have a recombination event.

If the probabilities are truly 0.25, then the occurrence of recombination events is reasonable, but here the number 0.25 is mostly due to uncertainty. I am not sure the plain dot product would work, since it is just your code with the indicator matrix removed, and the computation with the indicator matrix makes more sense, as it captures the similarities between the different segregation statuses. The only problem the indicator matrix causes is that in your implementation you need to divide by 2 to make the bilinear product a proper approximation of a probability.

XingerTang commented 1 month ago

@AprilYUZhang I just noticed that your total recombination events are counted based on rec_prob > 0, which explains why there are so many. Do you have the results for the total recombination events of these methods based on rec_prob > 0.5? I think those numbers should be much smaller.

XingerTang commented 1 month ago

@AprilYUZhang I reimplemented your method in another way. Basically, I calculated the recombination probability of the $i$-th individual between the $j$-th and $(j+1)$-th loci, $p(recomb_{i,j})$, by

$$
p(recomb_{i,j}) =
\begin{bmatrix} p(seg_{i,j} = (0,0)) & p(seg_{i,j} = (0,1)) & p(seg_{i,j} = (1,0)) & p(seg_{i,j} = (1,1)) \end{bmatrix}
A
\begin{bmatrix} p(seg_{i,j+1} = (0,0)) \\ p(seg_{i,j+1} = (0,1)) \\ p(seg_{i,j+1} = (1,0)) \\ p(seg_{i,j+1} = (1,1)) \end{bmatrix}
\tag{1}
$$

where the matrix $A$, given by

$$
A = \begin{bmatrix} 0 & 1 & 1 & 2 \\ 1 & 0 & 2 & 1 \\ 1 & 2 & 0 & 1 \\ 2 & 1 & 1 & 0 \end{bmatrix},
$$

counts the number of recombination events occurring between loci $j$ and $j + 1$.
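
Equation (1) also vectorises over all individuals and locus pairs; a sketch with `np.einsum`, assuming `seg_prob_array` has shape `(nInd, nLoci, 4)`:

```python
import numpy as np

A = np.array([[0, 1, 1, 2],
              [1, 0, 2, 1],
              [1, 2, 0, 1],
              [2, 1, 1, 0]])

def recomb_prob_all(seg_prob_array):
    left = seg_prob_array[:, :-1, :]   # seg probabilities at locus j
    right = seg_prob_array[:, 1:, :]   # seg probabilities at locus j + 1
    # p(recomb_{i,j}) = left[i,j] @ A @ right[i,j] for every (i, j) at once
    return np.einsum('ijk,kl,ijl->ij', left, A, right)
```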

I managed to access the real recombination data at each locus for each individual, so I didn't average the recombination rate across individuals; this way we can see more precisely whether the algorithm correctly estimates the occurrence of recombination events at each locus.

Same as what you've done, I used two datasets for the accuracy measures: the true segregation data and the estimated segregation data. The true segregation data is obtained from the simulation, while the estimated segregation data is obtained by the multi-locus peeling performed in the AlphaPeel accuracy test. In total there are 1,999,000 locus pairs and only 1636 recombination events occurred, with no two recombination events at the same locus. Thus, for each locus pair, there is either no recombination or a single recombination.

Given that we have very few recombination events compared to the total number of locus pairs, I measure the accuracy with two metrics, recall and precision. Recall is the proportion of the actual positives in the original data that are correctly predicted positive, while precision is the proportion of the predicted positives that are actually positive.

As no two recombination events occurred at the same locus, the estimated recombination probabilities generated by (1) are in the range of 0 to 1. So it is natural to set some number between 0 and 1 as the threshold for predicting whether a recombination event occurred at a particular locus. For example, if the threshold is 0.5, then we would predict a recombination event at every locus whose estimated probability is greater than or equal to 0.5, and no recombination event at the others.

I plotted the precision-recall curve for threshold values ranging from 0 to 1.
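
For reference, such a curve can be computed with scikit-learn; a sketch where the array names and toy values are assumptions (`true_events` stands for the 0/1 matrix of real recombinations, `recomb_probs` for the per-pair estimates from (1)):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Toy stand-ins for the real per-(individual, locus-pair) data
true_events = np.array([[0, 1, 0], [0, 0, 1]])
recomb_probs = np.array([[0.1, 0.8, 0.2], [0.05, 0.3, 0.9]])

# One label and one score per locus pair, flattened across individuals
precision, recall, thresholds = precision_recall_curve(true_events.ravel(), recomb_probs.ravel())

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```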

For the true segregation data input, the result is given by the following

2_true_data

While the recall is always 1, which indicates that all the positives have been successfully predicted, the precision is not satisfying, as the highest precision that can be achieved is only 0.004. This indicates that many loci without a recombination event were classified as having one, so often that 0.996 of the predicted recombinations are wrong.

As you discussed before, this is probably caused by the $[0.25, 0.25, 0.25, 0.25]$ segregation inputs from the first generation, so I recalculated the precision-recall curve using only the inputs from the second generation onward of the true segregation data, which is the following

2_true_data_no_first_gen

Now we have the desired result: a precision of 1 is also achieved for some thresholds. However, this is for the true segregation data; we may not perform as well when using the estimated segregation data. So I recalculated the precision-recall curve for the full estimated segregation data, which is the following: 2_est_data

This indicates that the highest precision we can possibly achieve is 0.002 across the different threshold values, which means that 0.998 of the estimated recombination events are always wrong. I also did the same calculation for the estimated data excluding the first generation:

2_est_dat_no_first_gen

In this case we can reach a relatively higher (but still very low) precision than in the previous case, but only when the recall is very low, which indicates that this algorithm cannot separate the true positives from the other cases very well.

XingerTang commented 1 month ago

Given that the above method cannot find the recombination events without misclassifying loci that have no recombination events as having them, and loci with recombination events as having none, I ran several regressions on the true segregation data and the estimated segregation data to find a better matrix $A$ for (1). Most of them suggest that the anti-diagonal should be negative. The matrix $A$ generated from the linear regression on the true segregation data is given by

$$
A = \begin{bmatrix} 0 & 1 & 1 & -2 \\ 1 & 0 & -2 & 1 \\ 1 & -2 & 0 & 1 \\ -2 & 1 & 1 & 0 \end{bmatrix},
$$

which is counterintuitive at first sight. But the $-2$ on the anti-diagonal can be useful, given that it is very rare for two recombinations to occur at the same time; and with this matrix, segregation probabilities equal to $[0.25, 0.25, 0.25, 0.25]$ give a recombination probability of $0$, which is what we expect in that case.
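
A quick numerical check of that property (the uniform default vector is annihilated because every row of the regressed matrix sums to 0):

```python
import numpy as np

A_neg = np.array([[ 0,  1,  1, -2],
                  [ 1,  0, -2,  1],
                  [ 1, -2,  0,  1],
                  [-2,  1,  1,  0]])
u = np.array([0.25, 0.25, 0.25, 0.25])

print(u @ A_neg @ u)  # 0.0: default [0.25]*4 inputs no longer contribute
```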

For the full true segregation data, the precision-recall curve is given by the following: -2_true_data. It suggests that the most ideal case can be achieved (precision and recall both equal to 1 at the same time), i.e., every recombination event is classified correctly.

For the full estimated segregation data, the precision-recall curve is given by the following: -2_est_data

The precision-recall curve for the segregation data excluding the first generation is the same as this one. We can see that it is better than the previous case: for the same recall, we can now achieve a higher precision, which means we can classify the true recombination events better. So for the implementation, this choice of $A$ is the better one.

However, even for this choice of $A$, we can't find a threshold value at which both precision and recall are high enough. Thus, I think it is still worth exploring methods based on the segregation probabilities of more of the neighboring loci. What do you think? @AprilYUZhang

AprilYUZhang commented 1 month ago

@XingerTang, I think the reason why your first result has low precision is mainly the effect of the default value ([0.25, 0.25, 0.25, 0.25]). For this problem, my solution was to set the recombination probability to 0 if x1 and x2 == [0.25, 0.25, 0.25, 0.25]. Your method is good, and I also tried this before. I think the way to get a high ROC is: 1. the segregation probability may still have room to become more accurate; 2. we can use the recombination probability curve to estimate the recombination events. Now, Gregor and I are trying to use the Viterbi algorithm for iterative peeling to improve accuracy, as in this issue.

XingerTang commented 1 month ago


@AprilYUZhang I tried the same method on the data without the first generation, so most of the $[0.25, 0.25, 0.25, 0.25]$ inputs are already gone, but the maximum precision that can be achieved is still only about 0.09 with 2 on the anti-diagonal.
For the use of the recombination probability curve, we would first need recombination probabilities that are accurate enough. Given the low precision, I'm afraid it is not that easy to estimate the occurrences of recombination events based on the current method of calculating the recombination probabilities.

AprilYUZhang commented 1 month ago

> @AprilYUZhang I tried the same method on the data without the first generation, so most of the [0.25, 0.25, 0.25, 0.25] inputs are already gone, but the maximum precision that can be achieved is still only about 0.09 with 2 on the anti-diagonal. For the use of the recombination probability curve, we would first need recombination probabilities that are accurate enough. Given the low precision, I'm afraid it is not that easy to estimate the occurrences of recombination events based on the current method of calculating the recombination probabilities.

Which data did you use, the true segregation or the estimate by AlphaPeel? Sorry, I don't understand your meaning. Do you mean that in principle you agree, but in practice you disagree? Or do you totally disagree, with the low precision being one of the reasons?

XingerTang commented 1 month ago

> Which data did you use, the true segregation or the estimate by AlphaPeel? Sorry, I don't understand your meaning. Do you mean that in principle you agree, but in practice you disagree? Or do you totally disagree, with the low precision being one of the reasons?

I used the estimated segregation data generated by AlphaPeel without the first generation, in which most of the [0.25, 0.25, 0.25, 0.25] inputs are excluded; applying the matrix with 2 on the anti-diagonal, the highest possible precision is about 0.09 among all possible thresholds. So I am inclined to think that if we want to improve the estimation of the recombination events by using the recombination probability curve, we first need to achieve a higher precision, which our current method does not support.