AprilYUZhang opened 5 months ago
@gregorgorjanc
One way to implement the Viterbi algorithm on the segregation probabilities is to treat the loci as time steps and the different segregation statuses as states, with a transmission function that depends on the recombination rate. However, the problems with this implementation are:
The other way to implement the algorithm is to first select a few recombination hotspots using the recombination rates, then apply the algorithm with the recombination hotspots $k$ as the time steps and the estimated segregation statuses (pp, pm, mp, mm) to the right of $k$ as the states, letting the transmission depend on the estimated recombination rate. Since the recombination hotspots are selected based on the recombination rates, they would have higher recombination rates, enabling the prediction of recombination events around the hotspots. Since the segregation status only depends on the previous segregation status in the ideal case (the selected sites are separated by a long distance), this implementation can also solve the first problem above. The probability of observing $Y_k$ given the state $X_k$ can be calculated from a quantity measuring the difference between the predicted segregation status and the real segregation data. The possible difficulties of this implementation are:
@XingerTang as you know, having good estimates of recombination rates, and hence hotspots, is hard, so we cannot start from something we don't generally have.
I suggest that we first just do the naive call and see how it looks. I suspect that it should work well, but we will have some false positives (=switch errors). Do we need an alternative algorithm that would smooth out these short recombination blips (so that we get long stretches of haplotypes - we know there are only a few recombinations between a parent and a progeny!) and create clear cuts from one parent's haplotype to another?
@gregorgorjanc Uni is trying to implement the naive-call function for the AlphaPeel format data. I experimented on my side with the segregation data in my custom data format just to see the result quickly. The algorithm is implemented in a very simple way: it takes the segregation status with the highest probability, and if it encounters multiple probabilities with the same highest value, it takes the first index (so it would choose [1, 0, 0, 0] for [0.25, 0.25, 0.25, 0.25]). I performed the experiment on the segregation data generated by multi-locus peeling, with 2000000 loci being evaluated in total.
For the measurement of performance, the sum of the absolute values of the differences between the predicted segregation data and the true segregation data is used. The real number of inaccurate calls is obtained by dividing this absolute sum by $2$, as each inaccurate prediction contributes $2$ differences.
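A minimal sketch of this measure (illustrative names; `called_seg` and `true_seg` are assumed to be the one-hot called and true segregation arrays of shape (nLoci, 4)):

```python
import numpy as np

def count_wrong_calls(called_seg, true_seg):
    # each wrongly called locus contributes 2 to the absolute sum
    # (a 1 in the wrong state and a 0 in the correct state), so divide by 2
    abs_sum = np.abs(called_seg - true_seg).sum()
    return abs_sum / 2
```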
On the whole data set, the absolute sum of differences is 1245326.
Excluding the first generation, which is full of [0.25, 0.25, 0.25, 0.25], the absolute sum of differences is 645326 ($\frac{645326}{3200000} \approx 0.20 \implies 20$% of the predicted segregation statuses are wrong).
@gregorgorjanc For smoothing out short recombination blips, a common practice is to use windows, but then it is difficult to locate the exact position of the recombination event. I'm still working on the regression matrix that uses the segregation probabilities of the neighbouring loci to predict the recombination rate; if that works well, I expect it can smooth out at least some short recombination blips.
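As a concrete (hypothetical) illustration of the window idea, a simple majority vote over a sliding window of called states would remove isolated blips, at the cost of blurring the exact breakpoint position; the function name and window size here are assumptions, not part of the discussed implementation:

```python
import numpy as np

def window_smooth(called_states, window=5):
    # called_states: 1-D integer array of called segregation states (0..3) along one chromosome
    # replace each call by the majority call in a window centred on it
    half = window // 2
    smoothed = called_states.copy()
    for i in range(len(called_states)):
        lo, hi = max(0, i - half), min(len(called_states), i + half + 1)
        counts = np.bincount(called_states[lo:hi], minlength=4)
        smoothed[i] = np.argmax(counts)
    return smoothed
```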
As the generation number increases, the performance of the naive-calling algorithm improves, except for the last generation.
The following values are already divided by 2.
| Generation | Absolute sum of differences |
|---|---|
| 2 | 537088 |
| 3 | 42458 |
| 4 | 29474 |
| 5 | 36306 |
This is probably a result of the improved accuracy of the segregation probabilities estimated by AlphaPeel for the middle generations.
The loci with wrongly called segregation statuses often appeared in intervals. Such intervals (including those of length 1) occurred $1562$ times across the loci from generation 2 to generation 5.
In this block, we only consider generations 2 to 5.
The occurrences of the intervals are distributed evenly across all individuals of all generations. In particular, they do not tend to be close to the loci where recombination events happen. The histogram below illustrates how the intervals are distributed among the loci, with the left side representing the loci of early generations and the right side representing the loci of later generations.
However, the length of the intervals changes across generations; below is a scatter plot of the lengths of the intervals against the sequence of the intervals. The left side still represents the early generations and the right side represents the later generations.
@XingerTang this is my code for "Naive-Convert" and "Viterbi-convert".

Naive-Convert:
```python
import numpy as np

def naive_convert(seg_prob_array):
    # call the state with the highest probability at each locus
    # (np.argmax returns the first index on ties)
    seg_array = np.zeros(seg_prob_array.shape)
    for i in range(seg_prob_array.shape[0]):
        for j in range(seg_prob_array.shape[1]):
            seg_array[i, j, np.argmax(seg_prob_array[i, j])] = 1
    return seg_array
```
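An equivalent vectorized form might look like the following sketch (the function name is illustrative; np.argmax also returns the first index on ties, so the tie-breaking matches the loop above):

```python
def naive_convert_vec(seg_prob_array):
    # one-hot encode the per-locus argmax over the 4 segregation states
    idx = np.argmax(seg_prob_array, axis=2)
    seg_array = np.zeros(seg_prob_array.shape)
    np.put_along_axis(seg_array, idx[..., None], 1, axis=2)
    return seg_array
```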
Viterbi-convert
```python
# use an equal recombination rate to generate the transmission vector for the loci;
# if we have distances, we can imitate the Li and Stephens model's transition function and add distance
transmission = np.ones(seg_prob_array.shape[1]) * (1636 / (800 * 2000))

def Viterbi(pointSeg, transmission):
    # segregation state ordering: pp, pm, mp, mm
    nLoci = pointSeg.shape[0]
    seg = np.full(pointSeg.shape, 0.25, dtype=np.float32)
    seg_arg = np.full(pointSeg.shape, 0, dtype=np.int16)
    for i in range(nLoci):
        for j in range(4):
            seg[i, j] = pointSeg[i, j]
    tmp = np.full(4, 0, dtype=np.float32)
    new = np.full(4, 0, dtype=np.float32)
    new_arg = np.full(4, 0, dtype=np.int16)
    prev = np.full(4, 0.25, dtype=np.float32)
    for i in range(nLoci):
        # create the transition matrix from the recombination rate for this interval
        e = transmission[i - 1]
        e2 = e ** 2
        e1e = e * (1 - e)
        e2i = (1.0 - e) ** 2
        transition = np.array([[e2i, e1e, e1e, e2],
                               [e1e, e2i, e2, e1e],
                               [e1e, e2, e2i, e1e],
                               [e2, e1e, e1e, e2i]])
        # for the first locus, regard [0.25, 0.25, 0.25, 0.25] as the initial state
        if i != 0:
            for j in range(4):
                tmp[j] = prev[j] * pointSeg[i - 1, j]
        else:
            for j in range(4):
                tmp[j] = prev[j] * 0.25
        # normalise
        sum_j = 0
        for j in range(4):
            sum_j += tmp[j]
        if sum_j == 0:
            sum_j = 1
        for j in range(4):
            tmp[j] = tmp[j] / sum_j
        # segregation state ordering: pp, pm, mp, mm
        # forward step: keep the best predecessor for each state
        for j in range(4):
            new[j] = np.max(tmp * transition[j] * seg[i, j])
            new_arg[j] = np.argmax(tmp * transition[j] * seg[i, j])
        for j in range(4):
            seg_arg[i, j] = new_arg[j]
        prev = new
    # there are four ways we can track back; use the most likely state for the final locus
    # (taken from the segregation probability)
    prev_index = np.argmax(pointSeg[-1])
    # backtrack
    track_rec = np.zeros(seg_arg.shape[0] - 1, dtype=np.int16)
    a = np.zeros(seg_arg.shape[0], dtype=np.int16)
    for i in range(seg_arg.shape[0] - 1, -1, -1):
        prev_index = seg_arg[i][prev_index]
        a[i] = prev_index
    # find the breakpoints: a state change between adjacent loci marks a recombination
    for i in range(seg_arg.shape[0] - 1):
        if a[i] != a[i + 1]:
            track_rec[i] = 1
    return track_rec

# run for every individual
seg_a = np.zeros((1000, 1999), dtype=np.int16)
for individual in range(seg_a.shape[0]):
    track_rec = Viterbi(seg_prob_array[individual], transmission)
    seg_a[individual] += track_rec
```
@gregorgorjanc and @XingerTang, there is good news and bad news. The good news is that Naive-Convert has better performance than I expected. Basically, the results of the Viterbi algorithm are equal to those of Naive-Convert, but with more running time.

| Algorithm | Run time | Accuracy of recombination probability |
|---|---|---|
| Original | 5.06s | 0.9706 |
| Naive-convert | 1.53s+5.01s | 0.9982 |
| Viterbi-convert | 53.48s | 0.9981 |

The details for every generation:

| Generation | Original | Naive-convert | Viterbi-convert | one of individuals |
|---|---|---|---|---|
| 1 | 1.0 | 1.0 | 1.0 | |
| 2 | 0.9280 | 0.9970 | 0.9969 | |
| 3 | 0.9739 | 0.9978 | 0.9977 | |
| 4 | 0.9777 | 0.9982 | 0.9980 | |
| 5 | 0.9734 | 0.9982 | 0.9981 | |
@AprilYUZhang I'm wondering how you measure the accuracy of the recombination probability
Just 1 - difference (one minus the mean absolute difference), maybe not very accurate.

```python
def cor2(a, b):
    # 1 minus the mean absolute difference between the two arrays
    # (despite the variable name, this is not a cosine similarity or correlation)
    cos_theta = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            cos_theta += np.abs(a[i, j] - b[i, j])
    return 1 - cos_theta / (a.shape[0] * a.shape[1])
```
@AprilYUZhang Could you measure the accuracy of the prediction of the segregation statuses for both methods?
The Viterbi method just uses the Original's segregation probabilities, so I skipped it. The first one is Naive-convert, the second one is Original.
I also notice that the segregation probabilities and the recombination rate give opposite results. I think it should be attributed to too much noise in the segregation probabilities.
@AprilYUZhang By calling the segregation status, it definitely would lose some information, which could contribute to the decrease in accuracy. Your Viterbi implementation can be easily modified to record the predicted segregation status (just return the list a
). Could you check the result again with the Viterbi predicted segregation statuses to compare the two implementations for calling the segregation status?
@AprilYUZhang Calling the segregation status definitely loses some information, which could contribute to the decrease in accuracy. Your Viterbi implementation can be easily modified to record the predicted segregation status (just return the list `a`). Could you check the result again with the Viterbi-predicted segregation statuses to compare the two implementations for calling the segregation status?
In fact, the Viterbi algorithm could not output a new segregation probability, because it is just a max-product; I guess it shouldn't be regarded as a new segregation probability.
Isn't the argmax state (segregation status) the most likely state (segregation status) given by the Viterbi?
I don't think we should focus on this, because the most likely state for every locus is just based on the previous one. If we look at this number alone, it is meaningless.
It is also based on all the previous loci, because the probabilities are passed on through each multiplication. The reason why I still insist on checking the segregation status is that sometimes the accuracy of the recombination predictions can lie: there may be a recombination, but on the wrong haplotype.
It seems that the Viterbi implementation preserves much more information than the naive-converted states; that's good news!
@gregorgorjanc and @XingerTang To compute precision and recall, I converted the original recombination probabilities to discrete values by logistic regression. (I didn't split into train and test datasets, so there is a risk of overfitting.) The plot shows what they are: the green box includes the purple cherries identified correctly by the classifier, while the oranges identified correctly by the classifier are in the blue box. The yellow box contains cherries that were incorrectly identified as oranges, and the orange box contains the opposite, oranges incorrectly identified as cherries.
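A rough sketch of that conversion under the stated setup (hypothetical names: `rec_prob` are the flattened per-locus recombination probabilities and `rec_true` the 0/1 true recombination indicators; the same data are used for fitting and prediction, so the overfitting caveat applies):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic_calls(rec_prob, rec_true):
    # fit on the full data set (no train/test split) and call events at the default 0.5 threshold
    clf = LogisticRegression()
    clf.fit(rec_prob.reshape(-1, 1), rec_true)
    return clf.predict(rec_prob.reshape(-1, 1))
```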
Accuracy is all the correct results divided by the total. The recall can be regarded as how many cherries are identified correctly, and the precision is how many of the results that are considered cherries are correct.

Table 1: Performance of methods to estimate recombination events

| Method | Accuracy | Precision | Recall |
|---|---|---|---|
| Original+logistic | 0.9992 | 0.8064 | 0.0153 |
| Naive-Convert | 0.9982 | 0.0948 | 0.1363 |
| Viterbi-Convert | 0.9981 | 0.0436 | 0.0623 |
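In standard terms (with a recombination event as the positive class and TP, FP, FN, TN the true/false positives and negatives), these columns correspond to:

$$\text{accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \text{precision}=\frac{TP}{TP+FP},\qquad \text{recall}=\frac{TP}{TP+FN}$$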
So we can see that Original+logistic judges recombination events cautiously, leading to high accuracy and precision, but it ignores a large part of the real recombination events, causing a low recall. Naive-Convert has better performance on recall. According to the precision-recall curves, Naive-Convert has the best performance.
Which is a good method to estimate the recombination events? It depends on what we can sacrifice. As shown in Table 2, a total of 1636 true recombination events occur. Original+logistic can detect only 31, far fewer than the true number, but 25 of those 31 recombination events match the truth, which is the highest proportion. However, 31 is too small compared to the total number. Although this could be improved by modifying the parameters of the logistic regression, training on every dataset is not feasible in practice. The benefit of discretizing the probability is that the difference between the total number of recombination events and the true value is further reduced, whether for Naive-converted or Viterbi-converted, especially in the fourth generation, where the difference is the smallest.
To explore the gap between the predicted and true locations, I computed the intersection between the various methods and the true values. Because the limitations of these methods may lead to inaccurate judgments of the locations where recombination events occurred, I also checked whether the predicted recombination events fall in the neighbourhood of the true ones. Although the performance of exact prediction is poor, it improves when we extend the region. For example, Naive-converted predicts only 223 recombination events at exactly the right position, while 1056 recombination events fall at true loci or loci surrounding true loci.
Faced with such a long Markov chain, Naive-converted and Viterbi-converted have limited ability to accurately predict each point, and the comparison of Naive-converted and Viterbi-converted also shows this. The exact intersection of Naive-converted and Viterbi-converted includes only 15 events, but the neighbouring intersection is larger than their own predicted numbers (because nearby sites are counted repeatedly). Apparently, Naive-converted and Viterbi-converted rely on different algorithms but produce similar results.
Table 2: The number of recombination events

| Method | Whole Pedigree | G1 | G2 | G3 | G4 | G5 |
|---|---|---|---|---|---|---|
| True | 1636 | 0 | 392 | 401 | 427 | 416 |
| Original (>0) | 937317 | 0 | 282966 | 203289 | 210453 | 240609 |
| Original+logistic | 31 | 0 | 3 | 14 | 14 | 0 |
| Naive-converted | 2353 | 0 | 907 | 614 | 434 | 398 |
| Viterbi-converted | 2337 | 0 | 893 | 611 | 435 | 398 |
| True and Original | 1616 | 0 | 378 | 396 | 426 | 416 |
| True and Naive-converted | 223 | 0 | 43 | 70 | 70 | 40 |
| True and Naive-converted neighbouring | 1056 | 0 | 194 | 287 | 301 | 274 |
| True and Viterbi-converted | 102 | 0 | 14 | 37 | 30 | 21 |
| True and Viterbi-converted neighbouring | 1056 | 0 | 195 | 284 | 302 | 273 |
| True and Original+logistic | 25 | 0 | 2 | 11 | 12 | 0 |
| True and Original+logistic neighbouring | 26 | 0 | 3 | 11 | 12 | 0 |
| Naive-converted and Viterbi-converted | 15 | 0 | 5 | 4 | 5 | 1 |
| Naive-converted and Viterbi-converted neighbouring | 2379 | 0 | 916 | 619 | 440 | 404 |
*Original (>0) calls a recombination event wherever the recombination probability is greater than 0. "A and B" means the number of recombination events of A and B occurring at the same position. "A and B neighbouring" means the number of recombination events of A and B occurring within a 20-variant window.
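A minimal sketch of how these intersection counts could be computed (illustrative names; `pred_pos` and `true_pos` are the variant indices of predicted and true recombination events for one individual, and "neighbouring" is assumed here to mean within ±20 variants):

```python
import numpy as np

def count_overlap(pred_pos, true_pos, window=20):
    pred_pos = np.asarray(pred_pos)
    true_pos = np.asarray(true_pos)
    # exact: predicted events at exactly the same variant index as a true event
    exact = int(np.isin(pred_pos, true_pos).sum())
    # neighbouring: predicted events with at least one true event within +/- window variants
    near = int(sum(np.any(np.abs(true_pos - p) <= window) for p in pred_pos))
    return exact, near
```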
@AprilYUZhang Could you clarify some terms in your table 2? It is unclear to me what true and original+logistic mean.
"original+logistic: To fit the precision and recall, I convert the original recombination probability to discrete values by logistic regression. (I didn't split train and test dataset, so there is risk of overfitting.)" true : the recombination estimated from true segregation probability.
@AprilYUZhang Logistic regression assumes the recombination probabilities to be independent, so it is not applicable in this way. The number of recombination events predicted is not very useful in general, because we could have a lot of false positives or false negatives. Also, for the precision-recall curve you only tried very few thresholds, so I don't know what kind of conclusions we can draw from these results.
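For reference, a full precision-recall curve over the raw per-locus recombination probabilities could be obtained by sweeping all thresholds; a sketch (same hypothetical `rec_prob`/`rec_true` arrays as above, still treating loci as independent observations, which, as noted, they are not):

```python
from sklearn.metrics import precision_recall_curve

def pr_curve(rec_prob, rec_true):
    # precision and recall at every threshold implied by the observed probabilities
    precision, recall, thresholds = precision_recall_curve(rec_true, rec_prob)
    return precision, recall, thresholds
```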