AlphaGenes / AlphaPeel

AlphaPeel: calling, phasing, and imputing genotype and sequence data in pedigrees
MIT License

implement Viterbi on segregation prob #165

Open AprilYUZhang opened 4 months ago

AprilYUZhang commented 4 months ago
  1. Function to naively convert [0, 1] seg probs to {0, 1} calls (just 0 or 1) - we will just use max on the seg probs obtained from the current iterative sum-product (or some other future) algorithm, so this is a really simple function.
  2. Function to Viterbi-convert [0, 1] seg probs to {0, 1} calls (just 0 or 1) - as above, this function will work on seg probs from the current iterative sum-product (or some other future) algorithm, but we need to do the Viterbi forward pass and then the "backward max" pass to call the most likely event (which could be completed by the above naive-max function!)
XingerTang commented 4 months ago

@gregorgorjanc

One way to implement the Viterbi algorithm on seg probs is to treat the loci as time steps and the different segregation statuses as states, with the transmission function depending on the recombination rate. However, there are problems with this implementation.

The other way to implement the algorithm is to first select a few recombination hotspots using the recombination rates, and then apply the algorithm with the recombination hotspots $k$ as the time steps and the estimated segregation statuses (pp, pm, mp, mm) to the right of $k$ as the states, letting the transmission depend on the estimated recombination rate. Since the hotspots are selected based on the recombination rates, they have higher recombination rates, enabling prediction of recombination events around them. Since, in the ideal case (the selected sites are separated by long distances), the segregation status only depends on the immediately previous segregation status, this implementation also solves the first problem above. The probability of observing $Y_k$ given the state $X_k$ can be calculated from a quantity measuring the difference between the predicted segregation status and the real segregation data. There are possible difficulties with this implementation.
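
In either setup, the transition between segregation states over one interval is a function of the recombination probability. As a minimal sketch (assuming an equal recombination probability r on the paternal and maternal sides; the function name is illustrative), with states ordered (pp, pm, mp, mm):

import numpy as np

def seg_transition(r):
    # 4x4 transition matrix between segregation states (pp, pm, mp, mm)
    # for one inter-locus interval with recombination probability r on
    # each parental side (assumed equal here)
    no = (1 - r) ** 2   # no recombination on either side
    one = r * (1 - r)   # recombination on exactly one side
    two = r ** 2        # recombination on both sides
    return np.array([[no,  one, one, two],
                     [one, no,  two, one],
                     [one, two, no,  one],
                     [two, one, one, no]])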

gregorgorjanc commented 4 months ago

@XingerTang as you know, having good estimates of recombination rates, and hence hotspots, is hard, so we cannot start from something we don't generally have.

I suggest that we first just do the naive call and see how it looks. I suspect that it will work well, but we will have some false positives (= switch errors). Do we need an alternative algorithm that would smooth out these short recombination blips (so that we get long stretches of haplotypes; we know there are only a few recombinations between a parent and a progeny!) and create clean cuts from one parent's haplotype to the other's?

XingerTang commented 4 months ago

@gregorgorjanc Uni is trying to implement the naive-call function for the AlphaPeel format data. I experimented on my side with the segregation data in my custom data format, just to see the result quickly. The algorithm is implemented in a very simple way: it takes the segregation status with the highest probability, and if it encounters multiple probabilities with the same highest value, it takes the first index (so it chooses [1, 0, 0, 0] for [0.25, 0.25, 0.25, 0.25]). I performed the experiment on the segregation data generated by multi-locus peeling, with 2000000 loci evaluated in total.

To measure performance, I use the sum of the absolute values of the differences between the predicted and the true segregation data. The real number of inaccurate calls is obtained by dividing this absolute sum by $2$, as each inaccurate prediction contributes $2$ unit differences.

On the whole data set, the absolute sum of differences is 1245326.

Excluding the first generation, which is entirely [0.25, 0.25, 0.25, 0.25], the absolute sum of differences is 645326 ($\frac{645326}{3200000} \approx 0.20 \implies 20\%$ of the called segregation statuses are wrong).
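
For reference, a minimal sketch of this error metric, assuming the called and true segregation data for one individual are stored as (loci x 4) one-hot arrays (names are illustrative):

import numpy as np

def call_error_rate(called, true_seg):
    # Each wrong call contributes two unit differences between
    # one-hot rows, hence the division by 2
    abs_sum = np.abs(called - true_seg).sum()
    return abs_sum / 2 / called.shape[0]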

XingerTang commented 4 months ago

@gregorgorjanc For smoothing out short recombination blips, a common practice is to use windows, but then it is difficult to locate the exact position of the recombination event. I'm still working on the regression matrix that uses the segregation probabilities of the neighbouring loci to predict the recombination rate; if that works well, I expect it to smooth out at least some short recombination blips. (The windowing idea is sketched below.)
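
To illustrate the windowing practice mentioned above (not the regression approach, which is still in progress), a sliding majority vote over called states removes isolated blips; a sketch, assuming calls are stored as integer state indices (0-3) per locus:

import numpy as np

def majority_smooth(states, window=5):
    # Replace each called state with the majority state in a centred
    # window; blips shorter than about half the window are smoothed away,
    # at the cost of blurring the exact breakpoint position
    half = window // 2
    smoothed = states.copy()
    for i in range(len(states)):
        lo, hi = max(0, i - half), min(len(states), i + half + 1)
        smoothed[i] = np.bincount(states[lo:hi], minlength=4).argmax()
    return smoothed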

XingerTang commented 4 months ago

As the generation number increases, the performance of the naive calling algorithm improves, except for the last generation.

The following data is already divided by 2.

| Generation | Absolute sum of differences |
| --- | --- |
| 2 | 537088 |
| 3 | 42458 |
| 4 | 29474 |
| 5 | 36306 |

This is probably a result of the improved accuracy of the segregation probabilities estimated by AlphaPeel for the middle generations.

XingerTang commented 4 months ago

The loci whose segregation statuses were wrongly called often appear as intervals. Such intervals (including those of length 1) occur $1562$ times across the loci from generation 2 to generation 5.

In this block, we only consider generations 2 to 5.

The occurrences of the intervals are distributed evenly across all the individuals of all the generations. In particular, they do not tend to be close to the loci where recombination events happen. The histogram below illustrates how the intervals are distributed among the loci, with the left side representing the loci of early generations and the right side representing the loci of later generations.

[histogram of interval occurrences across loci]

However, the lengths of the intervals change across generations; below is a scatter plot of the lengths of the intervals against the sequence of the intervals. The left side again represents the early generations and the right side the later generations.

[scatter plot of interval lengths]

AprilYUZhang commented 4 months ago

@XingerTang this is my code for "Naive-convert" and "Viterbi-convert".

Naive-convert

import numpy as np

def naive_convert(seg_prob_array):
    # One-hot call: set the most probable segregation state to 1 at each
    # (individual, locus); ties resolve to the first index via argmax
    seg_array = np.zeros(seg_prob_array.shape)
    for i in range(seg_prob_array.shape[0]):
        for j in range(seg_prob_array.shape[1]):
            seg_array[i, j, np.argmax(seg_prob_array[i, j])] = 1
    return seg_array
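
As a side note, the same conversion can be done without the Python loops; a vectorized sketch (the function name is illustrative), assuming the same (individuals x loci x 4) layout and the same first-index tie-breaking as argmax:

import numpy as np

def naive_convert_vectorized(seg_prob_array):
    # One-hot encode the argmax state at each (individual, locus)
    idx = np.argmax(seg_prob_array, axis=-1)
    return np.eye(seg_prob_array.shape[-1])[idx]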

Viterbi-convert

# Use an equal recombination rate to generate the transmission vector for loci.
# If we had distances, we could imitate the Li-Stephens model's transition
# function and incorporate the distances.
transmission = np.ones(seg_prob_array.shape[1]) * (1636 / (800 * 2000))

def Viterbi(pointSeg, transmission):

    # Segregation estimate state ordering: pp, pm, mp, mm
    nLoci = pointSeg.shape[0]
    seg = np.full(pointSeg.shape, 0.25, dtype=np.float32)
    seg_arg = np.full(pointSeg.shape, 0, dtype=np.int16)

    for i in range(nLoci):
        for j in range(4):
            seg[i, j] = pointSeg[i, j]

    tmp = np.full(4, 0, dtype=np.float32)
    new = np.full(4, 0, dtype=np.float32)
    new_arg = np.full(4, 0, dtype=np.int16)

    prev = np.full(4, 0.25, dtype=np.float32)

    for i in range(nLoci):
        # Create the transition matrix for this interval from the
        # recombination probability e (state order: pp, pm, mp, mm)
        e = transmission[i - 1] if i > 0 else transmission[0]
        e2 = e ** 2
        e1e = e * (1 - e)
        e2i = (1.0 - e) ** 2
        transition = np.array([[e2i, e1e, e1e, e2],
                               [e1e, e2i, e2, e1e],
                               [e1e, e2, e2i, e1e],
                               [e2, e1e, e1e, e2i]])

        # At the first locus, regard [0.25, 0.25, 0.25, 0.25] as the initial state
        if i != 0:
            for j in range(4):
                tmp[j] = prev[j] * pointSeg[i - 1, j]
        else:
            for j in range(4):
                tmp[j] = prev[j] * 0.25

        # Normalize (guarding against an all-zero sum)
        sum_j = 0
        for j in range(4):
            sum_j += tmp[j]
        if sum_j == 0:
            sum_j = 1
        for j in range(4):
            tmp[j] = tmp[j] / sum_j

        # !                  fm  fm  fm  fm
        # !segregationOrder: pp, pm, mp, mm
        # Forward pass: best score and best predecessor for each state
        for j in range(4):
            new[j] = np.max(tmp * transition[j] * seg[i, j])
            new_arg[j] = np.argmax(tmp * transition[j] * seg[i, j])

        for j in range(4):
            seg_arg[i, j] = new_arg[j]
        prev = new

    # There are four paths we could track; use the most likely state at the
    # final locus (from the segregation probabilities) as the starting point
    prev_index = np.argmax(pointSeg[-1])

    # Back-track through the stored argmax pointers
    track_rec = np.zeros(seg_arg.shape[0] - 1, dtype=np.int16)
    a = np.zeros(seg_arg.shape[0], dtype=np.int16)
    for i in range(seg_arg.shape[0] - 1, -1, -1):
        prev_index = seg_arg[i][prev_index]
        a[i] = prev_index

    # Mark a breakpoint wherever the called state changes between adjacent loci
    for i in range(seg_arg.shape[0] - 1):
        if a[i] != a[i + 1]:
            track_rec[i] = 1
    return track_rec

# Run for every individual
seg_a = np.zeros((1000, 1999), dtype=np.int16)
for individual in range(seg_a.shape[0]):
    track_rec = Viterbi(seg_prob_array[individual], transmission)
    seg_a[individual] += track_rec
AprilYUZhang commented 4 months ago

@gregorgorjanc and @XingerTang, there is good news and bad news. The good news is that Naive-convert has better performance than I expected. Basically, the results of the Viterbi algorithm are equal to those of Naive-convert, but with more running time.

| Algorithm | Run time | Accuracy of recombination probability |
| --- | --- | --- |
| Original | 5.06s | 0.9706 |
| Naive-convert | 1.53s + 5.01s | 0.9982 |
| Viterbi-convert | 53.48s | 0.9981 |

The details for every generation:

| Generation | Original | Naive-convert | Viterbi-convert | One of the individuals |
| --- | --- | --- | --- | --- |
| 1 | 1.0 | 1.0 | 1.0 | [image] |
| 2 | 0.9280 | 0.9970 | 0.9969 | [image] |
| 3 | 0.9739 | 0.9978 | 0.9977 | [image] |
| 4 | 0.9777 | 0.9982 | 0.9980 | [image] |
| 5 | 0.9734 | 0.9982 | 0.9981 | [image] |
XingerTang commented 4 months ago

> There is good news and bad news. The good news is that Naive-convert has better performance than I expected. Basically, the results of the Viterbi algorithm are equal to those of Naive-convert, but with more running time. […]

@AprilYUZhang I'm wondering how you measure the accuracy of the recombination probability

AprilYUZhang commented 4 months ago

> I'm wondering how you measure the accuracy of the recombination probability

Just 1 minus the mean absolute difference, so maybe not very accurate.

import numpy as np

def cor2(a, b):
    # 1 minus the mean absolute difference over all entries
    cos_theta = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            cos_theta += np.abs(a[i, j] - b[i, j])
    return 1 - cos_theta / (a.shape[0] * a.shape[1])
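
For what it's worth, numpy can compute the same quantity directly; an equivalent one-liner (the function name is illustrative):

import numpy as np

def cor2_vectorized(a, b):
    # 1 minus the mean absolute difference, equivalent to cor2 above
    return 1 - np.mean(np.abs(a - b))
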
XingerTang commented 4 months ago

@AprilYUZhang Could you measure the accuracy of the prediction of the segregation statuses for both methods?

AprilYUZhang commented 4 months ago

> Could you measure the accuracy of the prediction of the segregation statuses for both methods?

The Viterbi method just uses the Original's segregation probabilities, so I skipped it. The first one is Naive-convert, the second one is Original.

[image]

I also notice that the segregation probabilities and the recombination rate give opposite results. I think this should be attributed to too much noise in the segregation probabilities.

XingerTang commented 4 months ago

@AprilYUZhang Calling the segregation status definitely loses some information, which could contribute to the decrease in accuracy. Your Viterbi implementation can easily be modified to record the predicted segregation statuses (just return the list a). Could you check the result again with the Viterbi-predicted segregation statuses, to compare the two implementations for calling the segregation status? (A helper for the comparison is sketched below.)
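
If Viterbi is changed to also return the back-tracked state list a, a small helper (the name is illustrative) makes those states directly comparable with the naive-convert one-hot output:

import numpy as np

def states_to_onehot(a, n_states=4):
    # Convert back-tracked state indices (one per locus, 0..3) into
    # one-hot segregation calls comparable with naive_convert output
    return np.eye(n_states, dtype=np.int16)[a]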

AprilYUZhang commented 4 months ago

> Your Viterbi implementation can easily be modified to record the predicted segregation statuses (just return the list a). Could you check the result again with the Viterbi-predicted segregation statuses?

In fact, Viterbi cannot output a new segregation probability; because it is just max-product, I guess it shouldn't be regarded as a new segregation probability.

XingerTang commented 4 months ago

> In fact, Viterbi cannot output a new segregation probability; because it is just max-product, I guess it shouldn't be regarded as a new segregation probability.

Isn't the argmax state (segregation status) the most likely state (segregation status) given by the Viterbi?

AprilYUZhang commented 4 months ago

> Isn't the argmax state (segregation status) the most likely state (segregation status) given by the Viterbi?

I don't think we should focus on this, because the most likely state at every locus is based only on the previous one. If we look at this number alone, it is meaningless.

XingerTang commented 4 months ago

> I don't think we should focus on this, because the most likely state at every locus is based only on the previous one. If we look at this number alone, it is meaningless.

It is also based on all the previous loci, because the probabilities are passed along through each multiplication. The reason I still insist on checking the segregation statuses is that the accuracy of recombination predictions can sometimes lie: there may be a recombination, but on the wrong haplotype.

AprilYUZhang commented 4 months ago

> The reason I still insist on checking the segregation statuses is that the accuracy of recombination predictions can sometimes lie: there may be a recombination, but on the wrong haplotype.

[Screenshot 2024-06-02 at 14:56:08]
XingerTang commented 4 months ago

> [Screenshot 2024-06-02 at 14:56:08]

It seems that the Viterbi implementation preserves much more information than the naively converted states; that's good news!

AprilYUZhang commented 4 months ago

@gregorgorjanc and @XingerTang To compute precision and recall, I converted the original recombination probabilities to discrete values with logistic regression. (I didn't split into train and test datasets, so there is a risk of overfitting.) The plot below shows what these terms mean: the green box contains the purple cherries identified correctly by the classifier, while the oranges identified correctly by the classifier are in the blue box. The yellow box contains cherries that were incorrectly identified as oranges, and the opposite holds for the oranges in the orange box.

[illustration of classification outcomes]

Accuracy is all the correct results divided by the total. The recall rate can be regarded as how many of the cherries are identified correctly, and the precision as how many of the results considered cherries are correct.

Table 1: Performance of methods to estimate recombination events

| Method | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| Original+logistic | 0.9992 | 0.8064 | 0.0153 |
| Naive-convert | 0.9982 | 0.0948 | 0.1363 |
| Viterbi-convert | 0.9981 | 0.0436 | 0.0623 |

[image]

So we can see that Original+logistic judges recombination events cautiously, leading to high accuracy and precision, but it ignores a large part of the real recombination events, causing a low recall rate. Naive-convert performs better on recall. According to the precision and recall curves, Naive-convert has the best performance. (A sketch of how these metrics are computed from the called events is given below.)
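
For reference, a minimal sketch of how accuracy, precision, and recall can be computed from binary per-locus event vectors (1 = recombination called/true at that position; names are illustrative):

import numpy as np

def event_metrics(pred, true):
    # Confusion-matrix counts for binary recombination calls
    tp = np.sum((pred == 1) & (true == 1))
    fp = np.sum((pred == 1) & (true == 0))
    fn = np.sum((pred == 0) & (true == 1))
    tn = np.sum((pred == 0) & (true == 0))
    accuracy = (tp + tn) / pred.size
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall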

Which method is good for estimating recombination events? It depends on what we can sacrifice. As shown in Table 2, a total of 1636 true recombination events occur. Original+logistic detects only 31, far fewer than the true number, but 25 of those 31 recombination events coincide with true ones, which is the highest proportion. Still, 31 is too few compared with the total. Although this could be improved by modifying the parameters of the logistic regression, training on every dataset is not feasible in practice. The benefit of discretizing the probabilities is that the difference between the total number of predicted recombination events and the true number is further reduced, for both Naive-converted and Viterbi-converted, especially in the fourth generation, where the difference is smallest.

To explore the gap between predicted and true locations, I intersected the predictions of the various methods with the true values. Because the limitations of these methods may lead to inaccurate judgments of where recombination events occurred, I also checked whether predicted recombination events fall near true ones. Although the exact-position predictions are poor, they improve when we extend the region: for example, Naive-converted predicts only 223 recombination events at exactly the true loci, while 1056 predicted events fall on or around true loci.

Faced with such a long Markov chain, Naive-converted and Viterbi-converted have limited ability to predict each point exactly, and the comparison between them shows this too: their exact intersection contains only 15 events, while their windowed intersection is larger than their own predicted counts (because nearby sites are counted repeatedly). Apparently, Naive-converted and Viterbi-converted rely on different algorithms but give similar results.

Table 2: The number of recombination events

| | Whole pedigree | G1 | G2 | G3 | G4 | G5 |
| --- | --- | --- | --- | --- | --- | --- |
| True | 1636 | 0 | 392 | 401 | 427 | 416 |
| Original (>0) | 937317 | 0 | 282966 | 203289 | 210453 | 240609 |
| Original+logistic | 31 | 0 | 3 | 14 | 14 | 0 |
| Naive-converted | 2353 | 0 | 907 | 614 | 434 | 398 |
| Viterbi-converted | 2337 | 0 | 893 | 611 | 435 | 398 |
| True and Original | 1616 | 0 | 378 | 396 | 426 | 416 |
| True and Naive-converted | 223 | 0 | 43 | 70 | 70 | 40 |
| True and Naive-converted neighbouring | 1056 | 0 | 194 | 287 | 301 | 274 |
| True and Viterbi-converted | 102 | 0 | 14 | 37 | 30 | 21 |
| True and Viterbi-converted neighbouring | 1056 | 0 | 195 | 284 | 302 | 273 |
| True and Original+logistic | 25 | 0 | 2 | 11 | 12 | 0 |
| True and Original+logistic neighbouring | 26 | 0 | 3 | 11 | 12 | 0 |
| Naive-converted and Viterbi-converted | 15 | 0 | 5 | 4 | 5 | 1 |
| Naive-converted and Viterbi-converted neighbouring | 2379 | 0 | 916 | 619 | 440 | 404 |

*Original (>0) counts a recombination event wherever the recombination probability is greater than 0. "A and B" means the number of recombination events of A and B occurring at the same position. "A and B neighbouring" means the number of recombination events of A and B occurring within a 20-variant window.
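
A sketch of the windowed ("neighbouring") matching behind Table 2, assuming predicted and true events are binary vectors over loci (the window convention and the function name are assumptions, not necessarily what was used):

import numpy as np

def count_matches(pred, true, window=0):
    # Count predicted events with a true event within window loci;
    # window=0 gives exact-position matches ("A and B"), window=20 the
    # "neighbouring" counts. With a nonzero window, several predictions
    # can match the same true event, so the neighbouring counts can
    # exceed the exact counts, as noted above.
    true_pos = np.flatnonzero(true)
    matches = 0
    for p in np.flatnonzero(pred):
        if true_pos.size and np.min(np.abs(true_pos - p)) <= window:
            matches += 1
    return matches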

XingerTang commented 4 months ago

@AprilYUZhang Could you clarify some terms in your Table 2? It is unclear to me what "True" and "Original+logistic" mean.

AprilYUZhang commented 4 months ago

> To compute precision and recall, I converted the original recombination probabilities to discrete values with logistic regression. (I didn't split into train and test datasets, so there is a risk of overfitting.)

"Original+logistic" is as quoted above: the recombination events obtained by converting the original recombination probabilities to discrete values with logistic regression. "True" is the recombination events estimated from the true segregation probabilities.

XingerTang commented 4 months ago

@AprilYUZhang Logistic regression assumes the recombination probabilities are independent, so it is not applicable in this way. The number of predicted recombination events is not very useful in general, because we could have a lot of false positives or false negatives, and for the precision-recall curve you only tried very few thresholds, so I don't know what kind of conclusions we can draw from these results. (Sweeping many thresholds, as sketched below, would give a proper curve.)
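
To get a full precision-recall curve, the original continuous recombination probabilities can be swept over many thresholds rather than a single fitted cutoff; a sketch using scikit-learn (variable names are illustrative):

import numpy as np
from sklearn.metrics import precision_recall_curve

def pr_curve(true_events, rec_prob):
    # true_events: binary 0/1 per locus; rec_prob: continuous recombination
    # probabilities from the original output (both flattened across
    # individuals). Every distinct probability value is tried as a threshold.
    precision, recall, thresholds = precision_recall_curve(true_events, rec_prob)
    return precision, recall, thresholds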