astheeggeggs / lshmm

code to run Li and Stephens
MIT License
3 stars 3 forks source link

Handle NONCOPY in reference panel #40

Closed szhan closed 1 month ago

szhan commented 1 month ago

We want to handle NONCOPY states in partial ancestral haplotypes in the reference panel when running the LS Viterbi and forward-backward algorithms, for both the haploid and diploid cases.

I should have created an issue for PR #31 first. Some of the discussion has been happening there. Below is some explanation copied from PR #31.

"If ancestral haplotypes are used as part of a reference panel for LS matching, then non-copying states in them need to be treated differently than other states. When a non-copying state is encountered, the emission probability should be 0, because it is not allowed to copy from it. We arbitrarily set the numeric value of non-copying states to be -2, which is distinct from missing state that is set to -1. Here, we modify how emission probabilities are computed in the haploid version of the Viterbi algorithm. We also add examples of reference panels, queries, and expected paths for testing."

szhan commented 1 month ago

In the current implementation, emission probabilities are stored in and accessed through a matrix (number of sites by 2). When there is a mismatch between the alleles in the query and a reference haplotype at site m, the value indexed at (m, 0) is accessed; otherwise, the value indexed at (m, 1) is accessed.

I'm not sure if this a good way to get the emission probabilities when we have a NONCOPY state, because the emission probability is always set to zero. A simple way to deal with this without make too many changes is to add a simple function that checks whether the reference haplotype in question has a NONCOPY state before evaluating whether the alleles in the query and reference haplotypes are matched or not.

Something like this, I think.

def get_emission_probability(ref_allele, query_allele, site, emission_matrix):
    if ref_allele == NONCOPY:
        return 0.0
    else:
        emission_index = get_index_in_emission_matrix_haploid(ref_allele, query_allele)
        return emission_matrix[site, emission_index]
szhan commented 1 month ago

Note to self. In the diploid case, if either one of the two haplotypes has a NONCOPY state at a given site, then the emission probability is zero. So, these are the cases: