Modify haploid Viterbi/FB to handle NONCOPY state in reference panel

szhan commented 3 months ago

If ancestral haplotypes are used as part of a reference panel for LS matching, then non-copying states in them need to be treated differently than other states. When a non-copying state is encountered, the emission probability should be 0, because it is not allowed to copy from it. We arbitrarily set the numeric value of non-copying states to be -2, which is distinct from missing state that is set to -1. Here, we modify how emission probabilities are computed in the haploid version of the Viterbi algorithm. We also add examples of reference panels, queries, and expected paths for testing.

szhan commented 3 months ago

The tests cover the following:

Four types of ancestral haplotypes that span:
- From the first position to the middle, with NONCOPY states on the right side.
- From the middle to the last position, with NONCOPY states on the left side.
- A middle section, excluding the first and last few positions, which are occupied by NONCOPY states.
- Completely from end to end, with no NONCOPY states.
Switching from:
- Sample to sample (1 switch).
- Sample to ancestor (1 switch).
- Ancestor to sample (1 switch).
- Ancestor to ancestor (1 switch).
- Sample to ancestor to sample (2 switches).
- Ancestor to ancestor to sample (2 switches).
Missing data in query at a switch position (which delays switching) and at a position to the left of it (which results in no delay in switching).
Multiallelic sites.
Two equally likely paths.

szhan commented 3 months ago

Presently, rho / n, which is part of the calculation of switch probabilities, is computed by dividing rho along the sites by a constant n, which is the number of reference haplotypes. It is part of the standard formulation of Li & Stephens HMM which does not define any non-copying state in the reference panel. Here, because of the presence of NONCOPY, we should allow the denominator to vary along the sites according to the number of reference haplotypes without the NONCOPY state. The consequences of this for the Viterbi, forward, backward, and forward-backward algorithms should be discussed in a separate issue.

astheeggeggs commented 3 months ago

So the intuition here is that the number of templates in the reference panel jumps from n -> 2n -1, and you fill the ancestral reference templates with missing values (on anything really) where there is no ancestral material in those haplotypes, give them a 'NON_COPY' label at those sites, and set the corresponding emission prob to 0. You can then set recombination rate as rho_j/(2*n-1-|sum(NON_COPY_j)|) for each site j. That logic all makes sense to me!

On Fri, Mar 22, 2024 at 9:10 AM Jerome Kelleher @.***> wrote:

@.**** commented on this pull request.

Nice, looks great @szhan https://github.com/szhan.

@astheeggeggs https://github.com/astheeggeggs, what do you think? It would be really helpful to have a direct matrix-based implementation of the full "all haplotypes" model we can compare with, and this seems like the most natural way. Would be great to get your thoughts here.

— Reply to this email directly, view it on GitHub https://github.com/astheeggeggs/lshmm/pull/31#pullrequestreview-1954297205, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVQA77ORBRY3VYMCOP7QA3YZPYSBAVCNFSM6AAAAABFB2M6BGVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTSNJUGI4TOMRQGU . You are receiving this because you were mentioned.Message ID: @.***>

szhan commented 3 months ago

Is there a point to add tests for cases where we expect at least two equally likely paths? It seems quite easy to encounter cases with more than one best path when the number of reference haplotypes and number of sites grow.

szhan commented 3 months ago

I have reworked the tests, which now cover the following queries copying from:

An ancestor with only NONCOPY states.
An ancestor on the left and a sample on the right.
A sample on the left and an ancestor on the right.
An ancestor in the middle and a sample in both the flanking regions.
Two ancestors and one sample (two switches to a different haplotype).
An ancestor and a sample but with a MISSING state at the switch position.
An ancestor and a sample but with a MISSING state left of the switch position.

In the test ref. panels, all the ancestors carry at least one NONCOPY state, and all the samples carry no NONCOPY state.

Instead of asserting that the expected path and the actual path are identical, the new tests assert that the log-likelihood of the expected path and that of the actual path are approximately equal (i.e. passing np.allclose). This avoids needing to deal with equally likely paths that may not be so obvious (it gets too tedious to check anyway).

szhan commented 3 months ago

Just to add a note here that in tsinfer n (i.e. number of match nodes) is constant along the genome, given a ref. panel. So, the above tests should test the behavior of LS HMM Viterbi algorithm as in tsinfer, I think?

szhan commented 3 months ago

I have two sets of tests passing now. One set (test_API_noncopy_manual.py) contains manually crafted test examples, and the other set (test_API_noncopy.py) has more or less the same set of tests as test_API_multiallelic.py.

In both sets of tests, NONCOPY characters in the reference panel are not counted when computing the number of alleles per site (i.e. n_alleles), which is used for mutation rate scaling.

The set of tests to keep is in test_API_noncopy.py. There are several major differences between this test file and test_API_multiallelic.py, on which it is heavily based.

Instead of getting a genotype matrix from a simulated tree sequence as the reference panel, it calls another function (get_ancestral_haplotypes) to extract ancestral haplotypes from it. When there is recombination in the past, partial haplotypes with NONCOPY denoting non-ancestral material are incorporated into the reference panel.
Tree sequences are simulated using SMC instead of Hudson's, in order to avoid generating discontiguous partial haplotypes caused by trapped non-ancestral material.
NONCOPY and MISSING characters are not counted as distinct alleles (see related #33).

Probably I should delete test_API_noncopy_manual.py, unless some of the hand-crafted examples may be useful for testing. @jeromekelleher ?

Otherwise, I think the tests are ready for review @astheeggeggs, besides some more minor tweaks I'd like to make.

szhan commented 3 months ago

Ah, by modifying lshmm/api.py, both the haploid and diploid versions are affected, but I haven't added tests for the diploid version.

Also, I think we should probably include assertions that none of the most likely paths go through a base in the reference panel that is equal to NONCOPY.

astheeggeggs commented 3 months ago

Looking at this again, I think it needs a rethink. We have emission probabilities that don't sum to 1, which doesn't make sense in the usual HMM framework.

szhan commented 2 months ago

I'm going to scrap most of the code here, since it makes sense to do refactoring before adding new features. I'll probably just keep some of the handcrafted test cases and the routine that gets a ref. panel with NONCOPY states.

szhan commented 2 months ago

Aside from rethinking about this, we should probably refactor the code a bit first. See #37

szhan commented 1 month ago

I've deleted this branch of code, but it is probably useful to keep the discussion here.

szhan commented 1 month ago

Thanks @astheeggeggs ! Please do not merge the code in this branch! Please look at the code in #41 instead.

szhan commented 1 month ago

Closing thie PR because it's replaced by PR #41.

astheeggeggs / lshmm

Modify haploid Viterbi/FB to handle NONCOPY state in reference panel #31