Closed szhan closed 1 month ago
The tests cover the following:

- `NONCOPY` states on the right side.
- `NONCOPY` states on the left side.
- `NONCOPY` states.
- `NONCOPY` states.

Presently, `rho / n`, which is part of the calculation of switch probabilities, is computed by dividing `rho` along the sites by a constant `n`, the number of reference haplotypes. This is part of the standard formulation of the Li & Stephens HMM, which does not define any non-copying state in the reference panel. Here, because of the presence of `NONCOPY`, we should allow the denominator to vary along the sites according to the number of reference haplotypes without the `NONCOPY` state. The consequences of this for the Viterbi, forward, backward, and forward-backward algorithms should be discussed in a separate issue.
So the intuition here is that the number of templates in the reference panel jumps from n -> 2n -1, and you fill the ancestral reference templates with missing values (on anything really) where there is no ancestral material in those haplotypes, give them a 'NON_COPY' label at those sites, and set the corresponding emission prob to 0. You can then set recombination rate as rho_j/(2*n-1-|sum(NON_COPY_j)|) for each site j. That logic all makes sense to me!
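A minimal sketch of that per-site denominator, assuming `NONCOPY` is encoded as `-2` (the panel values and `rho` values here are purely illustrative):

```python
import numpy as np

NONCOPY = -2  # assumed sentinel for non-ancestral material

# Toy reference panel of 2*n - 1 = 5 templates x 3 sites.
ref_panel = np.array([
    [0, 1, NONCOPY],
    [1, NONCOPY, 0],
    [NONCOPY, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
])
rho = np.array([0.0, 0.25, 0.25])  # per-site recombination probabilities

# Number of copyable templates at each site j: total rows minus NONCOPY entries.
num_copyable = ref_panel.shape[0] - np.sum(ref_panel == NONCOPY, axis=0)

# Site-specific switch denominator instead of a constant n.
switch_prob = rho / num_copyable
```

Here each site has one `NONCOPY` entry, so the denominator is 4 everywhere; in general it varies along the genome.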
On Fri, Mar 22, 2024 at 9:10 AM Jerome Kelleher @.***> wrote:
Nice, looks great @szhan https://github.com/szhan.
@astheeggeggs https://github.com/astheeggeggs, what do you think? It would be really helpful to have a direct matrix-based implementation of the full "all haplotypes" model we can compare with, and this seems like the most natural way. Would be great to get your thoughts here.
Is there a point in adding tests for cases where we expect at least two equally likely paths? It seems quite easy to encounter cases with more than one best path as the number of reference haplotypes and the number of sites grow.
I have reworked the tests, which now cover the following queries copying from:

- `NONCOPY` states.
- `MISSING` state at the switch position.
- `MISSING` state left of the switch position.

In the test ref. panels, all the ancestors carry at least one `NONCOPY` state, and none of the samples carry a `NONCOPY` state.
Instead of asserting that the expected path and the actual path are identical, the new tests assert that the log-likelihood of the expected path and that of the actual path are approximately equal (i.e. passing `np.allclose`). This avoids having to deal with equally likely paths that may not be obvious (it gets too tedious to check anyway).
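To illustrate why comparing log-likelihoods is more robust than comparing paths, here is a toy sketch; `path_loglik` is a hypothetical helper using a simplified Li & Stephens transition, not part of the lshmm API:

```python
import numpy as np

def path_loglik(path, query, ref_panel, mu, rho):
    """Log-likelihood of a copying path under a simplified Li & Stephens
    model (hypothetical helper for illustration, not the lshmm API)."""
    n, m = ref_panel.shape
    loglik = np.log(1.0 / n)  # uniform probability over starting haplotypes
    for j in range(m):
        if j > 0:
            # Probability of switching to any given haplotype, or staying put.
            p_switch = rho[j] / n
            p = (1 - rho[j]) + p_switch if path[j] == path[j - 1] else p_switch
            loglik += np.log(p)
        # Emission: match with prob 1 - mu, mismatch with prob mu.
        e = (1 - mu[j]) if ref_panel[path[j], j] == query[j] else mu[j]
        loglik += np.log(e)
    return loglik

# Two identical reference haplotypes give two distinct, equally likely paths.
ref_panel = np.array([[0, 1, 1], [0, 1, 1]])
query = np.array([0, 1, 1])
mu = np.full(3, 0.01)
rho = np.full(3, 0.1)

ll_a = path_loglik([0, 0, 0], query, ref_panel, mu, rho)
ll_b = path_loglik([1, 1, 1], query, ref_panel, mu, rho)
assert np.allclose(ll_a, ll_b)  # compare likelihoods, not the paths themselves
```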
Just to add a note here that in tsinfer `n` (i.e. the number of match nodes) is constant along the genome, given a ref. panel. So, the above tests should test the behaviour of the LS HMM Viterbi algorithm as in tsinfer, I think?
I have two sets of tests passing now. One set (`test_API_noncopy_manual.py`) contains manually crafted test examples, and the other set (`test_API_noncopy.py`) has more or less the same set of tests as `test_API_multiallelic.py`.
In both sets of tests, `NONCOPY` characters in the reference panel are not counted when computing the number of alleles per site (i.e. `n_alleles`), which is used for mutation rate scaling.
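A sketch of how the per-site allele count might be computed while skipping the negative sentinel values (assuming `NONCOPY = -2` and `MISSING = -1`; `get_num_alleles` is a hypothetical name, not necessarily the function in lshmm):

```python
import numpy as np

NONCOPY = -2  # assumed sentinel values
MISSING = -1

def get_num_alleles(ref_panel):
    """Count distinct alleles per site, ignoring NONCOPY and MISSING entries."""
    m = ref_panel.shape[1]
    n_alleles = np.zeros(m, dtype=int)
    for j in range(m):
        site = ref_panel[:, j]
        # Only non-negative values are real alleles.
        n_alleles[j] = len(set(site[site >= 0]))
    return n_alleles

ref_panel = np.array([
    [0, 1, NONCOPY],
    [1, NONCOPY, 2],
    [0, 1, 2],
])
print(get_num_alleles(ref_panel))  # [2 1 1]
```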
The set of tests to keep is in `test_API_noncopy.py`. There are several major differences between this test file and `test_API_multiallelic.py`, on which it is heavily based:

- A routine (`get_ancestral_haplotypes`) is used to extract ancestral haplotypes from it. When there is recombination in the past, partial haplotypes with `NONCOPY` denoting non-ancestral material are incorporated into the reference panel.
- `NONCOPY` and `MISSING` characters are not counted as distinct alleles (see related #33).

Probably I should delete `test_API_noncopy_manual.py`, unless some of the hand-crafted examples may be useful for testing. @jeromekelleher ?
Otherwise, I think the tests are ready for review @astheeggeggs, besides some more minor tweaks I'd like to make.
Ah, by modifying `lshmm/api.py`, both the haploid and diploid versions are affected, but I haven't added tests for the diploid version.
Also, I think we should probably include assertions that none of the most likely paths go through a base in the reference panel that is equal to `NONCOPY`.
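Such an assertion could look like this sketch (assuming `NONCOPY = -2` and a path given as the index of the haplotype copied at each site):

```python
import numpy as np

NONCOPY = -2  # assumed sentinel value

ref_panel = np.array([
    [0, NONCOPY, 1],
    [0, 1, 1],
])
path = np.array([0, 1, 1])  # haplotype index copied at each site

# The base actually copied at each site must never be NONCOPY.
copied = ref_panel[path, np.arange(ref_panel.shape[1])]
assert np.all(copied != NONCOPY)
```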
Looking at this again, I think it needs a rethink. We have emission probabilities that don't sum to 1, which doesn't make sense in the usual HMM framework.
I'm going to scrap most of the code here, since it makes sense to do refactoring before adding new features. I'll probably just keep some of the handcrafted test cases and the routine that gets a ref. panel with NONCOPY states.
Aside from rethinking this, we should probably refactor the code a bit first. See #37.
I've deleted this branch of code, but it is probably useful to keep the discussion here.
Thanks @astheeggeggs ! Please do not merge the code in this branch! Please look at the code in #41 instead.
Closing this PR because it's replaced by PR #41.
If ancestral haplotypes are used as part of a reference panel for LS matching, then non-copying states in them need to be treated differently than other states. When a non-copying state is encountered, the emission probability should be 0, because copying from it is not allowed. We arbitrarily set the numeric value of non-copying states to `-2`, which is distinct from the missing state, set to `-1`. Here, we modify how emission probabilities are computed in the haploid version of the Viterbi algorithm. We also add examples of reference panels, queries, and expected paths for testing.