szhan closed this issue 5 months ago
I've chatted with @jeromekelleher about this. It makes sense that MISSING (and NONCOPY) should not be counted as distinct alleles when mutation rates are scaled.
If I'm understanding these code snippets correctly, then MISSING is counted when computing n_alleles.
https://github.com/astheeggeggs/lshmm/blob/f94dd055fc3a68366950a7806d9521232fcef07e/lshmm/api.py#L28
In #31, I modified those functions to exclude MISSING and NONCOPY when computing n_alleles.
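To make the intended counting concrete, here is a rough sketch (not the code from #31). It assumes the panel is stored as one row of alleles per site, and that MISSING and NONCOPY are the sentinel integers -1 and -2. Those values and the function name count_distinct_alleles are assumptions for illustration only.

```python
# Assumed sentinel values, for illustration only.
MISSING = -1
NONCOPY = -2


def count_distinct_alleles(ref_panel, query):
    """Count distinct alleles per site, ignoring MISSING and NONCOPY.

    ref_panel: list of m rows, one per site, each holding the alleles of
               the reference haplotypes at that site.
    query:     list of m alleles for the query haplotype.
    """
    if len(ref_panel) != len(query):
        raise ValueError("ref_panel and query must cover the same sites")
    n_alleles = []
    for site_alleles, query_allele in zip(ref_panel, query):
        # Pool the alleles seen in the panel and in the query at this site.
        observed = set(site_alleles)
        observed.add(query_allele)
        # Sentinels are states, not alleles, so drop them before counting.
        observed.discard(MISSING)
        observed.discard(NONCOPY)
        n_alleles.append(len(observed))
    return n_alleles


ref_panel = [
    [0, 0, 1],
    [0, MISSING, 0],
    [1, NONCOPY, 2],
]
query = [1, 0, MISSING]
print(count_distinct_alleles(ref_panel, query))
```

For this toy panel the counts are 2, 1 and 2, whereas counting the sentinels as alleles would give 2, 2 and 4.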
@astheeggeggs Can you confirm my understanding when you get the chance? Thanks!
I definitely agree that it shouldn't be counted as a distinct allele. It looks like you're right with regard to the function, well spotted.
I'm going to fix this in a PR separate from #31.
To scale mutation rates per site, we need to figure out the number of distinct alleles at each site. I'm trying to understand how the MISSING state is factored in when computing per-site emission probabilities under scaled mutation rates.
I'm looking at a code snippet in the test_API.py file here. H = ref. haplotypes, s = query haplotype, and m = number of sites.
It gets the number of distinct alleles in both the ref. panel and query at each site j, including MISSING. n_alleles is then fed to haplotype_emission here.
The following code snippet computes the emission probabilities when doing scaling.
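To see why the count matters, here is a minimal sketch of scaled emission probabilities. It assumes a scheme where a match has probability 1 - mu and a mismatch has probability mu / (n_alleles - 1). The function name scaled_emission is hypothetical, and this is not copied from lshmm.

```python
def scaled_emission(mu, n_alleles):
    """Return (mismatch_prob, match_prob) for one site.

    Assumed scheme: the per-site mutation probability mu is split evenly
    across the (n_alleles - 1) alleles a haplotype could mutate to.
    """
    if n_alleles < 1:
        raise ValueError("n_alleles must be at least 1")
    if n_alleles == 1:
        # Assumed handling of monomorphic sites: no mismatch is allowed.
        return 0.0, 1.0
    return mu / (n_alleles - 1), 1.0 - mu


mu = 0.1
# A site with alleles {0, 1} plus some MISSING entries.
without_missing = scaled_emission(mu, 2)  # MISSING excluded from the count
with_missing = scaled_emission(mu, 3)     # MISSING counted as a third allele
print(without_missing, with_missing)
```

Under these assumptions, counting MISSING as an allele halves the mismatch probability at a biallelic site (0.05 instead of 0.1), while the match probability is unchanged.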
MISSING isn't really an allele, right? It is a state in the HMM, as I understand it. So, the interpretation is a bit weird to me. If we treat MISSING as another allele in the LS HMM, then is it like saying that a query can mutate to MISSING at a site? Ahhhh, maybe I'm misunderstanding this altogether?