MISSING as a distinct allele?

szhan commented 6 months ago

To scale mutation rates per site position, we need to figure out the number of distinct alleles at each site position. I'm trying to understand how MISSING state is factored in when computing per-site emission probabilities when mutation rates are scaled.

I'm looking at a code snippet in test_API.py file here.

n_alleles = np.int8(
    [len(np.unique(np.append(H[j, :], s[:, j]))) for j in range(m)]
)

H = ref. haplotypes, s = query haplotype, and m = number of sites.

It gets the number of distinct alleles in both the ref. panel and query at each site j, including MISSING.

n_alleles is then fed to haplotype_emission here.

The following code snippet computes the emission probabilities when doing scaling.

if scale_mutation_based_on_n_alleles:
    e[:, 0] = mu - mu * np.equal(
        n_alleles, np.ones(m)
    )  # Added boolean in case we're at an invariant site
    e[:, 1] = 1 - (n_alleles - 1) * mu

MISSING isn't really an allele, right? It is a state in the HMM, as I understand it. So, the interpretation is a bit weird to me. If we treat MISSING as another allele in the LS HMM, then is it like saying that a query can mutate to MISSING at a site? Ahhhh, maybe I'm misunderstanding this altogether?

szhan commented 6 months ago

I've chatted with @jeromekelleher about this. It makes sense that MISSING (and NONCOPY) should not be counted as distinct alleles when mutation rates are scaled.

If I'm understanding these code snippets correctly, then MISSING is counted when computing n_alleles.

https://github.com/astheeggeggs/lshmm/blob/f94dd055fc3a68366950a7806d9521232fcef07e/lshmm/api.py#L28

https://github.com/astheeggeggs/lshmm/blob/f94dd055fc3a68366950a7806d9521232fcef07e/lshmm/api.py#L134

In #31, I modified those functions to exclude MISSING and NONCOPY when computing n_alleles.

@astheeggeggs Can you confirm my understanding when you get the chance? Thanks!

astheeggeggs commented 6 months ago

I definitely agree that it shouldn't be counted as a distinct allele. It looks like you're right with regard to the function, well spotted.

szhan commented 6 months ago

I'm going to fix this in a PR separate from #31.

astheeggeggs / lshmm

MISSING as a distinct allele? #33