PlantandFoodResearch / MCHap

Polyploid micro-haplotype assembly using Markov chain Monte Carlo simulation.
MIT License

Check probability encoding of SNPs #56

Closed: timothymillar closed this issue 4 years ago

timothymillar commented 4 years ago

Currently the error prob is being treated as the "probability that the true allele is not the called allele". The alternative interpretation is the "probability that the call is meaningless", in which case some portion of that probability should be assigned to the called allele, i.e. the calling went wrong but happened to result in a correct call.

This difference has a negligible effect on haplotype calls but has technical implications when incorporating MNPs #51.

timothymillar commented 4 years ago

There appear to be two ways to interpret an error rate e for a SNP call:

  1. The probability that the called allele is correct is 1 - e and the probability that any other allele is the true allele is e / (a - 1), where a is the number of possible alleles, i.e. 4.
  2. The error rate is the probability that the call is a random choice among all alleles, including the correct one, e.g. the probability that the called allele is correct is (1 - e) + e / a and the probability of any other allele being correct is e / a.

In the case of an enforced bi-allelic SNP (i.e. a constraint that the allele call must be one of two possibilities) there are two ways to generalize the above calculations such that the probabilities still sum to 1:

  a. Use a = 2.
  b. Calculate as though all 4 alleles are possible, then remove non-allowed alleles and normalize the result.

For example, in an unconstrained SNP with a reference call and e = 0.3 the possible encodings are:

  1. ref = 0.7 and each alt = 0.1
  2. ref = 0.775 and each alt = 0.075

If we add the constraint that the allele is bi-allelic then the possible (normalized) encodings are:

  1a. ref = 0.7, alt = 0.3
  1b. ref = 0.875, alt = 0.125
  2a. ref = 0.85, alt = 0.15
  2b. ref ≈ 0.912, alt ≈ 0.088

If a non-allowed call has been made then the allele is encoded as a gap (-). In practice the error rates are much smaller than 0.3 so the difference will be much smaller.
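For reference, a minimal numpy sketch of how those encodings can be derived (the helper names are purely illustrative):

import numpy as np

def interpretation_1(e, a):
    # called allele = 1 - e, every other allele = e / (a - 1)
    probs = np.full(a, e / (a - 1))
    probs[0] = 1 - e  # index 0 is the called (reference) allele
    return probs

def interpretation_2(e, a):
    # every allele gets e / a, plus 1 - e on the called allele
    probs = np.full(a, e / a)
    probs[0] += 1 - e
    return probs

e = 0.3

# option a: use a = 2 directly
print(interpretation_1(e, 2))  # 1a -> [0.7, 0.3]
print(interpretation_2(e, 2))  # 2a -> [0.85, 0.15]

# option b: compute over 4 alleles, drop the non-allowed ones, then normalize
p1 = interpretation_1(e, 4)[:2]
print(p1 / p1.sum())           # 1b -> [0.875, 0.125]
p2 = interpretation_2(e, 4)[:2]
print(p2 / p2.sum())           # 2b -> approx [0.912, 0.088]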

Currently we are using option 1a. Interpretation 1 of the error rate appears to be more common and is probably the most intuitive. Option 1a for constrained situations seems incorrect as it only increases the probability of the alt allele. Option 1b seems more appropriate but raises the question of whether normalization is necessary here.

If encoding MNPs #51 then non-normalized distributions arise due to reads with only partial coverage of the MNP.

timothymillar commented 4 years ago

Function for option 2a, if needed:


import numpy as np


def as_probabilistic(array, n_alleles, p=1.0, dtype=float):
    # check inputs
    array = np.array(array, copy=False)
    n_alleles = np.array(n_alleles, copy=False)
    p = np.array(p, copy=False)

    # special case for zero-length reads
    if array.shape[-1] == 0:
        return np.empty(array.shape + (0,), dtype=dtype)

    max_allele = np.max(n_alleles)
    n_alleles = n_alleles[..., None]

    # onehot encoding of alleles
    new = (array[..., None] == np.arange(0, max_allele)).astype(dtype)

    # probability that call is correct
    new *= p[..., None]

    # even probability if incorrect call
    new += (1 - np.expand_dims(p, -1)) / n_alleles

    # fill gaps with nans
    new[array < 0] = np.nan

    # fill non-alleles
    new[..., n_alleles <= np.arange(max_allele)] = 0

    return new
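A quick usage example for the function above, assuming numpy is imported as np (the input values are arbitrary):

reads = np.array([0, 1, -1])  # two allele calls plus a gap (-1)
probs = as_probabilistic(reads, n_alleles=[2, 2, 2], p=0.7)
# with p = 0.7 the calls encode as [0.85, 0.15] and [0.15, 0.85] (option 2a)
# and the gap encodes as [nan, nan]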
timothymillar commented 4 years ago

Function for option 1b:

import numpy as np


def as_probabilistic(array, n_alleles, p=1.0, error_factor=3, dtype=float):
    array = np.array(array, copy=False)
    n_alleles = np.array(n_alleles, copy=False)
    error_factor = np.array(error_factor, copy=False)
    p = np.array(p, copy=False)

    # special case for zero-length reads
    if array.shape[-1] == 0:
        return np.empty(array.shape + (0,), dtype=dtype)

    vector_length = np.max(n_alleles)

    # onehot encoding of alleles
    onehot = array[..., None] == np.arange(vector_length)

    # basic probs
    new = ((1 - p) / error_factor)[..., None] * ~onehot
    calls = p[..., None] * onehot
    new[onehot] = calls[onehot]

    # zero out non-alleles
    new[..., n_alleles[..., None] <= np.arange(vector_length)] = 0

    # normalize 
    new /= np.nansum(new, axis=-1, keepdims=True)

    # nan fill gaps and re-zero
    new[array < 0] = np.nan
    new[..., n_alleles[..., None] <= np.arange(vector_length)] = 0
    return new
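A quick usage example for this version, assuming numpy is imported as np (the input values are arbitrary):

reads = np.array([0, 0])  # two reference calls
probs = as_probabilistic(reads, n_alleles=[2, 4], p=0.7)
# the first site is constrained to 2 alleles so it is renormalized to [0.875, 0.125, 0, 0]
# the second site allows all 4 alleles and remains [0.7, 0.1, 0.1, 0.1]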
timothymillar commented 4 years ago

Decided on option 1b as this seems to be the best combination of being intuitive and correctly interpreting the error probability of a base call in a read.

This is not perfect because the bi/tri-allelic constraint implies additional prior knowledge that not all of the other alleles are equally likely, but it seems to be the most sensible/conservative solution that doesn't require per-SNP input from the user.

timothymillar commented 4 years ago

Done in #57

timothymillar commented 4 years ago

If we remove some of the options (bi/tri-allelic constraint) then the probabilities of the remaining options should be scaled proportionally such that they sum to 1.

On second thought, this is true in the general case of encoding a generic probability distribution, but here we are technically just encoding the probability of identity between reads and haplotypes. This is only used in the context of the likelihood function P(reads | haplotypes), and hence the haplotypes are treated as constants. Any bi/tri-allelic constraints are applied to the haplotypes via the proposal distribution.

The likelihood function is asking "what is the probability of this read coming from this genotype?", in which case the read can encode all possible alleles, not just those that the haplotypes are limited to. This means that a non-allowed variant call in a read could be encoded as e / a for all allowed alleles.
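A minimal numeric sketch of that last point (a hypothetical helper, not the actual implementation), using e = 0.3 and a = 4:

import numpy as np

e = 0.3
a = 4  # number of possible nucleotides

def encode_call(call):
    # probabilities over all a alleles: e / a for every allele,
    # plus 1 - e on the called allele, with no normalization
    probs = np.full(a, e / a)
    probs[call] += 1 - e
    return probs

# the read calls allele 2, but the haplotypes at this site only carry alleles 0 and 1
read_site = encode_call(2)
for hap_allele in (0, 1):
    # the per-site contribution to P(read | haplotype) is e / a = 0.075
    # for either allowed allele, so no renormalization over allowed alleles is needed
    print(read_site[hap_allele])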

timothymillar commented 4 years ago

Done in #59

timothymillar commented 4 years ago

It also looks like this is a generalized equivalent to the approach used by ANGSD and GATK.