timothymillar closed this issue 4 years ago
There appear to be two ways to interpret an error rate e for a SNP call: (1) as the probability that the true allele is not the called allele, or (2) as the probability that the call itself is meaningless (in which case the call may still happen to be correct).
In the case of an enforced bi-allelic SNP (i.e. a constraint that the allele call must be one of two possibilities) there are two ways to generalize the unconstrained calculations such that the probabilities still sum to 1:
a. Use a = 2
b. Calculate as though all 4 alleles are possible, then remove non-allowed alleles and normalize the result
For example, in an unconstrained SNP with a reference call and e = 0.3, the possible encodings are:
[0.7, 0.1, 0.1, 0.1] (interpretation 1: 1 - e for the call, e/3 for each other allele)
[0.775, 0.075, 0.075, 0.075] (interpretation 2: 1 - e + e/4 for the call, e/4 for each other allele)
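These two encodings can be reproduced directly from e and the number of possible alleles a (a quick sketch, not code from this package):

```python
import numpy as np

e = 0.3  # error rate
a = 4    # number of possible alleles

# interpretation 1: the true allele is not the called allele,
# so e is split evenly among the a - 1 other alleles
interp1 = np.full(a, e / (a - 1))
interp1[0] = 1 - e  # reference call at index 0

# interpretation 2: the call itself is meaningless, so e is split
# evenly among all a alleles (including the called one)
interp2 = np.full(a, e / a)
interp2[0] = (1 - e) + e / a

print(interp1)  # [0.7 0.1 0.1 0.1]
print(interp2)  # [0.775 0.075 0.075 0.075]
```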
If we add the constraint that the SNP is bi-allelic then the possible (normalized) encodings are:
[0.7, 0.3] (1a: use a = 2)
[0.875, 0.125] (1b: = [0.7, 0.1] / 0.8)
[0.85, 0.15] (2a: = [0.7, 0.0] + 0.3/2)
[0.91176..., 0.0882...] (2b: = [0.775, 0.075] / 0.85)
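For reference, the four constrained encodings above can be reproduced with a few lines (a sketch using the same e = 0.3 example, not code from this package):

```python
import numpy as np

e = 0.3

# 1a: treat the locus as having a = 2 alleles (interpretation 1)
opt_1a = np.array([1 - e, e])

# 1b: interpretation 1 over 4 alleles, drop non-allowed alleles, normalize
opt_1b = np.array([1 - e, e / 3])
opt_1b /= opt_1b.sum()  # divide by 0.8

# 2a: interpretation 2 with a = 2
opt_2a = np.array([1 - e + e / 2, e / 2])

# 2b: interpretation 2 over 4 alleles, drop non-allowed alleles, normalize
opt_2b = np.array([1 - e + e / 4, e / 4])
opt_2b /= opt_2b.sum()  # divide by 0.85

print(opt_1a, opt_1b, opt_2a, opt_2b)
```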
If a non-allowed call has been made then the allele is encoded as a gap (-).
In practice error rates are much smaller than 0.3, so the differences between these options will be much smaller.
Currently we are using option 1a. Interpretation 1 of the error rate appears to be more common and is probably the most intuitive. However, option 1a seems incorrect for constrained situations, as it only increases the probability of the alt allele. Option 1b seems more appropriate, but raises the question of whether normalization is necessary here.
If encoding MNPs (#51) then non-normalized distributions arise due to reads with only partial coverage of the MNP.
Function for 2a, if needed:

```python
import numpy as np

def as_probabilistic(array, n_alleles, p=1.0, dtype=float):
    # check inputs
    array = np.asarray(array)
    n_alleles = np.asarray(n_alleles)
    p = np.asarray(p)
    # special case for zero-length reads
    if array.shape[-1] == 0:
        return np.empty(array.shape + (0,), dtype=dtype)
    max_allele = np.max(n_alleles)
    n_alleles = n_alleles[..., None]
    # onehot encoding of alleles
    new = (array[..., None] == np.arange(0, max_allele)).astype(dtype)
    # probability that call is correct
    new *= p[..., None]
    # even probability across all n_alleles if incorrect call
    new += (1 - np.expand_dims(p, -1)) / n_alleles
    # fill gaps with nans
    new[array < 0] = np.nan
    # fill non-alleles with zeros
    new[..., n_alleles <= np.arange(max_allele)] = 0
    return new
```
Function for 1b:

```python
import numpy as np

def as_probabilistic(array, n_alleles, p=1.0, error_factor=3, dtype=float):
    # check inputs
    array = np.asarray(array)
    n_alleles = np.asarray(n_alleles)
    error_factor = np.asarray(error_factor)
    p = np.asarray(p)
    # special case for zero-length reads
    if array.shape[-1] == 0:
        return np.empty(array.shape + (0,), dtype=dtype)
    vector_length = np.max(n_alleles)
    # onehot encoding of alleles
    onehot = array[..., None] == np.arange(vector_length)
    # basic probs: (1 - p) / error_factor for each non-called allele
    new = (((1 - p) / error_factor)[..., None] * ~onehot).astype(dtype)
    # probability that call is correct
    calls = p[..., None] * onehot
    new[onehot] = calls[onehot]
    # zero out non-alleles
    new[..., n_alleles[..., None] <= np.arange(vector_length)] = 0
    # normalize
    new /= np.nansum(new, axis=-1, keepdims=True)
    # nan fill gaps and re-zero non-alleles
    new[array < 0] = np.nan
    new[..., n_alleles[..., None] <= np.arange(vector_length)] = 0
    return new
```
Decided on option 1b, as this seems to be the best combination of being intuitive and correctly interpreting the error probability of a base call in a read.
This is not perfect, because the bi-/tri-allelic constraint implies additional prior knowledge that not all other alleles are equally likely, but it seems to be the most sensible/conservative solution that doesn't require per-SNP input from the user.
Done in #57
If we remove some of the options (bi-/tri-allelic constraint) then the probabilities of the remaining options should be scaled proportionally such that they sum to 1.
On second thought this is true in the general case for encoding a generic probability distribution but here we are technically just encoding the probability of identity between reads and haplotypes. This is only used in the context of the likelihood function P(reads | haplotypes) and hence the haplotypes are treated as constants. Any bi/tri-allelic constraints are applied to the haplotypes via the proposal distribution.
The likelihood function is asking "what is the probability of this read coming from this genotype?", in which case the read can encode all possible alleles, not just those that the haplotypes are limited to. This means that a non-allowed variant call in a read could be encoded as e / a for all allowed alleles.
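As a rough sketch of this point (the `read_likelihood` helper here is hypothetical, not the package API): the read is encoded over all four possible alleles, while the haplotype passed to the likelihood is a constant sequence of allele indices to which any bi-allelic constraint has already been applied.

```python
import numpy as np

# read encoded over all 4 possible alleles (interpretation 1, e = 0.3):
# two positions, each with the reference allele (index 0) called
read = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.7, 0.1, 0.1, 0.1],
])

# a haplotype is a constant sequence of allele indices; any bi-allelic
# constraint is applied when proposing haplotypes, not when encoding reads
haplotype = np.array([0, 1])

def read_likelihood(read, haplotype):
    # product over positions of the read's probability
    # for the haplotype's allele at that position
    return np.prod(read[np.arange(len(haplotype)), haplotype])

print(read_likelihood(read, haplotype))  # 0.7 * 0.1
```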
Done in #59
It also looks like this is a generalized equivalent of the approach used by ANGSD and GATK.
Currently the error prob is being treated as the "probability that the true allele is not the called allele". The alternate interpretation is the "probability that the call is meaningless", in which case some portion of that probability should be assigned to the called allele, i.e. the calling went wrong but happened to result in a correct call.
This difference has a negligible effect on haplotype calls but has technical implications when incorporating MNPs #51.
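To illustrate why the effect is negligible (a quick check, not from the codebase): under interpretation 2 the called allele gains an extra e / a relative to interpretation 1, which vanishes for realistic error rates.

```python
a = 4  # number of possible alleles
diffs = {}

for e in (0.3, 0.01, 0.001):
    p1 = 1 - e          # interpretation 1: prob assigned to the call
    p2 = 1 - e + e / a  # interpretation 2: call may be correct by chance
    diffs[e] = p2 - p1  # difference is e / a, vanishing for small e
    print(e, diffs[e])
```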