biocore / deblur

Deblur is a greedy deconvolution algorithm based on known read error profiles.

Derivation of error profile #180

Closed ghost closed 6 years ago

ghost commented 6 years ago

My sense from the paper is that the alpha, beta, and indel parameters were empirically determined from runs within the Knight lab. Can you clarify how those reads were generated? Were they 2x150 or 1x150 or 2x250? Would you expect these parameters to be lower for 2x250 than for 1x150?

I'm also a bit confused as to what the parameters represent. In the source code https://github.com/biocore/deblur/blob/master/deblur/deblurring.py#L82 it states that mean_error (aka alpha from the paper) is 0.005, which is the "mean illumina error". Is that a per-base error rate (i.e. 0.5% of all bases are incorrect) or a per-sequence error rate (i.e. 0.5% of reads have at least one error)? Also, is the indel error rate on a per-base or per-sequence basis?

Regardless of whether it's on a per-base or per-sequence basis, why not use a binomial distribution to generate the beta terms? If the per-base error rate is 0.005, then I'd expect the beta values to look something more like this...

> dbinom(c(0:11), prob=0.005, size=250)
 [1] 2.856079e-01 3.588039e-01 2.244778e-01 9.325041e-02 2.893574e-02
 [6] 7.153962e-03 1.467940e-03 2.571267e-04 3.924735e-05 5.303104e-06
[11] 6.422352e-07 7.041409e-08

Can you shed light on where the betas come from and perhaps why a more straightforward model wasn't used to pick the values?

amnona commented 6 years ago

Hi, lots of questions :) let's go through them one by one:

  1. mean_error (0.005) is the mean per-base error probability. Note that this is not the Hamming-distance upper bound on the error but rather the average probability of getting an error at a random base in the sequence. It is used to infer the original number of reads for a given sequence (before all the errors) from the number of reads observed for that sequence (after sequencing). This is done using S' = S*(1-mean_error)^read_length (to get an error-free read we need no error at the first position, the second position, and so on). While the mean_error could be folded into the upper-bound error profile (so we could just use the Hamming-distance-dependent upper-bound error profile), separating the two lets deblur be independent of the read length (otherwise the read length would affect the error profile).
  2. The indel error rate is the per-nucleotide upper bound on the probability of getting an insertion/deletion. We use it in a slightly different way than the mismatch profile (we count up to 3 indels as one event, and assume more than 3 indels have probability 0) since these errors originate differently.
  3. We also have the Hamming-distance-dependent mismatch probability upper bound. Since the mismatch probability depends on a lot of factors (such as local sequence context and read position), we do not use the mean error probability (which can be too low in some cases) but rather an upper bound on the probability (so we will not get false positives).
  4. For the mismatch probability upper bound, we use a Hamming-distance-dependent profile rather than just assuming independence (I think this is what you meant by "binomial"), since based on our observations, errors are not independent: the probability of getting 2 mismatches in the same sequence is not the probability of getting one error squared. A simplified sketch below shows how these pieces fit together.
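
To make the distinction concrete, here is a simplified sketch (not the actual deblur code; the error_profile numbers below are placeholders, not deblur's real defaults) of how a read-length correction, an indel bound, and a Hamming-distance-dependent mismatch bound could combine into an upper bound on the number of reads of a neighboring sequence that are explained by errors from a source sequence:

def reads_explained_by_errors(observed_reads, read_length, hamming_dist, n_indels,
                              mean_error=0.005, indel_prob=0.01, indel_max=3,
                              error_profile=(1.0, 0.06, 0.02, 0.01, 0.005)):
    # error_profile[d] = illustrative upper bound on the fraction of a source
    # sequence's reads that can show up at Hamming distance d (placeholder values)
    # 1) read-length correction: probability a read of this length is error-free;
    #    observed ~ original * error_free, so invert to estimate the pre-error count
    error_free = (1.0 - mean_error) ** read_length
    original_reads = observed_reads / error_free
    # 2) indels: up to indel_max indels count as a single event, more are impossible
    if n_indels > indel_max:
        return 0.0
    indel_factor = indel_prob if n_indels > 0 else 1.0
    # 3)+4) Hamming-distance-dependent mismatch upper bound (not a binomial,
    #    since errors within a read are not independent); clamp larger distances
    d = min(hamming_dist, len(error_profile) - 1)
    return original_reads * error_profile[d] * indel_factor

With these placeholder numbers, reads_explained_by_errors(1000, 150, 2, 0) works out to roughly 42 reads: 1000 observed reads imply about 2121 reads before errors, and at most 2% of those are allowed to appear as a 2-mismatch neighbor.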

In general, we based the error modeling and profile on multiple sources of data: mock mixtures (where we know the ground truth more or less), runs with a spike-in of a known sequence (so again we can see how the errors diffuse from this known sequence), and regular experiments (where, if we see bacteria behaving similarly across the samples, with one much lower than the other, we can assume it is a read error). The error modeling is based on forward reads of length 150. Based on our experience, reverse reads have a much higher error rate, and in the current deblur implementation, in which the error profile is position-independent, using an upper bound based on the reverse reads would incur a large cost for the forward reads. Regarding the read length, the main factor that affects the performance of deblur on >150bp reads is the mean_error. The 0.005 is a bit on the high side for most Illumina runs, and since it is raised to the power of the read length, the effect is bigger for very long reads, so deblur behaves more conservatively (it may remove more sequences that are not actually due to read error).
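
As a quick back-of-the-envelope illustration of this read-length effect (just Python arithmetic, not deblur code), the error-free probability (1 - mean_error)^read_length shrinks noticeably going from 150 bp to 250 bp reads:

mean_error = 0.005
for read_length in (150, 250):
    # probability that a read of this length has no errors at all
    error_free = (1 - mean_error) ** read_length
    print(read_length, round(error_free, 3))
# prints roughly: 150 0.471 and 250 0.286

The 250 bp value is the same quantity as the dbinom(0, prob=0.005, size=250) term quoted in the question (the probability of zero errors in 250 bases); the smaller this factor gets, the larger the correction deblur applies, which is why a fixed 0.005 behaves more conservatively on longer reads.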

Does this make sense? Let me know if there are more questions.

Cheers, Amnon


wasade commented 6 years ago

Closing as it seems like the questions were addressed. Please reopen if needed.