Hi, lots of questions :) let's go one by one:
In general, we based the error modeling and profile on multiple data sources: mock mixtures (where we know the ground truth more or less), runs with a spike-in of a known sequence (so again we can see how errors diffuse from that known sequence), and regular experiments (where, if we see many bacteria behaving similarly across samples, with one much lower in abundance than the others, we can assume it is a read error).

The error modeling is based on forward reads of length 150. In our experience, reverse reads have a much higher error rate, and since the error profile in the current deblur implementation is position independent, using an upper bound based on the reverse reads would incur a large cost for the forward reads.

Regarding read length, the main factor that affects the performance of deblur on >150 bp reads is the `mean_error`. The value of 0.005 is a bit on the high side for most Illumina runs, and since it is raised to the power of the read length, the effect grows with very long reads, so deblur behaves more conservatively (it may remove more sequences that are not due to read error).
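To make the read-length point concrete, here is a rough back-of-the-envelope sketch (assuming a per-base rate applied independently at every position; this is an illustration of the effect, not the exact computation in deblurring.py):

```python
# Illustrative sketch only -- not the exact deblur code path.
# Shows how a fixed per-base mean_error has a larger effect on longer reads
# when it is compounded over every position of the read.
mean_error = 0.005  # default "mean illumina error" used by deblur

for read_len in (100, 150, 250):
    p_error_free = (1 - mean_error) ** read_len
    print(f"length {read_len}: P(read has no errors) ~ {p_error_free:.3f}")

# length 100: P(read has no errors) ~ 0.606
# length 150: P(read has no errors) ~ 0.471
# length 250: P(read has no errors) ~ 0.286
```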
Does this make sense? Let me know if there are more questions.
Cheers, Amnon
On Wed, Aug 29, 2018 at 8:07 PM polypay123 notifications@github.com wrote:
My sense from the paper is that the alpha, beta, and indel parameters were empirically determined from runs within the Knight lab. Can you clarify how those reads were generated? Were they 2x150 or 1x150 or 2x250? Would you expect these parameters to be lower for 2x250 than for 1x150?
I'm also a bit confused as to what the parameters represent. In the source code https://github.com/biocore/deblur/blob/master/deblur/deblurring.py#L82 it states that mean_error (aka alpha from the paper) is 0.005, which is the "mean illumina error". Is that a per base error rate (i.e. 0.5% of all bases are incorrect) or a per sequence error rate (i.e. 0.5% of reads have at least one error)? Also, is the indel error rate on a per base or sequence basis? Regardless of whether it's on a per base or sequence basis, why not use a binomial distribution to generate the beta terms?
Thanks for the clarification and sorry for all the questions!
Closing as it seems like the questions were addressed. Please reopen if needed.
My sense from the paper is that the `alpha`, `beta`, and `indel` parameters were empirically determined from runs within the Knight lab. Can you clarify how those reads were generated? Were they 2x150 or 1x150 or 2x250? Would you expect these parameters to be lower for 2x250 than for 1x150?

Regardless of whether it's on a per base or sequence basis, why not use a binomial distribution to generate the beta terms? If the per-base error rate is 0.005, then I'd expect the `beta` values to look something more like this...

Can you shed light on where the betas come from and perhaps why a more straightforward model wasn't used to pick the values?
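For context, here is a minimal sketch of the binomial model the question alludes to (the variable names and the independence assumption are illustrative; this is not how deblur's default `beta` values were actually derived): with a per-base error rate p and read length L, the probability of exactly k erroneous bases is C(L, k) * p^k * (1 - p)^(L - k).

```python
# Illustrative only: binomial-style error weights as suggested in the question.
# Assumes independent, position-independent substitution errors; this is NOT
# the procedure deblur uses to pick its default beta values.
from math import comb

p = 0.005   # hypothetical per-base error rate
L = 150     # read length

binomial_betas = [comb(L, k) * p**k * (1 - p)**(L - k) for k in range(6)]
for k, b in enumerate(binomial_betas):
    print(f"P({k} errors in a {L} bp read) = {b:.4g}")
```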