benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
474 stars 143 forks source link

Question regarding the computation of the p-value #1853

Closed pereiramemo closed 11 months ago

pereiramemo commented 1 year ago

Dear DADA2 developers,

There is section in the paper that describes how DADA2 works (https://www.nature.com/articles/nmeth.3869), that I am not sure how to interpret. That is, "The abundance p-value" section in the "Online Methods". Here it says: "Let unique sequence i with abundance ai be in partition j containing nj reads. Then, conditional on i being read at least once, the abundance p- value is the probability of seeing nj or more identical reads (ρpois is the Poisson density function)"

My understanding is that here it should say "the probability of seeing aj or more identical reads". Could please confirm that the text in the paper is right.

Many thanks,

Emiliano

benjjneb commented 1 year ago

Yes, that looks like a mistake in our notation. Your suggested fix is mostly correct, except that following our subscript notation it would be "a_i or more identical reads" (not a_j).

pereiramemo commented 1 year ago

Yes, a_i. Thanks for the response!

pereiramemo commented 1 year ago

Dear Dr. Benjamin Callahan,

I have one more question that I wanted to ask regarding how the expected number of reads of sample sequence j is determined (i.e., _nj).

Specifically, I do not understand why the probability of obtaining a zero-error sequence is not considered when determining the _nj that is used to compute the expectation of the Poisson distributions (i.e., _n_j*lambdaji when testing if sequence i can be explained by errors).

My logic is that there is real sample abundance, call it _n_jsample, and what we observe, that is _nj, is actually this real sample abundance times the probability of obtaining a zero error sequence: _n_j_samplelambdajj.
In this case, the expected sample number of reads would be _n_j/lambdajj (i.e., _n_j_sample = n_j/lambdajj) and we would use this _n_jsample to compute the expectation of the Poisson distributions (i.e., _n_j_sample
lambdaji when testing if sequence i can be explained by errors).

Thus my question is: would it make sense to consider the probability of obtaining a zero-error sequence to determine _nj that is used to compute the expectation of the Poisson distributions (i.e., _n_j*lambdaji)?

I hope this question makes sense and I apologize for taking your time if it does not.

Many thanks,

Emiliano

benjjneb commented 11 months ago

Agree that the potential error sequences need to be accounted for when defining the parameter of the poisson distribution evaluating whether errors could explain the observed reads.

In dada2 at the algorithimc level, the partition abundance (n) is all the reads that are at that point considered as potentially derived from the partition-defining sequence.

This approach is generally more conservative in calling new ASVs.

pereiramemo commented 11 months ago

This clarifies my understanding. Many thanks for your answer.

Best,

Emiliano