benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Questions about "pool" option in dada function #1143

Closed Listen-Lii closed 3 years ago

Listen-Lii commented 4 years ago

Dear Dr. Callahan,
I ran DADA2 on a large data set (700 samples, over 63 million reads), but the pool parameter of the dada function (pool = TRUE or FALSE) confused me. Here are my questions:

  1. In the original paper, “DADA2: High-resolution sample inference from Illumina amplicon data”, five data sets were used to test DADA2. According to the R code, the three mock community data sets (HMP, Balanced, and Extreme) were processed sample by sample (pool = FALSE), while the two real data sets (the pregnancy microbiome data and the mouse fecal data) were processed with pooled samples (pool = TRUE). Why were different settings chosen for these data sets?

  2. The official DADA2 workflows for small multi-sample data sets and for big data (http://benjjneb.github.io/dada2/index.html) use pool = FALSE. With my data, pool = FALSE produces results in ~3 days. With pool = TRUE, however, the program was killed by our server after ~15 days of running. Have you tested the speed of DADA2 on a data set as large as ours?

  3. Following on from Q2, I chose five samples from our data set and tested both methods. All parameters were kept the same except pool = FALSE or TRUE in the dada function. Surprisingly, the two methods gave very different ASV counts: the un-pooled method found 2- to 8-fold fewer ASVs than the pooled method. Why are the results so different? Our research depends on estimating the number of species as accurately as possible, so which method should I use? Any guidance you can give me would be greatly appreciated. Sincerely, thanks.

termithorbor commented 4 years ago

This is what the tutorial says about it: https://benjjneb.github.io/dada2/tutorial.html

Extensions: By default, the dada function processes each sample independently. However, pooling information across samples can increase sensitivity to sequence variants that may be present at very low frequencies in multiple samples. The dada2 package offers two types of pooling. dada(..., pool=TRUE) performs standard pooled processing, in which all samples are pooled together for sample inference. dada(..., pool="pseudo") performs pseudo-pooling, in which samples are processed independently after sharing information between samples, approximating pooled sample inference in linear time.

So, as I understand it, the pseudo approach could be quite suitable for your data.
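
For anyone who wants to try the three modes side by side on a small subset first, here is a rough sketch of how that might look. It is not taken from the tutorial: `compare_pool_modes` and `filt_paths` are hypothetical names, and it assumes you already have several quality-filtered FASTQ files and the dada2 package installed. The error model is learned once and reused, so only the `pool` argument differs between runs.

```r
# Sketch only: `compare_pool_modes` and `filt_paths` are hypothetical names,
# not part of dada2. Assumes `filt_paths` points to several already-filtered
# FASTQ files and that the dada2 package is installed.
compare_pool_modes <- function(filt_paths, multithread = TRUE) {
  if (!requireNamespace("dada2", quietly = TRUE)) {
    stop("The dada2 package is required for this sketch.")
  }

  # Learn the error model once and dereplicate once, so that the only
  # thing that changes between runs is the `pool` argument.
  err <- dada2::learnErrors(filt_paths, multithread = multithread)
  derep <- dada2::derepFastq(filt_paths)

  # The three settings described in the tutorial excerpt above.
  modes <- list(independent = FALSE, pseudo = "pseudo", pooled = TRUE)

  # For each mode, run sample inference and count the distinct ASVs
  # across all samples (before any chimera removal).
  vapply(modes, function(pool_mode) {
    dd <- dada2::dada(derep, err = err, pool = pool_mode,
                      multithread = multithread)
    ncol(dada2::makeSequenceTable(dd))
  }, integer(1))
}
```

Calling something like `compare_pool_modes(list.files("filtered", full.names = TRUE))` (the directory name is made up) would return a named integer vector with one ASV count per mode. These counts are taken before chimera removal, so they are only a first-pass comparison, not final numbers.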

Listen-Lii commented 4 years ago

Thank you for your prompt reply! I will try the pseudo approach and report back.

Listen-Lii commented 4 years ago

Thank you for your suggestions. The pseudo approach did indeed yield more ASVs than the pooling approach. Additionally, its running time was a little longer than that of the pooling approach, but much shorter than that of the un-pooled approach.