benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Repeated OTUs in final taxa #80

Closed by montoyah 8 years ago

montoyah commented 8 years ago

Hi Ben,

I'm finding that in my final taxa output object (I'm not sure what to call DADA2's final output), several OTUs are repeated. For example, in my top 10 I have two Methanosaeta and two Smithella OTUs, so what I'm really getting is my top 8. Any idea why this could happen? I followed the MiSeq paired-end tutorial, and here is my session info:

sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8
[4] LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base

other attached packages:
[1] pander_0.6.0 ShortRead_1.30.0 GenomicAlignments_1.8.0
[4] SummarizedExperiment_1.2.1 Biobase_2.32.0 Rsamtools_1.24.0
[7] GenomicRanges_1.24.0 GenomeInfoDb_1.8.1 Biostrings_2.40.0
[10] XVector_0.12.0 IRanges_2.6.0 S4Vectors_0.10.0
[13] BiocParallel_1.6.1 BiocGenerics_0.18.0 phyloseq_1.16.1
[16] dada2_1.1.0 Rcpp_0.12.5 devtools_1.11.1
[19] BiocInstaller_1.22.2 RColorBrewer_1.1-2 ggplot2_2.1.0
[22] reshape2_1.4.1 dplyr_0.4.3

loaded via a namespace (and not attached):
[1] ape_3.4 lattice_0.20-33 assertthat_0.1 digest_0.6.9
[5] foreach_1.4.3 R6_2.1.2 plyr_1.8.3 chron_2.3-47
[9] httr_1.1.0 zlibbioc_1.18.0 curl_0.9.7 data.table_1.9.6
[13] vegan_2.3-5 Matrix_1.2-6 rmarkdown_0.9.6 labeling_0.3
[17] splines_3.3.0 stringr_1.0.0 igraph_1.0.1 munsell_0.4.3
[21] multtest_2.28.0 mgcv_1.8-12 htmltools_0.3.5 biomformat_1.0.1
[25] codetools_0.2-14 permute_0.9-0 withr_1.0.1 MASS_7.3-45
[29] bitops_1.0-6 grid_3.3.0 nlme_3.1-128 jsonlite_0.9.20
[33] gtable_0.2.0 DBI_0.4-1 git2r_0.15.0 magrittr_1.5
[37] scales_0.4.0 RcppParallel_4.3.19 stringi_1.0-1 hwriter_1.3.2
[41] latticeExtra_0.6-28 iterators_1.0.8 tools_3.3.0 ade4_1.7-4
[45] survival_2.39-4 yaml_2.1.13 colorspace_1.2-6 rhdf5_2.16.0
[49] cluster_2.0.4 memoise_1.0.0 knitr_1.13

Thanks in advance.

benjjneb commented 8 years ago

Taxonomic assignments are different from OTUs (or dada2's output sequences).

What you have is multiple sequence variants in your dataset that are being assigned to the same genus. This is what happens when, for example, there are two different Smithella strains in your samples.
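
A quick way to see which genera are represented by more than one sequence variant is to tabulate the genus labels. The following is a minimal sketch in base R on made-up data: the variant names (`SV1` to `SV5`) and genus assignments are hypothetical, and with real data the labels would presumably come from the genus column of your own taxonomy table.

```r
# Hypothetical genus assignments for five sequence variants (toy data,
# not taken from the dataset discussed in this thread).
genus <- c(SV1 = "Methanosaeta", SV2 = "Methanosaeta",
           SV3 = "Smithella",    SV4 = "Smithella",
           SV5 = "Syntrophus")

# Count how many distinct sequence variants were assigned to each genus.
variants_per_genus <- table(genus)

# Genera that are represented by more than one sequence variant.
repeated <- names(variants_per_genus[variants_per_genus > 1])
print(repeated)
```

Each repeated label here corresponds to distinct sequences, so the "duplicates" are separate variants that happen to share a genus-level name.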

montoyah commented 8 years ago

Yes, I realized that. I think my question is: what is the standard procedure for dealing with this type of data? I did some research on this issue and haven't found a discussion or related resource covering the topic. Can the different OTUs of a genus just be added up to give a single genus abundance? Is there a function in DADA2 to deal with this kind of repeated OTU? I apologize for my ignorance in this matter.

benjjneb commented 8 years ago

The resolution you want to include depends on your research question, but in general I would not recommend starting by throwing away sub-genus information. A crude example, but often someone would care about the difference between a garter snake and a rattler.

That said, doing analyses at higher taxonomic levels can be useful (albeit not where to start). Our friend phyloseq (see the tutorial for naturally transitioning dada2 data into phyloseq) handles this very well. The tax_glom function will be very useful for you if you want to analyze your data at the agglomerated genus level.
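
As a rough illustration of genus-level agglomeration, here is a minimal sketch using phyloseq's `tax_glom` on a toy object. The count table, variant names, sample names, and taxonomy are all made up for the example. Real dada2 output would be imported as in the tutorial, where the sequence table typically has taxa as columns rather than rows.

```r
library(phyloseq)

# Toy count table: 4 sequence variants (rows) x 2 samples (columns).
# All values are hypothetical.
counts <- matrix(
  c(10, 5,
     3, 7,
     8, 2,
     1, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("SV", 1:4), c("S1", "S2"))
)

# Toy taxonomy: two sequence variants share each genus.
tax <- matrix(
  c("Archaea",  "Methanosaeta",
    "Archaea",  "Methanosaeta",
    "Bacteria", "Smithella",
    "Bacteria", "Smithella"),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("SV", 1:4), c("Kingdom", "Genus"))
)

# Build a phyloseq object from the count and taxonomy tables.
ps <- phyloseq(otu_table(counts, taxa_are_rows = TRUE), tax_table(tax))

# Merge all taxa that share the same genus, summing their counts.
ps_genus <- tax_glom(ps, taxrank = "Genus")

# Inspect the agglomerated object: number of taxa and per-taxon totals.
print(ntaxa(ps_genus))
print(taxa_sums(ps_genus))
```

Keeping the original `ps` alongside `ps_genus` preserves the sequence-variant resolution for any analysis that needs it later.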

montoyah commented 8 years ago

Thanks a lot for the explanation and the info about tax_glom, Ben. Have a great day.