montoyah closed this issue 8 years ago
Taxonomic assignments are different from OTUs (or dada2's output sequences).
What you have is multiple sequence variants in your dataset that are being assigned to the same genus. This is what happens when, for example, there are two different Smithella strains in your samples.
Yes, I realized that. I think my question is, what is the standard procedure for dealing with this type of data? I did some research on this issue and haven't found a discussion or related resource covering the topic. Can the different OTUs of a genus just be added up to give a single genus abundance? Is there a function in DADA2 to deal with this kind of repeated OTU? I apologize for my ignorance in this matter.
The resolution you want to include depends on your research question, but in general I would not recommend starting by throwing away sub-genus information. A crude example, but often someone would care about the difference between a garter snake and a rattler.
That said, doing analyses at higher taxonomic levels can be useful (albeit not where to start). Our friend phyloseq (see the tutorial for naturally transitioning dada2 data into phyloseq) handles this very well. The tax_glom function will be very useful for you if you want to analyze your data at the agglomerated genus level.
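To make the idea of genus-level agglomeration concrete, here is a minimal sketch in plain Python. It is not phyloseq's implementation; it only illustrates the arithmetic of summing the counts of sequence variants that share a genus. The sequence IDs, sample names, counts, and the helper name `agglomerate_by_genus` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical per-sample read counts for each sequence variant (made-up numbers).
counts = {
    "seq1": {"sampleA": 120, "sampleB": 30},
    "seq2": {"sampleA": 80, "sampleB": 10},
    "seq3": {"sampleA": 50, "sampleB": 45},
    "seq4": {"sampleA": 5, "sampleB": 60},
}

# Hypothetical genus assignment for each variant; two variants share each genus.
genus_of = {
    "seq1": "Methanosaeta",
    "seq2": "Methanosaeta",
    "seq3": "Smithella",
    "seq4": "Smithella",
}


def agglomerate_by_genus(counts, genus_of):
    """Sum the per-sample counts of all variants assigned to the same genus."""
    totals = defaultdict(lambda: defaultdict(int))
    for seq_id, per_sample in counts.items():
        genus = genus_of[seq_id]
        for sample, n in per_sample.items():
            totals[genus][sample] += n
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {genus: dict(per_sample) for genus, per_sample in totals.items()}


genus_counts = agglomerate_by_genus(counts, genus_of)
print(genus_counts)
```

In R, the equivalent step would presumably be a call along the lines of `tax_glom(ps, taxrank = "Genus")`, where `ps` is a hypothetical phyloseq object built from the dada2 output; check the phyloseq documentation for how variants without a genus assignment are handled before relying on the result.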
Thanks a lot for the explanation and the info about tax_glom, Ben. Have a great day.
Hi Ben,
I'm finding that in my final taxa output object (I'm not sure what to call DADA2's final output), several OTUs are repeated. For example, in my top 10 I have two Methanosaeta and two Smithella OTUs, so what I'm really getting is my top 8. Any idea why this could happen? I followed the MiSeq paired-end tutorial, and here is my session info:
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8
[4] LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] pander_0.6.0 ShortRead_1.30.0 GenomicAlignments_1.8.0
[4] SummarizedExperiment_1.2.1 Biobase_2.32.0 Rsamtools_1.24.0
[7] GenomicRanges_1.24.0 GenomeInfoDb_1.8.1 Biostrings_2.40.0
[10] XVector_0.12.0 IRanges_2.6.0 S4Vectors_0.10.0
[13] BiocParallel_1.6.1 BiocGenerics_0.18.0 phyloseq_1.16.1
[16] dada2_1.1.0 Rcpp_0.12.5 devtools_1.11.1
[19] BiocInstaller_1.22.2 RColorBrewer_1.1-2 ggplot2_2.1.0
[22] reshape2_1.4.1 dplyr_0.4.3
loaded via a namespace (and not attached):
[1] ape_3.4 lattice_0.20-33 assertthat_0.1 digest_0.6.9
[5] foreach_1.4.3 R6_2.1.2 plyr_1.8.3 chron_2.3-47
[9] httr_1.1.0 zlibbioc_1.18.0 curl_0.9.7 data.table_1.9.6
[13] vegan_2.3-5 Matrix_1.2-6 rmarkdown_0.9.6 labeling_0.3
[17] splines_3.3.0 stringr_1.0.0 igraph_1.0.1 munsell_0.4.3
[21] multtest_2.28.0 mgcv_1.8-12 htmltools_0.3.5 biomformat_1.0.1
[25] codetools_0.2-14 permute_0.9-0 withr_1.0.1 MASS_7.3-45
[29] bitops_1.0-6 grid_3.3.0 nlme_3.1-128 jsonlite_0.9.20
[33] gtable_0.2.0 DBI_0.4-1 git2r_0.15.0 magrittr_1.5
[37] scales_0.4.0 RcppParallel_4.3.19 stringi_1.0-1 hwriter_1.3.2
[41] latticeExtra_0.6-28 iterators_1.0.8 tools_3.3.0 ade4_1.7-4
[45] survival_2.39-4 yaml_2.1.13 colorspace_1.2-6 rhdf5_2.16.0
[49] cluster_2.0.4 memoise_1.0.0 knitr_1.13
Thanks in advance.