GreenGenes training file doesn't work with DADA2's tutorial data set

montoyah commented 8 years ago

Hi,

I can run assignTaxonomy on the DADA2 tutorial's data set using the RDP and Silva training files, but when I use the GreenGenes one (gg_13_8_train_set_97.fa.gz), the phyloseq taxa abundance plot fails with the error "family"...not found. Any idea what's going on?

Here's the session info (Ubuntu 14.04):

sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.1 LTS

locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8
[6] LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] ggplot2_2.1.0 ShortRead_1.28.0 GenomicAlignments_1.6.3 SummarizedExperiment_1.0.2 Biobase_2.30.0
[6] Rsamtools_1.22.0 GenomicRanges_1.22.4 GenomeInfoDb_1.6.3 Biostrings_2.38.4 XVector_0.10.0
[11] IRanges_2.4.8 S4Vectors_0.8.11 BiocParallel_1.4.3 BiocGenerics_0.16.1 phyloseq_1.14.0
[16] dada2_0.99.10 Rcpp_0.12.4 devtools_1.11.1 BiocInstaller_1.20.1

loaded via a namespace (and not attached): [1] RColorBrewer_1.1-2 futile.logger_1.4.1 plyr_1.8.3 iterators_1.0.8 bitops_1.0-6 futile.options_1.0.0 tools_3.2.5
[8] zlibbioc_1.16.0 digest_0.6.9 nlme_3.1-127 memoise_1.0.0 gtable_0.2.0 lattice_0.20-33 mgcv_1.8-12
[15] igraph_1.0.1 Matrix_1.2-5 foreach_1.4.3 cluster_2.0.4 withr_1.0.1 hwriter_1.3.2 stringr_1.0.0
[22] multtest_2.26.0 ade4_1.7-4 grid_3.2.5 data.table_1.9.6 survival_2.39-2 RJSONIO_1.3-0 latticeExtra_0.6-28
[29] reshape2_1.4.1 lambda.r_1.1.7 magrittr_1.5 MASS_7.3-45 splines_3.2.5 codetools_0.2-14 scales_0.4.0
[36] permute_0.9-0 ape_3.4 colorspace_1.2-6 stringi_1.0-1 munsell_0.4.3 biom_0.3.12 vegan_2.3-5
[43] chron_2.3-47

Thanks in advance for any insights,

Oscar.

benjjneb commented 8 years ago

Can you clarify where you are getting an error?

Are you getting an error when running assignTaxonomy with the GreenGenes reference? Or does the error only crop up later, after the tax_table is merged into a phyloseq object?

montoyah commented 8 years ago

I get the error while making the "top.20" plot. This is the matrix I get after assigning taxa:

     [,1]          [,2]               [,3]             [,4]               [,5]       [,6]  [,7]
[1,] "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales" "f__S24-7" "g__" "s__"
[2,] "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales" "f__S24-7" "g__" "s__"
[3,] "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales" "f__S24-7" "g__" "s__"

I know I could do partial matching and that would solve the problem; I just want to make sure it's not a problem with the file. I already downloaded it again to see if that would fix it, but it didn't.

benjjneb commented 8 years ago

Is there an NA in the family column in the top 20 taxa in your data?
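One way to check for this is sketched below. The small matrix is a hypothetical stand-in for the taxonomy table of the top 20 taxa, not data from this thread; with a real phyloseq object, something like as(tax_table(ps.top20), "matrix") would presumably supply the matrix instead.

```r
# Hypothetical taxonomy matrix standing in for the top-20 taxa table.
# Rows are taxa, columns are ranks; NA marks an unassigned rank.
tax <- matrix(
  c("Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales", "S24-7", NA,
    "Bacteria", "Firmicutes",    "Clostridia",  "Clostridiales", NA,      NA),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus"))
)

# Number of missing assignments at each rank.
na_per_rank <- colSums(is.na(tax))

# TRUE if any taxon lacks a Family assignment.
family_has_na <- anyNA(tax[, "Family"])

print(na_per_rank)
print(family_has_na)
```

Note that indexing by "Family" only works once the columns have been named; on an unnamed matrix the same check would have to use the column position.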

montoyah commented 8 years ago

It doesn't seem like there are any (and this is the tutorial's data set, with everything else as in the tutorial):

ps.top20
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20 taxa and 19 samples ]
sample_data() Sample Data:       [ 19 samples by 4 sample variables ]
tax_table()   Taxonomy Table:    [ 20 taxa by 7 taxonomic ranks ]

benjjneb commented 8 years ago

I see, you are following the tutorial but switching in the GG reference.

In the tutorial, immediately after running assignTaxonomy(...) there is a command to name the columns of the taxa matrix by the phylogenetic rank. Change that to:

colnames(taxa) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

The command in the tutorial (the same, except without "Species") fails for the GG reference, because GG assigns taxonomy down to the species level rather than stopping at the genus level, so the taxa matrix has seven columns instead of six.
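As a minimal sketch of a naming step that works for either a six- or seven-column result, the rank names can be truncated to the number of columns actually present. The taxa matrix here is a hypothetical stand-in built by hand from the rows shown earlier in the thread, and the ranks variable is an illustrative name, not part of dada2.

```r
# Hypothetical stand-in for the matrix returned by assignTaxonomy()
# with the GreenGenes reference (7 ranks, down to species).
taxa <- matrix(
  c("k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia",
    "o__Bacteroidales", "f__S24-7", "g__", "s__"),
  nrow = 3, ncol = 7, byrow = TRUE
)

# Full list of rank names; only as many as there are columns get used.
ranks <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

# Guard against a reference that returns more ranks than we have names for.
stopifnot(ncol(taxa) <= length(ranks))

# Assign names to match the actual column count (6 for genus-level
# references, 7 for references that go down to species).
colnames(taxa) <- ranks[seq_len(ncol(taxa))]

print(colnames(taxa))
print(taxa[1, "Family"])
```

Assigning six names to a seven-column matrix makes the colnames assignment itself raise an error, which leaves the matrix unnamed and would explain why a later plot cannot find a Family column.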

montoyah commented 8 years ago

That did it! Thanks a lot, and thanks for the hard work on developing this pipeline.

benjjneb / dada2

GreenGenes training file doesn't work with DADA2's tutorial data set #68