non-unique value when setting 'row.names' when loading phyloseq object

YiweiNiu commented 5 years ago

Hi,

I tried to load phyloseq object.

library(ampvis2)
otutable <- data.frame(OTU = rownames(t(otu_table(ps)@.Data)),
                       t(otu_table(ps)@.Data),
                       tax_table(ps)@.Data,
                       check.names = FALSE
                       )

#Extract metadata from the phyloseq object:
metadata <- data.frame(sample_data(ps), 
                       check.names = FALSE
                       )

#Load the data with amp_load:
ps.av2 <- amp_load(otutable, metadata, tree = phy_tree(ps))

But got the following error.

Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed
6. stop("duplicate 'row.names' are not allowed")
5. `.rowNamesDF<-`(x, value = value)
4. `row.names<-.data.frame`(`*tmp*`, value = value)
3. `row.names<-`(`*tmp*`, value = value)
2. `rownames<-`(`*tmp*`, value = as.character(metadata[, 1]))
1. amp_load(otutable, metadata, tree = phy_tree(ps))

I did not see any duplicates in the row names of otutable or metadata.

> dim(otutable)
[1] 522  13
> dim(metadata)
[1] 5 4
> head(otutable)
      OTU S6-1-1 S6-1-2 S6-1-3 S6-10-1 S6-5-2  Kingdom     Phylum      Class         Order
ASV1 ASV1   1033   1910    657     101     47 Bacteria Firmicutes Clostridia Clostridiales
ASV2 ASV2    759    964    369      62     30 Bacteria Firmicutes Clostridia Clostridiales
ASV3 ASV3    535    932    291      54     19 Bacteria Firmicutes Clostridia Clostridiales
ASV4 ASV4    406   1027    294      33     25 Bacteria Firmicutes Clostridia Clostridiales
ASV5 ASV5    400    494    230      25      8 Bacteria Firmicutes Clostridia Clostridiales
ASV6 ASV6    279    594    162      24     29 Bacteria Firmicutes Clostridia Clostridiales
              Family            Genus       Species
ASV1 Ruminococcaceae             <NA>          <NA>
ASV2 Ruminococcaceae             <NA>          <NA>
ASV3 Ruminococcaceae             <NA>          <NA>
ASV4 Lachnospiraceae      Lachnospira          <NA>
ASV5 Ruminococcaceae Faecalibacterium          <NA>
ASV6 Lachnospiraceae      Lachnospira pectinoschiza
> length(rownames(otutable))
[1] 522
> length(unique(rownames(otutable)))
[1] 522
> rownames(metadata)
[1] "S6-1-1"  "S6-1-2"  "S6-1-3"  "S6-10-1" "S6-5-2"

Here is the R session information.

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ampvis2_2.4.11              ggplot2_3.2.0               phyloseq_1.28.0            
 [4] DESeq2_1.24.0               SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
 [7] BiocParallel_1.17.18        matrixStats_0.54.0          Biobase_2.44.0             
[10] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0         Rcpp_1.0.1                 
[13] Biostrings_2.52.0           XVector_0.24.0              IRanges_2.18.1             
[16] S4Vectors_0.22.0            BiocGenerics_0.30.0        

loaded via a namespace (and not attached):
 [1] nlme_3.1-140             bitops_1.0-6             lubridate_1.7.4          bit64_0.9-7             
 [5] httr_1.4.0               doParallel_1.0.14        RColorBrewer_1.1-2       tools_3.6.0             
 [9] backports_1.1.4          vegan_2.5-5              R6_2.4.0                 rpart_4.1-15            
[13] mgcv_1.8-28              Hmisc_4.2-0              DBI_1.0.0                lazyeval_0.2.2          
[17] colorspace_1.4-1         permute_0.9-5            ade4_1.7-13              nnet_7.3-12             
[21] withr_2.1.2              tidyselect_0.2.5         gridExtra_2.3            bit_1.1-14              
[25] compiler_3.6.0           cli_1.1.0                htmlTable_1.13.1         network_1.15            
[29] plotly_4.9.0             scales_1.0.0             checkmate_1.9.3          genefilter_1.66.0       
[33] ggnet_0.1.0              stringr_1.4.0            digest_0.6.19            Rsamtools_2.0.0         
[37] foreign_0.8-71           base64enc_0.1-3          pkgconfig_2.0.2          htmltools_0.3.6         
[41] htmlwidgets_1.3          rlang_0.4.0              rstudioapi_0.10          RSQLite_2.1.1           
[45] hwriter_1.3.2            jsonlite_1.6             acepack_1.4.1            dplyr_0.8.2             
[49] RCurl_1.95-4.12          magrittr_1.5             GenomeInfoDbData_1.2.1   Formula_1.2-3           
[53] biomformat_1.12.0        Matrix_1.2-17            munsell_0.5.0            Rhdf5lib_1.6.0          
[57] ape_5.3                  stringi_1.4.3            MASS_7.3-51.4            zlibbioc_1.30.0         
[61] rhdf5_2.28.0             plyr_1.8.4               grid_3.6.0               blob_1.1.1              
[65] ggrepel_0.8.1            crayon_1.3.4             lattice_0.20-38          cowplot_0.9.4           
[69] splines_3.6.0            multtest_2.40.0          annotate_1.62.0          locfit_1.5-9.1          
[73] knitr_1.23               pillar_1.4.2             igraph_1.2.4.1           geneplotter_1.62.0      
[77] reshape2_1.4.3           codetools_0.2-16         XML_3.98-1.20            glue_1.3.1              
[81] latticeExtra_0.6-28      remotes_2.1.0            data.table_1.12.2        RcppParallel_4.4.3      
[85] foreach_1.4.4            tidyr_0.8.3              gtable_0.3.0             purrr_0.3.2             
[89] assertthat_0.2.1         xfun_0.8                 xtable_1.8-4             viridisLite_0.3.0       
[93] survival_2.44-1.1        tibble_2.1.3             iterators_1.0.10         GenomicAlignments_1.20.1
[97] AnnotationDbi_1.46.0     memoise_1.1.0            cluster_2.1.0

Do you have any ideas about this?

Bests, Yiwei Niu

KasperSkytte commented 5 years ago

Hi. I can take a look after my holidays ends after next week.

Regards / MVH Kasper Skytte Andersen

On 5 Jul 2019, at 10.41, Yiwei Niu notifications@github.com wrote:

Hi,

I tried to load phyloseq object.

library(ampvis2) otutable <- data.frame(OTU = rownames(t(otu_table(ps)@.Data)), t(otu_table(ps)@.Data), tax_table(ps)@.Data, check.names = FALSE )

Extract metadata from the phyloseq object:

metadata <- data.frame(sample_data(ps), check.names = FALSE )

Load the data with amp_load:

ps.av2 <- amp_load(otutable, metadata, tree = phy_tree(ps)) But got the following error.

Error in .rowNamesDF<-(x, value = value) : duplicate 'row.names' are not allowed

stop("duplicate 'row.names' are not allowed")

.rowNamesDF<-(x, value = value)

row.names<-.data.frame(*tmp*, value = value)

row.names<-(*tmp*, value = value)

rownames<-(*tmp*, value = as.character(metadata[, 1]))

amp_load(otutable, metadata, tree = phy_tree(ps)) I did not see any duplicates in the row names of otutable or metadata.

dim(otutable) [1] 522 13 dim(metadata) [1] 5 4 head(otutable) OTU S6-1-1 S6-1-2 S6-1-3 S6-10-1 S6-5-2 Kingdom Phylum Class Order ASV1 ASV1 1033 1910 657 101 47 Bacteria Firmicutes Clostridia Clostridiales ASV2 ASV2 759 964 369 62 30 Bacteria Firmicutes Clostridia Clostridiales ASV3 ASV3 535 932 291 54 19 Bacteria Firmicutes Clostridia Clostridiales ASV4 ASV4 406 1027 294 33 25 Bacteria Firmicutes Clostridia Clostridiales ASV5 ASV5 400 494 230 25 8 Bacteria Firmicutes Clostridia Clostridiales ASV6 ASV6 279 594 162 24 29 Bacteria Firmicutes Clostridia Clostridiales Family Genus Species ASV1 Ruminococcaceae ASV2 Ruminococcaceae ASV3 Ruminococcaceae ASV4 Lachnospiraceae Lachnospira ASV5 Ruminococcaceae Faecalibacterium ASV6 Lachnospiraceae Lachnospira pectinoschiza length(rownames(otutable)) [1] 522 length(unique(rownames(otutable))) [1] 522 rownames(metadata) [1] "S6-1-1" "S6-1-2" "S6-1-3" "S6-10-1" "S6-5-2" Here is the R session information.

sessionInfo() R version 3.6.0 (2019-04-26) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale: [1] LC_COLLATE=Chinese (Simplified)_China.936 LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] ampvis2_2.4.11 ggplot2_3.2.0 phyloseq_1.28.0
[4] DESeq2_1.24.0 SummarizedExperiment_1.14.0 DelayedArray_0.10.0
[7] BiocParallel_1.17.18 matrixStats_0.54.0 Biobase_2.44.0
[10] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0 Rcpp_1.0.1
[13] Biostrings_2.52.0 XVector_0.24.0 IRanges_2.18.1
[16] S4Vectors_0.22.0 BiocGenerics_0.30.0

loaded via a [1] nlme_3.1-140 [5] httr_1.4.0 [9] backports_1.1.4 [13] mgcv_1.8-28 [17] colorspace_1.4-1 [21] withr_2.1.2 [25] compiler_3.6.0 [29] plotly_4.9.0 [33] ggnet_0.1.0 [37] foreign_0.8-71 [41] htmlwidgets_1.3 [45] hwriter_1.3.2 [49] RCurl_1.95-4.12 [53] biomformat_1.12.0 [57] ape_5.3 [61] rhdf5_2.28.0 [65] ggrepel_0.8.1 [69] splines_3.6.0 [73] knitr_1.23 [77] reshape2_1.4.3 [81] latticeExtra_0.6-28 [85] foreach_1.4.4 [89] assertthat_0.2.1 [93] survival_2.44-1.1 [97] AnnotationDbi_1.46.0 Do you have any ideas about this? namespace (and not attached): bitops_1.0-6 lubridate_1.7.4 bit64_0.9-7
doParallel_1.0.14 RColorBrewer_1.1-2 tools_3.6.0
vegan_2.5-5 R6_2.4.0 rpart_4.1-15
Hmisc_4.2-0 DBI_1.0.0 lazyeval_0.2.2
permute_0.9-5 ade4_1.7-13 nnet_7.3-12
tidyselect_0.2.5 gridExtra_2.3 bit_1.1-14
cli_1.1.0 htmlTable_1.13.1 network_1.15
scales_1.0.0 checkmate_1.9.3 genefilter_1.66.0
stringr_1.4.0 digest_0.6.19 Rsamtools_2.0.0
base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6
rlang_0.4.0 rstudioapi_0.10 RSQLite_2.1.1
jsonlite_1.6 acepack_1.4.1 dplyr_0.8.2
magrittr_1.5 GenomeInfoDbData_1.2.1 Formula_1.2-3
Matrix_1.2-17 munsell_0.5.0 Rhdf5lib_1.6.0
stringi_1.4.3 MASS_7.3-51.4 zlibbioc_1.30.0
plyr_1.8.4 grid_3.6.0 blob_1.1.1
crayon_1.3.4 lattice_0.20-38 cowplot_0.9.4
multtest_2.40.0 annotate_1.62.0 locfit_1.5-9.1
pillar_1.4.2 igraph_1.2.4.1 geneplotter_1.62.0
codetools_0.2-16 XML_3.98-1.20 glue_1.3.1
remotes_2.1.0 data.table_1.12.2 RcppParallel_4.4.3
tidyr_0.8.3 gtable_0.3.0 purrr_0.3.2
xfun_0.8 xtable_1.8-4 viridisLite_0.3.0
tibble_2.1.3 iterators_1.0.10 GenomicAlignments_1.20.1 memoise_1.1.0 cluster_2.1.0

Bests, Yiwei Niu

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

KasperSkytte commented 5 years ago

Hi again. You can check for duplicates with any(duplicates(rownames(otutable))), and see which are duplicates with which(duplicated(rownames(otutable))). There are duplicates somewhere, or else you wouldn't get that error. Maybe there are NA's or similar?

bstamps commented 5 years ago

I'm also having this issue- checking with any(duplicates(rownames(otutable))) and which(duplicated(rownames(otutable))) results in integer(0) and [1] FALSE, respectively.

bstamps commented 5 years ago

The below is a kludge adapted from this gist that works for DADA2 clustered phyloseq objects https://gist.github.com/shandley/87b0017ca593c58561a8b76b98985477 and appears to solve this issue, at least for me.

otu.table <- t(otu_table(ps, taxa_are_rows = FALSE))
otu.table <- as.data.frame(otu.table)
otu.table <- tibble::rownames_to_column(otu.table, var = "OTU")
head(rownames(otu.table))
head(colnames(otu.table))
nrow(otu.table)

tax.table <- as.data.frame(tax_table(ps))
tax.table <- tibble::rownames_to_column(tax.table, var = "OTU")
head(rownames(tax.table))
head(colnames(tax.table))
nrow(tax.table)

my_otu_table <- left_join(otu.table, tax.table, by = "OTU")
nrow(my_otu_table)
colnames(my_otu_table)
head(rownames(my_otu_table))

my_otu_table <- tibble::column_to_rownames(my_otu_table, var = "OTU")
head(rownames(my_otu_table))
colnames(my_otu_table)
nrow(my_otu_table)

my_metadata <- data.frame(phyloseq::sample_data(ps), 
                       check.names = FALSE
                       )
my_metadata <- tibble::rownames_to_column(my_metadata, "SampleID")
AV.Object <- amp_load(otutable = my_otu_table, metadata = my_metadata, tree = fitGTR$tree)
AV.Object

KasperSkytte commented 5 years ago

Thanks. What does taxa_are_rows(ps) say? I think you just have to skip transposing

KasperSkytte commented 5 years ago

Can you test if this function works on your data directly? I haven't been able to find example data where taxa_are_rows = FALSE. Or do you mind if I play around with your data to test it?

#install ampvis2 with devtools::install_github("madsalbertsen/ampvis2")
#install phyloseq with BiocManager::install("phyloseq)

data("GlobalPatterns", package = "phyloseq")

#function to convert a phyloseq object to an ampvis2 object
phyloseq_to_ampvis2 <- function(physeq) {
  #check object for class
  if(!any(class(physeq) %in% "phyloseq"))
    stop("physeq object must be an object of class \"phyloseq\"", call. = FALSE)

  #ampvis2 requires taxonomy and abundance table, phyloseq checks for the latter
  if(is.null(physeq@tax_table))
    stop("No taxonomy found in the phyloseq object", call. = FALSE)

  #OTUs must be in rows, not columns
  if(phyloseq::taxa_are_rows(physeq))
    abund <- as.data.frame(phyloseq::otu_table(physeq)@.Data)
  else
    abund <- as.data.frame(t(phyloseq::otu_table(physeq)@.Data))

  #tax_table is assumed to have OTUs in rows too
  tax <- phyloseq::tax_table(physeq)@.Data

  #merge by rownames (OTUs)
  otutable <- merge(
    abund,
    tax,
    by = 0,
    all.x = TRUE,
    all.y = FALSE,
    sort = FALSE
  )
  colnames(otutable)[1] <- "OTU"

  #extract sample_data (metadata)
  if(!is.null(physeq@sam_data)) {
    metadata <- data.frame(
      phyloseq::sample_data(physeq),
      row.names = phyloseq::sample_names(physeq), 
      stringsAsFactors = FALSE, 
      check.names = FALSE
    )
  } else
    metadata <- NULL

  #extract phylogenetic tree, assumed to be of class "phylo"
  if(!is.null(physeq@phy_tree)) {
    tree <- phyloseq::phy_tree(physeq)
  } else
    tree <- NULL

  #extract OTU DNA sequences, assumed to be of class "XStringSet"
  if(!is.null(physeq@refseq)) {
    #convert XStringSet to DNAbin using a temporary file (easiest)
    fastaTempFile <- tempfile(pattern = "ampvis2_", fileext = ".fa")
    Biostrings::writeXStringSet(physeq@refseq, filepath = fastaTempFile)
  } else
    fastaTempFile <- NULL

  #load as normally with amp_load
  ampvis2::amp_load(
    otutable = otutable,
    metadata = metadata,
    tree = tree,
    fasta = fastaTempFile
  )
}

d <- phyloseq_to_ampvis2(GlobalPatterns)
d
#> ampvis2 object with 4 elements. 
#> Summary of OTU table:
#>      Samples         OTUs  Total#Reads    Min#Reads    Max#Reads 
#>           26        19216     28216678        58688      2357181 
#> Median#Reads    Avg#Reads 
#>      1106849   1085256.85 
#> 
#> Assigned taxonomy:
#>       Kingdom        Phylum         Class         Order        Family 
#>   19216(100%) 18974(98.74%) 18887(98.29%) 17460(90.86%) 13599(70.77%) 
#>         Genus       Species 
#>  7968(41.47%)   1413(7.35%) 
#> 
#> Metadata variables: 7 
#>  X_SampleID, Primer, Final_Barcode, Barcode_truncated_plus_T, Barcode_full_length, SampleType, Description

^{Created on 2019-08-21 by the reprex package (v0.3.0)}

bstamps commented 5 years ago

taxa_are_rows(ps) does return as FALSE, and I can't get the above example to successfully run on my data (Which I cannot share right now, my apologies)- it produces the same error as before of duplicate 'row.names' are not allowed. I'll try and locate some sharable data that flows through the DADA2 workflow that also produces this error...

KasperSkytte commented 5 years ago

I see I made a mistake in the phyloseq_to_ampvis2() function, which is corrected now. It used GlobalPatterns inside the function instead of the physeq data provided by the user. Try again. I just tried with some samples from a 2018 project, ran it through the dada2 tutorial and it loads fine.

KasperSkytte commented 5 years ago

@YiweiNiu Do you mind sharing your data so I can debug?

bstamps commented 5 years ago

I can try this at work tomorrow, will let you know ASAP. My apologies for delays.

KasperSkytte commented 5 years ago

Absolutely no rush! I'm most likely busy next week anyways. Thanks for contributing

YiweiNiu commented 5 years ago

Hi,

Sorry for the late reply.

The code, R, R packages are still the same as above. And still got the same error.

> any(duplicated(rownames(otutable)))
[1] FALSE
> which(duplicated(rownames(otutable)))
integer(0)
> taxa_are_rows(ps)
[1] FALSE

I would like to share my data. Here is the phyloseq object I used niuyw_ps.zip.

Bests, Yiwei Niu

KasperSkytte commented 5 years ago

Thank you @YiweiNiu. Good with some data. I now see that the problem is that sample ID's are rownames in the phyloseq metadata, but ampvis2 expects them to be in the first column. I'll make a solution to this soon.

YiweiNiu commented 5 years ago

Glad to know that. Very looking forward to the new version.

KasperSkytte commented 5 years ago

Here is a revised function. Works with your data @YiweiNiu. @bstamps can you confirm if it works with your data? I'll likely store the function as a gist and never include it in ampvis2 to avoid phyloseq as a dependency, ampvis2 is meant to be independent of any bioconductor packages.

phyloseq_to_ampvis2 <- function(physeq) {
  #check object for class
  if(!any(class(physeq) %in% "phyloseq"))
    stop("physeq object must be of class \"phyloseq\"", call. = FALSE)

  #ampvis2 requires taxonomy and abundance table, phyloseq checks for the latter
  if(is.null(physeq@tax_table))
    stop("No taxonomy found in the phyloseq object and is required for ampvis2", call. = FALSE)

  #OTUs must be in rows, not columns
  if(phyloseq::taxa_are_rows(physeq))
    abund <- as.data.frame(phyloseq::otu_table(physeq)@.Data)
  else
    abund <- as.data.frame(t(phyloseq::otu_table(physeq)@.Data))

  #tax_table is assumed to have OTUs in rows too
  tax <- phyloseq::tax_table(physeq)@.Data

  #merge by rownames (OTUs)
  otutable <- merge(
    abund,
    tax,
    by = 0,
    all.x = TRUE,
    all.y = FALSE,
    sort = FALSE
  )
  colnames(otutable)[1] <- "OTU"

  #extract sample_data (metadata)
  if(!is.null(physeq@sam_data)) {
    metadata <- data.frame(
      phyloseq::sample_data(physeq),
      row.names = phyloseq::sample_names(physeq), 
      stringsAsFactors = FALSE, 
      check.names = FALSE
    )

    #check if any columns match exactly with rownames
    #if none matched assume row names are sample identifiers
    samplesCol <- unlist(lapply(metadata, function(x) {
      identical(x, rownames(metadata))}))

    if(any(samplesCol)) {
      #error if a column matched and it's not the first
      if(!samplesCol[[1]])
        stop("Sample ID's must be in the first column in the sample metadata, please reorder", call. = FALSE)
    } else {
      #assume rownames are sample identifiers, merge at the end with name "SampleID"
      if(any(colnames(metadata) %in% "SampleID"))
        stop("A column in the sample metadata is already named \"SampleID\" but does not seem to contain sample ID's", call. = FALSE)
      metadata$SampleID <- rownames(metadata)

      #reorder columns so SampleID is the first
      metadata <- metadata[, c(which(colnames(metadata) %in% "SampleID"), 1:(ncol(metadata)-1L)), drop = FALSE]
    }
  } else
    metadata <- NULL

  #extract phylogenetic tree, assumed to be of class "phylo"
  if(!is.null(physeq@phy_tree)) {
    tree <- phyloseq::phy_tree(physeq)
  } else
    tree <- NULL

  #extract OTU DNA sequences, assumed to be of class "XStringSet"
  if(!is.null(physeq@refseq)) {
    #convert XStringSet to DNAbin using a temporary file (easiest)
    fastaTempFile <- tempfile(pattern = "ampvis2_", fileext = ".fa")
    Biostrings::writeXStringSet(physeq@refseq, filepath = fastaTempFile)
  } else
    fastaTempFile <- NULL

  #load as normally with amp_load
  ampvis2::amp_load(
    otutable = otutable,
    metadata = metadata,
    tree = tree,
    fasta = fastaTempFile
  )
}

bstamps commented 5 years ago

Can confirm this worked with my dataset. Thank you for working through this. I know you have your own data model and don't want phyloseq as a dependency of Ampvis2, but the gist will certainly be helpful for those of us that move between the two!

KasperSkytte commented 5 years ago

Great, glad it worked. I'll link it in the FAQ on the website.

KasperSkytte commented 5 years ago

https://gist.github.com/KasperSkytte/8d0ca4206a66be7ff6d76fc4ab8e66c6

KasperSkytte commented 5 years ago

And https://madsalbertsen.github.io/ampvis2/articles/faq.html#loading-phyloseq-objects

marcellinaR commented 1 year ago

Hello everyone,

I hope I can reopen this issue again. I have the same problem using my phyloseq data in the amp_load function. I used the provided function developed by @KasperSkytte and applied it to my phyloseq object. So far, I couldn't solve the Error message that I got when I applied the function:

phyloseq_to_ampvis2(seq_MR) Error in colSums(abund) : 'x' must be numeric 3. colSums(abund) 2. ampvis2::amp_load(otutable = otutable, metadata = metadata, tree = tree, fasta = fastaTempFile) 1. phyloseq_to_ampvis2(phyloseq) My phyloseq object "seq_MR" is rarefied and subsetted. My seq_MR OTU table is numeric: > is.numeric(seq_MR@otu_table) [1] TRUE

Of course, in the function process, the tax_table data is added to the otu_table, but that appeared to be no issue in the previous experiences shared here.

If you would like me to share my data, let me know! I would be very happy for some help & suggestions - Thank you! Marcellina

KasperSkytte commented 1 year ago

Hi @marcellinaR

Can you provide a reproducible example please?

marcellinaR commented 1 year ago

Hi thank you for your quick reply!

I am using R Studio (Version 2023.06.2+561) on my MacOS Ventura 13.3.1. This is my phyloseq object: seq_MR.rds.zip

I copied your function and applied this on my phyloseq object "seq_MR", as shown above. The function worked well on the provided data by YiweiNiu, but I received the already shared Error working with my data set:

> dM <- phyloseq_to_ampvis2(seq_MR)
Error in colSums(abund) : 'x' must be numeric
3.
colSums(abund)
2.
ampvis2::amp_load(otutable = otutable, metadata = metadata, tree = tree,
fasta = fastaTempFile)
1.
phyloseq_to_ampvis2(seq_MR)

The difference that I noticed comparing my otu and tax_table to YiweiNiu's is that her columnnames contain sequences. I have OTUs as rows and sampleIDs/taxonomy as columns:

> dim(seq_MR@otu_table)
[1] 1400   84
> seq_MR@otu_table
OTU Table:          [1400 taxa and 84 samples]
                     taxa are rows
      MW007.fastq_silva MW008.fastq_silva MW009.fastq_silva MW011.fastq_silva MW013.fastq_silva MW016.fastq_silva MW017.fastq_silva
3                    34                33                34                84                35                29                71
43656                 0                 0                 6                 0                 0                 0                 0
43534                 3                 2                 1                 0                 0                 0                 1
43481                 0                 0                 0                 0                 0                 0                 0
46449                30                20                26                76                93                24                23
50242                 8                 9                12                11                 5                 2                 6
50471              3345              1380              1136                46                13               621              3839
50381             17268             18173             18558             19940             22360             22335             15875
46260                 1                 0                 0                 1                 0                 0                 2

> seq_MR@tax_table
Taxonomy Table:     [1400 taxa by 9 taxonomic ranks]:
      Kingdom    Phylum                          Class                          Order                                
3     "Bacteria" "unclassified"                  "unclassified"                 "unclassified"                       
43656 "Bacteria" "Actinobacteriota"              "Actinobacteria"               "Propionibacteriales"                
43534 "Bacteria" "Actinobacteriota"              "Actinobacteria"               "Micrococcales"                      
43481 "Bacteria" "Actinobacteriota"              "Actinobacteria"               "Micrococcales"                      
46449 "Bacteria" "Proteobacteria"                "Gammaproteobacteria"          "Enterobacterales"                   
50242 "Bacteria" "Proteobacteria"                "Gammaproteobacteria"          "Enterobacterales"                   
50471 "Bacteria" "Proteobacteria"                "Gammaproteobacteria"          "Pseudomonadales"

Thank you!

KasperSkytte commented 1 year ago

The issue is the format of the taxonomy. It's weird. I'll make some checks and update. 2 min

KasperSkytte commented 1 year ago

Since this issue and the gist mentioned in the FAQ amp_load has matured quite a bit and now include various checks for the formats of the individual objects provided. So I decided to do this properly now. As mentioned earlier the phyloseq package will not be a required package by ampvis2, but by just returning an error instead if it's not installed it's now possible to provide a phyloseq class object directly to amp_load. It's as simple as: amp_load(phyloseq_object). Update to the recent version 2.8.4 for it to be available.

@marcellinaR In your case you need to ensure a "OTU" column is present in the taxonomy. It's required.

marcellinaR commented 1 year ago

Alright, thank you I had version ‘2.8.3’. I now have packageVersion ‘2.8.4’.

I have tried to modify my taxonomy table and my OTU table by adding an OTU column before and reading everything in the amp_load() function as data frames. I received the duplicate row.names error.

I now tried to change my OTU and taxonomy table and created a new phyloseq object to apply the amp_load() function. But phyloseq does not take an OTU table containing non-numeric values.

otu_tab <- otu_table(seq_MR)
otu_tab <- as.data.frame(otu_tab)
otu_tab$OTU <- rownames(otu_tab)
otu_tab <- as.matrix(otu_tab)

otu_table = otu_table(otu_tab, taxa_are_rows = TRUE, errorIfNULL=TRUE)
Error in validObject(.Object) : invalid class “otu_table” object: 
 Non-numeric matrix provided as OTU table.
Abundance is expected to be numeric.

However, I can add an OTU column only to my tax_table and keep the OTU table and sample_data unchanged. Applying the amp_load() function here, I only receive the warning "Could not find a column named OTU/ASV in otutable, using rownames as OTU ID's "

> # prepare data with additional "OTU" and "SampleID" columns as required for amp_load()
> otu_tab <- otu_table(seq_MR)
> meta_tab <- sample_data(seq_MR)
> tax_tab <- tax_table(seq_MR)
> tax_tab <- as.data.frame(tax_tab)
> tax_tab$OTU <- rownames(tax_tab)
> tax_tab <- as.matrix(tax_tab)

> otu_table = otu_table(otu_tab, taxa_are_rows = TRUE, errorIfNULL=TRUE)
> sam_data = sample_data(meta_tab, errorIfNULL = TRUE)
> tax_table = tax_table(tax_tab, errorIfNULL=TRUE)

> # make phyloseq
> seq_MR_amp = phyloseq(sam_data,otu_table,tax_table)

> # load in ampvis2 object
> amp_test <- amp_load(seq_MR_amp)
Phyloseq object provided. Ignoring anything provided for the metadata, taxonomy, fasta, or tree arguments, using those from the phyloseq object instead.
Warning message:
Could not find a column named OTU/ASV in otutable, using rownames as OTU ID's

The same appears if I simply use my entire phyloseq object.

> amp_MR <- amp_load(seq_MR)
Phyloseq object provided. Ignoring anything provided for the metadata, taxonomy, fasta, or tree arguments, using those from the phyloseq object instead.
Warning messages:
1: Could not find a column named OTU/ASV in otutable, using rownames as OTU ID's 
2: Could not find a column named OTU/ASV in taxonomy, using rownames as OTU ID's

If I look at the structure of my ampvis2 object ("amp_MR"), everything looks alright I would say. Therefore, I might have a super simple and stupid question now, but how can I have a look at my e.g. OTU table as I would do with my phyloseq object using seq_MR@otu_table or: head(seq_MR@otu_table)? To check if OTU IDs etc. are present and if the object is good to use for analysis? This is my ampvis2 object:

> amp_MR
ampvis2 object with 3 elements. 
Summary of OTU table:
     Samples         OTUs  Total#Reads    Min#Reads    Max#Reads Median#Reads    Avg#Reads 
          84         1400      2117556        25209        25209        25209        25209 

Assigned taxonomy:
     Kingdom       Phylum        Class        Order       Family        Genus      Species 
  1400(100%) 1398(99.86%)   1400(100%) 1398(99.86%) 1395(99.64%) 1398(99.86%)   1400(100%) 

Metadata variables: 34 
 SampleID, approach, kingdom, phylum, class, order, family, genus, species, taxon, order_tax, NMS_time, NMS_time_txt, variant, mother_colony, collection_date, depth_small, depth_group, depth_m, depth_fathoms, lat, long, location, geo_group, extraction_date, control_ID, qubit_br, qubit_hs, nanodrop, A_260_280, A_260_230, fixative, loc_time, time_tac

Thanks again for your help!!

KasperSkytte commented 1 year ago

It looks like you are loading things through phyloseq first, then import with amp_load. Why not just skip phyloseq? Looks over-complicated to me. The taxonomy you provided here is a mess, I would go upstream and fix things. You must have some proper OTU/ASV names from the processing workflow. You also have "unclassified" fields everywhere. But in the end a warning is just a warning. You can continue with analysis as-is. An ampvis2 object is just a list, use View() or similar.

marcellinaR commented 1 year ago

So far, my analysis worked well using my data in a phyloseq format, or alternatively usingpsmelt() prior to analysis. But I extracted my data into data frames now and adjusted the structure - as expected, no warnings anymore.

The OTU& taxonomic table was provided to me in a biom format for my master thesis, which I then used for further processing and analysis. Sequencing was conducted using the MinION. Kranken2.1.2 in conjunction with the SILVA reference database (138.1 release) was used for taxonomic assignment (following: DOI: 10.1111/mec.16462). Further processing, rarefaction etc. of the data was done with phyloseq.

I know this is a bit off-topic now, but if you have the additional minutes, I would be very thankful to hear your suggestions to solve the "mess" in my current data and also to improve my analysis in the future. Which part of the upstream analysis should have been done differently based on the information I provided?

Anyways, thank you for your help, I am happy to explore ampvis2 now.

KasperSkytte commented 1 year ago

You can also load biom format data directly into ampvis2. With mess I mean the column names are not all taxonomic levels (Phy_Fam and OTU_tax, should only contain Kingdom -> Species+OTU) and if you have "unclassified" entries everywhere they are going to be considered a single taxon, even though they are not. R doesn't know about bacterial names, so it's up to you to do a sanity check first. You'll otherwise get inflated abundances with a mysterious bacterium called "unclassified", when they are in fact from many different unrelated species. But I assume you have been using SILVA. It's notoriously bad with regards to weird names and placeholders in the actual taxonomy, so you should be removing those first. There are others, you can gsub all columns with a pattern like "unclassified|uncultured|unknown|unidentified|incertae sedis|metagenome|\\bbacterium\\b|\\bpossible\\b". If they are environmental samples from sludge or digesters I highly recommend the MiDAS database, which I have co-authored. With nanopore reads we almost always get species classifications for 99%+ with just a raw minimap2 mapping. Script here for inspiration https://github.com/cmc-aau/fl16S_nanopore/.

marcellinaR commented 1 year ago

Alright! Thank you for your help and your suggestions, I will apply that to clean my data. Thanks also, for providing the script and recommending MiDAS as database - I will look into that!

KasperSkytte / ampvis2

non-unique value when setting 'row.names' when loading phyloseq object #68

Extract metadata from the phyloseq object:

Load the data with amp_load: