Hi @nouse123
At present, the q = 1 rarefaction algorithm is slow when the number of samples is very large. Maybe you should try q = 0 or q = 2 first.
Hi @nouse123, maybe you could use OpenBLAS to speed up R performance.
https://www.r-bloggers.com/why-is-r-slow-some-explanations-and-mklopenblas-setup-to-try-to-fix-this/
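For what it's worth, you can check which BLAS/LAPACK libraries your R is linked against before and after switching (the `BLAS`/`LAPACK` fields of `sessionInfo()` are available from R 3.4 onward); a minimal sketch:

```r
# Check which BLAS/LAPACK R is using (fields available in R >= 3.4);
# after installing OpenBLAS the BLAS path should point to libopenblas.
si <- sessionInfo()
cat("BLAS:  ", si$BLAS, "\n")
cat("LAPACK:", si$LAPACK, "\n")

# a crude matrix-multiplication benchmark to compare before/after
m <- matrix(rnorm(500 * 500), 500)
print(system.time(m %*% m))
```

On a reference BLAS this multiplication is noticeably slower than on OpenBLAS/MKL, which is one quick way to confirm the switch took effect.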
Late answer, but a solution: working with datatype="incidence_freq" instead of abundance reduces the computation time dramatically (< 1 min versus aborted after > 10 hours). When transforming data from abundance to incidence_freq, make sure to sum incidences ([0,1]), not abundances (https://github.com/JohnsonHsieh/iNEXT/issues/3).
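To make the pitfall concrete, here is a base-R sketch with a made-up 2-sample, 3-species table, contrasting the correct conversion (summing presences) with the wrong one (summing abundances):

```r
# toy abundance table: 2 samples (rows) x 3 species (columns)
abund <- rbind(sample1 = c(5, 0, 2),
               sample2 = c(3, 1, 0))

# correct: reduce to presence/absence first, then sum over samples
incidence <- colSums(abund > 0)       # 2, 1, 1

# wrong: summing raw abundances can exceed the number of samples
not_incidence <- colSums(abund)       # 8, 1, 2

# incidence_freq format: number of sampling units first, then frequencies
incidence_freq <- c(nrow(abund), incidence)
```

Note that an incidence frequency can never exceed the number of sampling units, which is a quick sanity check on any converted vector.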
In the case of microbiome data, the estimator used may not be suitable to work on abundances, since these are not species counts but amplicon counts; counts of 1 or 2 are therefore rarely observed. The estimator, however, is based on singleton and doubleton observations across species (Chao & Jost, 2012; https://doi.org/10.1890/11-1952.1). Working with abundances may therefore substantially overestimate sample coverage.
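To illustrate, here is a sketch of the sample-coverage estimator from Chao & Jost (2012), C-hat = 1 - (f1/n) * ((n-1)f1 / ((n-1)f1 + 2f2)), where f1 and f2 are singleton and doubleton counts and n the total count. When low counts are absent (as with inflated amplicon counts), estimated coverage is pushed toward 1:

```r
# sample-coverage estimator of Chao & Jost (2012)
coverage <- function(abund) {
  n  <- sum(abund)
  f1 <- sum(abund == 1)   # singletons
  f2 <- sum(abund == 2)   # doubletons
  1 - (f1 / n) * ((n - 1) * f1 / ((n - 1) * f1 + 2 * f2))
}

# community with many singletons -> low estimated coverage (~0.65)
coverage(c(1, 1, 1, 1, 2, 5))

# no singletons (e.g. amplicon counts where 1s are rare)
# -> coverage estimated as complete
coverage(c(3, 3, 2, 5))
```

The toy numbers are made up; the point is only that removing singletons from an otherwise identical community drives the coverage estimate to its maximum.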
@julianselke
Can you demonstrate exactly how you converted your data to do this? For example, the iNEXT package includes a dataset called spider that is coded as abundance data. How would you convert spider to incidence frequency data?
@canuckafar
The data set spider cannot be converted to incidence_freq. spider lists the abundance counts per species and site. For incidence_freq, however, you need the count of samples/sites each species occurred in. Since the data set gives no information on species identity, we cannot convert it.
We could, however, assume that species indices are always the same at different sites (the first abundance count is species A, the second is species B, ...) and convert the abundance data to raw incidence:
library(iNEXT)      # provides the spider data set
library(magrittr)   # provides the %>% pipe

# pad the site with fewer elements ("missing species") with NAs
max_len <- max(length(spider[[1]]), length(spider[[2]]))
length(spider[[1]]) <- max_len
length(spider[[2]]) <- max_len

# transform to raw abundance data, i.e. a species x sites data frame
raw_abundance <- unlist(spider) %>%
  matrix(nrow = length(spider), byrow = TRUE) %>%
  t() %>%
  data.frame() %>%
  `colnames<-`(names(spider))

# transform to raw incidence data, i.e. presence/absence
raw_incidence <- ifelse(raw_abundance > 0, 1, 0)

# replace NAs with zeros
raw_incidence[is.na(raw_incidence)] <- 0
Now we can use the iNEXT function as.incfreq to transform the data to incidence_freq:
incidence_freq <- as.incfreq(raw_incidence)
But ideally you would have raw_abundance or raw_incidence data to start with.
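For completeness, a sketch of feeding such a vector into iNEXT (guarded, in case the package is not installed). The numbers below are made up; the first element of an incidence_freq vector is the number of sampling units, followed by the per-species incidence frequencies:

```r
# hypothetical incidence_freq vector: 10 sampling units, then frequencies
freq <- c(10, 8, 6, 3, 1, 1)

if (requireNamespace("iNEXT", quietly = TRUE)) {
  out <- iNEXT::iNEXT(freq, q = 0, datatype = "incidence_freq")
  # out$DataInfo summarises the sampling units and estimated coverage
  print(out$DataInfo)
}
```

This is the call pattern where the incidence-based computation is dramatically faster than datatype = "abundance" on large samples.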
The data I worked with consisted of two dataframes, a feature table (which would be raw_abundance) and a metadata table. Let me illustrate what I did using the GlobalPatterns dataset of the phyloseq package:
# function arguments:
# feature_table: a data.frame (rows are samples/sites)
# meta_data: a data.frame
# var_group: a metadata column to summarise by
# id_col: the metadata column corresponding to sample/site names of the feature table
to_incidence_freq <- function(feature_table, meta_data, var_group, id_col){
  group <- deparse(substitute(var_group))
  id_col <- deparse(substitute(id_col))
  # presence/absence matrix
  a <- as.data.frame(ifelse(feature_table > 0, 1, 0))
  a[[id_col]] <- rownames(a)
  # attach the grouping variable, then drop the id column again
  b <- dplyr::left_join(a, meta_data[, c(id_col, group)], by = id_col)
  b <- b[, c(1:(ncol(b) - 2), ncol(b))]
  # sum incidences per group (requires the %>% pipe to be attached)
  res <- b %>%
    dplyr::group_by({{ var_group }}) %>%
    dplyr::summarise(dplyr::across(dplyr::everything(), sum)) %>%
    as.data.frame()
  # one list element per group; element 1 of each vector is the group name
  d <- as.list(as.data.frame(t(res)))
  names(d) <- sort(base::unique(meta_data[, group]))
  # prepend the number of sampling units, drop the group name
  d <- lapply(d, function(x)
    as.numeric(c(length(b[, group][b[, group] == x[1]]), x[2:length(x)])))
  # keep only non-zero incidence frequencies
  d <- lapply(d, function(x) x[x != 0])
  return(d)
}
library(phyloseq)
data(GlobalPatterns)
md <- sample_data(GlobalPatterns)
class(md) <- "data.frame"
# feature table in phyloseq has samples/sites as columns
ft <- otu_table(GlobalPatterns) %>% data.frame() %>% t() %>% data.frame()
x <- to_incidence_freq(ft, md, SampleType, X.SampleID)
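As a cross-check, the same conversion can be sketched in base R without dplyr. This is a hypothetical helper (not from the thread), assuming the rows of feature_table are samples and groups is a vector aligned with those rows:

```r
# hypothetical base-R equivalent of to_incidence_freq;
# groups must be in the same order as the rows of feature_table
to_incfreq_base <- function(feature_table, groups) {
  pres <- (as.matrix(feature_table) > 0) * 1   # presence/absence
  sums <- rowsum(pres, group = groups)         # incidence counts per group
  res <- lapply(rownames(sums), function(g) {
    v <- c(sum(groups == g), sums[g, ])        # prepend number of sampling units
    as.numeric(v[v != 0])                      # drop species absent from the group
  })
  setNames(res, rownames(sums))
}

# small worked example: 3 samples, 2 species, 2 groups
ft <- data.frame(sp1 = c(3, 0, 2), sp2 = c(0, 1, 0))
res <- to_incfreq_base(ft, c("A", "A", "B"))
res$A  # 2 sampling units; each species seen in 1 of them
```

With aligned metadata, a call such as to_incfreq_base(ft, md$SampleType) would be the analogue of the function above.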
I hope this helps you!
I have parallel iNEXT jobs running on 60 samples each, with an average sample size of about 250,000 observations. Here is a screenshot; my R sessions have now been running for several weeks, and the R instances are still responsive.
Is that normal?