JohnsonHsieh / iNEXT

R package for interpolation and extrapolation
https://JohnsonHsieh.github.com/iNEXT

Running time on large data sets #32

Closed trichterhb closed 6 years ago

trichterhb commented 6 years ago

I have parallel jobs of iNEXT running on 60 samples each, with an average sample size of about 250,000 observations. Here is a screenshot: my R sessions have now been running for several weeks. The R instances are still responsive.

[screenshot]

Is that normal?

JohnsonHsieh commented 6 years ago

Hi @nouse123

At present, the rarefaction algorithm for q = 1 becomes very slow when the samples are very large. Maybe you should try q = 0 or q = 2 first.
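
For example (a sketch, assuming x is your abundance vector or list of abundance vectors):

library(iNEXT)
# compute only the faster orders first; q = 1 can be added later if needed
out <- iNEXT(x, q = c(0, 2), datatype = "abundance")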

JohnsonHsieh commented 6 years ago

Hi @nouse123, maybe you could use OpenBLAS to speed up R's performance.

https://www.r-bloggers.com/why-is-r-slow-some-explanations-and-mklopenblas-setup-to-try-to-fix-this/
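
To check which BLAS/LAPACK libraries your R session is currently linked against, sessionInfo() reports them (on reasonably recent R versions):

# the BLAS/LAPACK lines of the output show the libraries in use
sessionInfo()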

julianselke commented 2 years ago

Late answer, but a solution:

Working with datatype="incidence_freq" instead of "abundance" reduces the computation time dramatically (< 1 min versus aborted after > 10 hours). When transforming data from abundance to incidence_freq, make sure to sum incidences ([0, 1]), not abundances (https://github.com/JohnsonHsieh/iNEXT/issues/3).
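
For example, starting from a hypothetical species-by-samples abundance matrix abund_mat (species in rows, samples in columns), a sketch of the conversion would be:

# first entry: number of sampling units; then the per-species incidence frequencies
# note rowSums(abund_mat > 0): we sum incidences (presence/absence), not abundances
incidence_freq <- c(ncol(abund_mat), rowSums(abund_mat > 0))
out <- iNEXT(incidence_freq, q = 0, datatype = "incidence_freq")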


A comment on microbiome data:

In the case of microbiome data, the estimator used may not be suitable for abundances, since these are not species counts but amplicon counts. As a result, counts of 1 or 2 are rarely observed. The estimator, however, is based on singleton and doubleton observations across species (Chao & Jost, 2012; https://doi.org/10.1890/11-1952.1). Working with abundances may therefore lead to a substantial overestimation of sample coverage.
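
A quick way to check this on your own data (a sketch, for a hypothetical abundance vector x of one sample):

f1 <- sum(x == 1)  # singletons
f2 <- sum(x == 2)  # doubletons
# if singletons and doubletons are essentially absent, abundance-based
# coverage estimates for this kind of data should be treated with caution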

canuckafar commented 2 years ago

@julianselke

Can you demonstrate exactly how you converted your data to do this? For example, there is a dataset called spider in the iNEXT package that is coded as abundance data. How would you convert spider to incidence frequency data?

julianselke commented 2 years ago

@canuckafar

The data set spider cannot be converted to incidence_freq.

spider lists the abundance counts per species and site. However, for incidence_freq, you need the number of samples/sites a species occurred in. Since the data set does not give information on species identity, we cannot convert it properly.

We could, however, assume that species indices are the same at different sites (the first abundance count is species A, the second is species B, ...) and convert the abundance data to raw_incidence:

library(iNEXT)     # for the spider data and as.incfreq()
library(magrittr)  # for the %>% pipe

data(spider)

# pad the site with fewer species with NA so both vectors have equal length
max_len <- max(length(spider[[1]]), length(spider[[2]]))
length(spider[[1]]) <- max_len
length(spider[[2]]) <- max_len

# transform to raw_abundance data, i.e. a species x sites data frame
raw_abundance <- unlist(spider) %>%
  matrix(nrow = length(spider), byrow = TRUE) %>%
  t() %>%
  data.frame() %>%
  `colnames<-`(names(spider))

# transform to raw_incidence data, i.e. presence/absence
raw_incidence <- ifelse(raw_abundance > 0, 1, 0)
# replace the padding NAs with zeros
raw_incidence[is.na(raw_incidence)] <- 0

Now we can use iNEXT's as.incfreq() function to transform the data to incidence_freq.

incidence_freq <- as.incfreq(raw_incidence)
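
The result can then be passed to iNEXT with datatype = "incidence_freq" (just to show the call; with only two sites, the spider example is of course not very informative):

out <- iNEXT(incidence_freq, q = 0, datatype = "incidence_freq")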


But ideally you would have raw_abundance or raw_incidence data to start with.


The data I worked with consisted of two data frames: a feature table (which corresponds to raw_abundance) and a metadata table. Let me illustrate what I did using the GlobalPatterns dataset from the package phyloseq:

# function arguments:
# feature_table:  a data.frame (rows are samples/sites)
# meta_data:      a data.frame
# var_group:      a metadata column to summarise by (passed unquoted)
# id_col:         the metadata column matching the sample/site names (rownames) of the feature table (passed unquoted)

to_incidence_freq <- function(feature_table, meta_data, var_group, id_col){
  group <- deparse(substitute(var_group))
  id_col <- deparse(substitute(id_col))
  # convert abundances to presence/absence (incidence)
  a <- as.data.frame(ifelse(feature_table > 0, 1, 0))
  a[[id_col]] <- rownames(a)
  # attach the grouping variable to each sample/site
  b <- dplyr::left_join(a, meta_data[, c(id_col, group)], by = id_col)
  # drop the id column, keep the species columns and the grouping column
  b <- b[, c(1:(ncol(b) - 2), ncol(b))]
  # sum incidences (not abundances!) within each group
  c <- b %>%
    dplyr::group_by({{ var_group }}) %>%
    dplyr::summarise(dplyr::across(dplyr::everything(), sum))
  class(c) <- "data.frame"
  # one list element per group, holding that group's incidence frequencies
  d <- as.list(as.data.frame(t(c)))
  names(d) <- sort(base::unique(meta_data[, group]))
  # replace the group label in x[1] with the number of sampling units per group,
  # as required by datatype = "incidence_freq"
  d <- lapply(d, function(x)
    as.numeric(c(length(b[, group][b[, group] == x[1]]), x[2:length(x)])))
  # remove species with zero incidence in a group
  d <- lapply(d, function(x) {x[x != 0]})
  return(d)
}

library(phyloseq)
data(GlobalPatterns)

# extract the metadata as a plain data.frame
md <- sample_data(GlobalPatterns)
class(md) <- "data.frame"
# the feature table in phyloseq has samples/sites as columns; transpose so they are rows
ft <- otu_table(GlobalPatterns) %>% data.frame() %>% t() %>% data.frame()

x <- to_incidence_freq(ft, md, SampleType, X.SampleID)
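
The resulting list can then be fed directly into iNEXT (a sketch; choose q and the other arguments as needed):

library(iNEXT)
out <- iNEXT(x, q = 0, datatype = "incidence_freq")
ggiNEXT(out, type = 1)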


I hope this helps you!