Open haiyangzhang798 opened 2 years ago
This is a great question. In the paper, that occypancy>10 number was calculated via simulation, i.e. by down-sampling real data. Creating a function to do this for user data would be quite easy, and would be a good addition to the package. I'll prototype such a function and add the code to this issue, and eventually add it to the main package.
Try this out. See the example for a how-to.
#' occ_power
#'
#' Downsamples a feature vector x to lower occupancies, and calculates Spec
#' on each resulting vector. Useful for power analysis. phy_or_env_spec is
#' run on simulated (downsampled) vectors. It is run with denom_type =
#' "sim_center" to speed up computation.
#'
#' @author John L. Darcy
#'
#' @param x numeric vector. A feature vector, i.e. a column from the abunds_mat
#' argument to phy_or_env_spec()
#' @param reps_per_core. integer. Number of simulated vectors to create from x
#' for each occupancy level (default: 10)
#' @param min_occ integer. Smallest occupancy level to simulate (default: 5)
#' @param max_occ integer. Largest occupancy level to simulate. If NULL, will
#' be set to length(x) (default: NULL)
#' @param ... Arguments to be passed to phy_or_env_spec(). Must include env or
#' hosts and hosts_phylo. Suggest including n_cores and n_sim. Cannot include
#' denom_type.
#'
#' @return data.frame object with following columns:
#' \describe{
#' \item{Pval:}{
#' P-value from phy_or_env_spec()
#' }
#' \item{Spec:}{
#' Spec from phy_or_env_spec(); NOTE: values >0 are not corrected.
#' }
#' \item{occupancy:}{
#' occupancy of simulated vector
#' }
#' }
#'
#' @examples
# library(specificity)
# data(endophyte)
# env <- endophyte$metadata$Elevation
# # find an empirical feature vector with high occupancy and strong specificity.
# abunds_mat <- occ_threshold(prop_abund(endophyte$otutable), 100)
# empirical_specs <- phy_or_env_spec(abunds_mat, env, denom_type="sim_center", n_cores=10)
# plot(empirical_specs$Spec ~ colSums(abunds_mat > 0))
# x <- abunds_mat[, which.min(empirical_specs$Spec)]
# # run this function. speed up by lowering reps_per_occ.
# sim_specs <- occ_power(x, env=env, n_cores=10, chunksize=250)
# # plot
# plot(sim_specs$Pval ~ sim_specs$occupancy)
# # calculate false negative rate (FNR) for each occupancy
# fnrs <- aggregate(Pval~occupancy, sim_specs, FUN=function(x){ sum(x >= 0.05) / length(x) })
# colnames(fnrs) <- c("occupancy", "FNR")
# # add spline fit
# fnrs$FNR_spline <- smooth.spline(x=fnrs$occupancy,y=fnrs$FNR)$y
# # make a plot
# plot(FNR~occupancy, data=fnrs)
# points(FNR_spline ~ occupancy, data=fnrs, type="l")
#'
#' @export
occ_power <- function(x, reps_per_occ=10, min_occ=5, max_occ=NULL, ...){
calc_occ <- function(x){sum(x > 0)}
if(is.null(max_occ)){ max_occ <- calc_occ(x) }
occs <- rep(seq(from=min_occ, to=max_occ), reps_per_occ)
xpositions <- seq_along(x)
abunds_mat <- do.call("cbind", lapply(occs, function(occ){
xi <- x
while(calc_occ(xi) > occ){
xi[sample(xpositions, 1)] <- 0
}
return(xi)
}))
out <- phy_or_env_spec(abunds_mat=abunds_mat, denom_type="sim_center", ...)
out$occupancy <- colSums(abunds_mat > 0) # recalculate empirical occupancy
return(out)
}
Tested already and it works!! Super! Many thanks, John.
Many thanks for the interesting work. I wonder how to decide the threshold number for the low-occupancy data in their own dataset? For example, you used 10 (i.e., occupying less than 10 samples) in the endophyte dataset, but will that number apply to one's own microbial data? Any code to run a sensitivity analysis with one's own data to determine the threshold? Thanks.