Closed AlineTalhouk closed 8 years ago
in progress
Reproducible example:
# Load packages, register parallel backend
pkgs <- c("parallel", "doParallel", "foreach", "dplyr", "bioDist", "microbenchmark")
invisible(lapply(pkgs, function(x)
suppressPackageStartupMessages(library(x, character.only = TRUE))))
ncores <- detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
# Simulate data and store variables
set.seed(911)
x <- matrix(rnorm(1000), nrow = 10)
k <- 4
min.sd <- 1
pItem <- 0.8
reps <- 10
seed <- 123456
seed.method <- 1
x.rest <- diceR:::prepare_data(x, min.sd = min.sd)
x.nmf <- x.rest %>%
rbind(-.) %>%
apply(2, function(x) ifelse(x < 0, 0, x))
method <- c("hcAEucl", "hcDianaEucl", "nmfDiv")
samples <- colnames(x.rest)
n <- ncol(x.rest)
n.new <- floor(n * pItem)
nm <- length(method)
# Compare run time of Method 1 (for loop) and Method 2 (foreach loop)
mb <- microbenchmark(
Method1 = {
coclus <- array(NA, c(n, reps, nm),
dimnames = list(samples, paste0("R", 1:reps), method))
for (j in 1:nm) {
set.seed(seed.method)
for (i in 1:reps) {
ind.new <- sample(n, n.new, replace = FALSE)
if (any(c("nmfDiv", "nmfEucl") %in% method))
x.nmf.samp <- x.nmf[!(apply(x.nmf[, ind.new], 1,
function(x) all(x == 0))), ind.new]
coclus[ind.new, i, j] <- switch(
method[j],
nmfDiv = NMF::predict(NMF::nmf(
x.nmf.samp, rank = k, method = "brunet", seed = seed)),
hcAEucl = cutree(hclust(dist(t(x.rest[, ind.new])),
method = "average"), k),
hcDianaEucl = cutree(cluster::diana(euc(t(x.rest[, ind.new])),
diss = TRUE), k))
}
}
coclus1 <- coclus
},
Method2 = {
coclus <- array(NA, c(n, reps, nm),
dimnames = list(samples, paste0("R", 1:reps), method))
coclus <- foreach(j = 1:nm, .packages = c("foreach", "bioDist")) %dopar% {
set.seed(seed.method)
foreach(i = 1:reps) %dopar% {
ind.new <- sample(n, n.new, replace = FALSE)
if (any(c("nmfDiv", "nmfEucl") %in% method))
x.nmf.samp <- x.nmf[!(apply(x.nmf[, ind.new], 1,
function(x) all(x == 0))), ind.new]
coclus[ind.new, i, j] <- switch(
method[j],
nmfDiv = NMF::predict(NMF::nmf(
x.nmf.samp, rank = k, method = "brunet", seed = seed)),
hcAEucl = cutree(hclust(dist(t(x.rest[, ind.new])),
method = "average"), k),
hcDianaEucl = cutree(cluster::diana(euc(t(x.rest[, ind.new])),
diss = TRUE), k)
)
}
coclus[, , j]
} %>%
unlist() %>%
array(dim = c(n, reps, nm),
dimnames = list(paste0("V", 1:n), paste0("R", 1:reps), method))
coclus2 <- coclus
}, times = 5)
mb
#> Unit: seconds
#> expr min lq mean median uq max neval cld
#> Method1 3.952632 3.952990 4.027329 4.004446 4.015033 4.211546 5 b
#> Method2 3.118724 3.171489 3.183520 3.178968 3.208202 3.240214 5 a
# Compare objects
identical(coclus1, coclus2)
#> [1] TRUE
# Stop cluster
stopImplicitCluster()
Hi @dchiu911 and @liujohnson118 might be time for a quick meeting to chat, are you guys free now?
From today's meeting, we just need a test for larger number of reps, algorithms to see how it scales.
Runtime results
x <- matrix(rnorm(1000), nrow = 10)
reps <- 1000
# All except biclust, since dimension of x is too small the function breaks
method <- c("nmfDiv", "nmfEucl", "hcAEucl", "hcDianaEucl", "kmEucl",
"kmSpear", "pamEucl", "pamSpear", "apEucl", "scRbf", "gmmBIC")
#> Unit: seconds
#> expr min lq mean median uq max neval
#> Method1 1110.5376 1110.5376 1110.5376 1110.5376 1110.5376 1110.5376 1
#> Method2 478.6027 478.6027 478.6027 478.6027 478.6027 478.6027 1
It appears the parallel version is over 2 times faster.
Oh yeah! now you're talking!
let's do it!
Note to self: add param parallel = TRUE
in ConClust()
option to also trigger parallel computing under specific settings egs number of reps >x and the number of cores >2
sure I will think of appropriate names and defaults
When you say "trigger", do you mean the user inputs certain parameters, but he doesn't know whether parallelization is being performed until the function is running?
Yes, so for example if someone is only asking for one algorithm and 500 reps on a small data set no need to do it, or if someone doesn't have multiple cores, can't do it.. etc
I don't feel this is crucial honestly, it could be a user set parameter with a default that gives error message if it's an issue
I agree there can be internal assertions that verify that the user can actually use parallel if they so choose, such as if parallel::detectCores()
returns 1 (don't know if that ever happens).
I'm not certain how to assert that there is "no need" however. I'm in the midst of running some simulations right now and calculating run times.
OK sounds good no need to worry about that for now.
For 10 reps, gain in time is 5 to 6 seconds. For 100 reps, gain in time is over 2 times.
Smaller gain in time with fewer replicates since most of computation time for parallel is for registering the workers instead of actual clustering. Please review commit 7d674ce4 before closing.
Example code I used for testing:
library(diceR)
data(hgsc)
x <- t(hgsc[, -1])[1:200, 1:100]
reps <- 500
k <- 4
mb <- microbenchmark::microbenchmark(
Method1 = {
c1 <- ConClust(x, k = k, reps = reps, parallel = FALSE)
},
Method2 = {
c2 <- ConClust(x, k = k, reps = reps, parallel = TRUE)
},
times = 1
)
mb
identical(c1, c2)
looks good, I guess given that you are the unit test police, this was done on this feature as well?
not yet, unit tests to be added to cover these cases
remember when testing to make sure it is compatible with running on a cluster (eg gsc)
This is where the Travis CI builds would have come in handy. There is something about commit 7f07ff1c330219bebbc8a8590e2d1c22593c1265 that is causing the build to stall that I haven't been able to figure out. I basically set up the build to try on 3 different R versions:
oldrel
(an older release)release
(current version)devel
(development version)So that we see whether the package can work under all these cases.
investigate