AlineTalhouk / diceR

Diverse Cluster Ensemble in R
https://alinetalhouk.github.io/diceR/
Other
34 stars 10 forks source link

Parallel computing #19

Closed AlineTalhouk closed 8 years ago

AlineTalhouk commented 8 years ago

investigate

dchiu911 commented 8 years ago

in progress

dchiu911 commented 8 years ago

Reproducible example:

# Load packages, register parallel backend
pkgs <- c("parallel", "doParallel", "foreach", "dplyr", "bioDist", "microbenchmark")
invisible(lapply(pkgs, function(x)
  suppressPackageStartupMessages(library(x, character.only = TRUE))))
ncores <- detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)

# Simulate data and store variables
set.seed(911)
x <- matrix(rnorm(1000), nrow = 10)
k <- 4
min.sd <- 1
pItem <- 0.8
reps <- 10
seed <- 123456
seed.method <- 1
x.rest <- diceR:::prepare_data(x, min.sd = min.sd)
x.nmf <- x.rest %>%
  rbind(-.) %>%
  apply(2, function(x) ifelse(x < 0, 0, x))
method <- c("hcAEucl", "hcDianaEucl", "nmfDiv")
samples <- colnames(x.rest)
n <- ncol(x.rest)
n.new <- floor(n * pItem)
nm <- length(method)

# Compare run time of Method 1 (for loop) and Method 2 (foreach loop)
mb <- microbenchmark(
  Method1 = {
    coclus <- array(NA, c(n, reps, nm),
                    dimnames = list(samples, paste0("R", 1:reps), method))
    for (j in 1:nm) {
      set.seed(seed.method)
      for (i in 1:reps) {
        ind.new <- sample(n, n.new, replace = FALSE)
        if (any(c("nmfDiv", "nmfEucl") %in% method))
          x.nmf.samp <- x.nmf[!(apply(x.nmf[, ind.new], 1,
                                      function(x) all(x == 0))), ind.new]
        coclus[ind.new, i, j] <- switch(
          method[j],
          nmfDiv = NMF::predict(NMF::nmf(
            x.nmf.samp, rank = k, method = "brunet", seed = seed)),
          hcAEucl = cutree(hclust(dist(t(x.rest[, ind.new])),
                                  method = "average"), k),
          hcDianaEucl = cutree(cluster::diana(euc(t(x.rest[, ind.new])),
                                              diss = TRUE), k))
      }
    }
    coclus1 <- coclus
  },
  Method2 = {
    coclus <- array(NA, c(n, reps, nm),
                    dimnames = list(samples, paste0("R", 1:reps), method))
    coclus <- foreach(j = 1:nm, .packages = c("foreach", "bioDist")) %dopar% {
      set.seed(seed.method)
      foreach(i = 1:reps) %dopar% {
        ind.new <- sample(n, n.new, replace = FALSE)
        if (any(c("nmfDiv", "nmfEucl") %in% method))
          x.nmf.samp <- x.nmf[!(apply(x.nmf[, ind.new], 1,
                                      function(x) all(x == 0))), ind.new]
        coclus[ind.new, i, j] <- switch(
          method[j],
          nmfDiv = NMF::predict(NMF::nmf(
            x.nmf.samp, rank = k, method = "brunet", seed = seed)),
          hcAEucl = cutree(hclust(dist(t(x.rest[, ind.new])),
                                  method = "average"), k),
          hcDianaEucl = cutree(cluster::diana(euc(t(x.rest[, ind.new])),
                                              diss = TRUE), k)
        )
      }
      coclus[, , j]
    } %>%
      unlist() %>% 
      array(dim = c(n, reps, nm),
            dimnames = list(paste0("V", 1:n), paste0("R", 1:reps), method))
    coclus2 <- coclus
  }, times = 5)

mb
#> Unit: seconds
#>     expr      min       lq     mean   median       uq      max neval cld
#>  Method1 3.952632 3.952990 4.027329 4.004446 4.015033 4.211546     5   b
#>  Method2 3.118724 3.171489 3.183520 3.178968 3.208202 3.240214     5  a 

# Compare objects
identical(coclus1, coclus2)
#>  [1] TRUE

# Stop cluster
stopImplicitCluster()
AlineTalhouk commented 8 years ago

Hi @dchiu911 and @liujohnson118 might be time for a quick meeting to chat, are you guys free now?

AlineTalhouk commented 8 years ago

From today's meeting, we just need a test for larger number of reps, algorithms to see how it scales.

dchiu911 commented 8 years ago

Runtime results

x <- matrix(rnorm(1000), nrow = 10)
reps <- 1000
# All except biclust, since dimension of x is too small the function breaks
method <- c("nmfDiv", "nmfEucl", "hcAEucl", "hcDianaEucl", "kmEucl",
            "kmSpear", "pamEucl", "pamSpear", "apEucl", "scRbf", "gmmBIC")

#> Unit: seconds
#> expr       min        lq      mean    median        uq       max neval
#> Method1 1110.5376 1110.5376 1110.5376 1110.5376 1110.5376 1110.5376     1
#> Method2  478.6027  478.6027  478.6027  478.6027  478.6027  478.6027     1

It appears the parallel version is over 2 times faster.

AlineTalhouk commented 8 years ago

Oh yeah! now you're talking!

let's do it!

dchiu911 commented 8 years ago

Note to self: add param parallel = TRUE in ConClust()

AlineTalhouk commented 8 years ago

option to also trigger parallel computing under specific settings egs number of reps >x and the number of cores >2

dchiu911 commented 8 years ago

sure I will think of appropriate names and defaults

dchiu911 commented 8 years ago

When you say "trigger", do you mean the user inputs certain parameters, but he doesn't know whether parallelization is being performed until the function is running?

AlineTalhouk commented 8 years ago

Yes, so for example if someone is only asking for one algorithm and 500 reps on a small data set no need to do it, or if someone doesn't have multiple cores, can't do it.. etc

I don't feel this is crucial honestly, it could be a user set parameter with a default that gives error message if it's an issue

dchiu911 commented 8 years ago

I agree there can be internal assertions that verify that the user can actually use parallel if they so choose, such as if parallel::detectCores() returns 1 (don't know if that ever happens).

I'm not certain how to assert that there is "no need" however. I'm in the midst of running some simulations right now and calculating run times.

AlineTalhouk commented 8 years ago

OK sounds good no need to worry about that for now.

dchiu911 commented 8 years ago

For 10 reps, gain in time is 5 to 6 seconds. For 100 reps, gain in time is over 2 times.

Smaller gain in time with fewer replicates since most of computation time for parallel is for registering the workers instead of actual clustering. Please review commit 7d674ce4 before closing.

Example code I used for testing:

library(diceR)
data(hgsc)
x <- t(hgsc[, -1])[1:200, 1:100]
reps <- 500
k <- 4

mb <- microbenchmark::microbenchmark(
  Method1 = {
    c1 <- ConClust(x, k = k, reps = reps, parallel = FALSE)
  },
  Method2 = {
    c2 <- ConClust(x, k = k, reps = reps, parallel = TRUE)
  },
  times = 1
)
mb

identical(c1, c2)
AlineTalhouk commented 8 years ago

looks good, I guess given that you are the unit test police, this was done on this feature as well?

dchiu911 commented 8 years ago

not yet, unit tests to be added to cover these cases

AlineTalhouk commented 8 years ago

remember when testing to make sure it is compatible with running on a cluster (eg gsc)

dchiu911 commented 8 years ago

This is where the Travis CI builds would have come in handy. There is something about commit 7f07ff1c330219bebbc8a8590e2d1c22593c1265 that is causing the build to stall that I haven't been able to figure out. I basically set up the build to try on 3 different R versions:

So that we see whether the package can work under all these cases.