Suggestion for parallelized gridsearch

peterraphael commented 4 years ago

Dear Cécile, dear Viviane,

The lcmm package is great. Thanks for your work!

Lately, I have been working a lot with jointlcmm models. Fitting more complex joint models is computationally very expensive, and performing a grid search can take much time. As I get impatient easily, I want to suggest a parallelized version of gridsearch, which can reduce the waiting time quite a bit (depending on the available hardware).

The proposed gridsearch.parallel function call does not differ from that of gridsearch except for the cl argument expecting a cluster created using parallel::makeCluster.

gridsearch.parallel <- function(m,rep,maxiter,minit,cl=NULL)
{
  if(!is.null(cl)){
    clusterSetRNGStream(cl)
    mc <- match.call()$m
    mc$maxiter <- maxiter
    assign("minit",eval(minit))

    clusterCall(cl, function () require(lcmm))
    clusterExport(cl, list("mc", "maxiter", "minit", as.character(as.list(mc[-1])$data)), envir = environment())

    cat("Be patient, grid search is running ...\n")

    models <- parLapply(cl, 1:rep, function(X)
    {
      mc$B <- substitute(random(minit),parent.frame(n=2))
      return(do.call(as.character(mc[[1]]),as.list(mc[-1])))
    }
    )

    llmodels <- sapply(models,function(x){return(x$loglik)})

    kmax <- which.max(llmodels)
    mc$B <- models[[kmax]]$best
    mc$maxiter <- NULL

    cat("Search completed, performing final estimation\n")

    return(do.call(as.character(mc[[1]]),as.list(mc[-1])))
  }
  return(do.call(gridsearch, as.list(match.call()[2:5])))
}

Evaluation

Considering the example form the gridsearch man page, but using 250 replications instead of 50. The execution time halved using all cores on a machine with four physical cores (average increase in speed by factor 2.1, 1.9, and 1.6 using four, three or two cores, compared to the non-parallelized gridsearch; see below).

m1 <- hlme(Y ~ Time * X1, random =~ Time, subject = 'ID', ng = 1, 
           data = data_hlme)

library(parallel)

times <- matrix(NA, 10, 5)
for(i in 1:10){
  times[i,1] <- system.time(
    gridsearch(rep = 250, maxiter = 10, minit = m1, hlme(Y ~ Time * X1,
      mixture =~ Time, random =~ Time, classmb =~ X2 + X3, subject = 'ID',
      ng = 2, data = data_hlme))
  )[3]
  for(j in 1:4){
      times[i,j+1] <- system.time({
        cl <- makeCluster(j)
        gridsearch.parallel(rep = 250, maxiter = 10, minit = m1, hlme(Y ~ Time * X1,
          mixture =~ Time, random =~ Time, classmb =~ X2 + X3, subject = 'ID',
          ng = 2, data = data_hlme), cl=cl)
        stopCluster(cl)
      })[3]
  }
}

Rplot01 As there is some overhead for the initialization of clusters, I would expect the relative increase in speed for computational more expensive models to be higher than for the presented example.

Best, Raphael

CecileProust-Lima commented 4 years ago

Thank you very much for your feedback. You are completely right, gridsearch should be available in parallel, computations are completely independent and thus perfect for parallel computing. In addition, as you mention, it should let users run much more replicates, which is always good and recommended with latent class models. This is actually something we do in our own works but we had not provided it yet on the released version. We will take into account your advice and add this soon. Thanks again! Cécile

peterraphael commented 4 years ago

Great, I am looking forward to it.

CecileProust-Lima commented 4 years ago

Hi, I am happy to tell you that the new version of lcmm is now on cran : version 1.9.1. It now includes the gridsearch with parallel computations as you suggested. Thanks! You will see, it also includes the estimation of latent class models for multiple longitudinal markers in mpjlcmm function. Best Cécile

CecileProust-Lima / lcmm

Suggestion for parallelized gridsearch #39

Evaluation