drisso / zinbwave

Clone of the Bioconductor repository for the zinbwave package, see https://bioconductor.org/packages/zinbwave

Managing parallelism on cluster #35

Closed afinneg2 closed 6 years ago

afinneg2 commented 6 years ago

Hello,

Thank you for this very useful package! I am running into trouble using zinbWave on a cluster environment with a slurm scheduler.

Briefly, my issue is that the command

zinbFit(cur.se, K=20, X="~nUMI + sample + percent.mito", verbose = TRUE, BPPARAM=MulticoreParam(workers=cores))

seems to always use all the cores on the node, regardless of how many are requested via the cores variable. For example, when I set cores=1, I get load averages on the Linux cluster equal to 24 (the number of cores on the node). When I set cores to any value greater than 1, the load averages keep growing well beyond the number of available cores, lots of zombie processes appear, and I have to kill the job or risk crashing the node.

I understand the issue could be with the BiocParallel package or specific to the setup of my cluster. Nevertheless, I am wondering if you have encountered a similar problem, or if you could recommend a BiocParallel setup that works for zinbwave on clusters managed by SLURM.

I greatly appreciate any help!

drisso commented 6 years ago

Hi @afinneg2 ,

I haven't seen the behavior that you describe, and it might indeed be an issue with BiocParallel, since the only thing that I do in zinbFit is pass the BPPARAM argument to bplapply.
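Concretely, the pattern is essentially the following (a simplified sketch, not the actual zinbwave source; fit_genes and the per-gene computation are placeholders):

library(BiocParallel)

# Illustrative only: zinbFit forwards its BPPARAM argument straight to
# bplapply, so the number of workers is whatever BPPARAM specifies.
fit_genes <- function(Y, BPPARAM = bpparam()) {
  bplapply(seq_len(nrow(Y)), function(i) {
    mean(Y[i, ])  # stand-in for the real per-gene model fit
  }, BPPARAM = BPPARAM)
}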

I only have a couple of suggestions to try and figure out if this is a problem with zinbwave or BiocParallel:

  1. Try to run the following:
BiocParallel::register(BiocParallel::MulticoreParam(workers = cores))
zinbFit(cur.se, K=20, X="~nUMI + sample + percent.mito", verbose = TRUE)

and see if the behavior remains the same.

  2. Try a simpler script that directly uses bplapply, e.g.
library(BiocParallel)
bplapply(seq_len(10), function(x) {
  Sys.sleep(1)
  rnorm(100)
}, BPPARAM = MulticoreParam(workers = cores))

and see if the behavior is the same. If it happens in 2, then it's for sure BiocParallel; if only in 1, perhaps there's something going on in the way zinbFit passes BPPARAM to bplapply. A variant of the test in 2 that also counts the spawned worker processes is sketched below.
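This sketch assumes cores is defined as above; Sys.getpid is base R:

library(BiocParallel)
pids <- bplapply(seq_len(10), function(x) {
  Sys.sleep(1)
  Sys.getpid()  # each forked worker reports its own process id
}, BPPARAM = MulticoreParam(workers = cores))
length(unique(unlist(pids)))  # should not exceed the value of cores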

Simon-Coetzee commented 6 years ago

A heads-up on this issue: if your cluster has an optimized BLAS implementation installed (e.g. OpenBLAS or MKL), much of the matrix work may be parallelized automatically, over and above any explicit parallelization from BiocParallel. Ask your sysadmin.
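If that turns out to be the case, the BLAS thread count can also be capped from within R; a sketch using the RhpcBLASctl package, assuming it is installed on the cluster:

library(RhpcBLASctl)

blas_get_num_procs()     # how many threads the BLAS currently uses
blas_set_num_threads(1)  # keep BLAS serial; let BiocParallel do the forking
omp_set_num_threads(1)   # likewise for OpenMP-backed routines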

afinneg2 commented 6 years ago

@Simon-Coetzee and @drisso, thank you very much for your advice and help. I agree with @Simon-Coetzee's diagnosis. I have found the following to work in my cluster environment:

nThreads=2
export OPENBLAS_NUM_THREADS=$nThreads OMP_NUM_THREADS=$nThreads MKL_NUM_THREADS=$nThreads
Rscript run_zinb_setWorkers.R

where in run_zinb_setWorkers.R, in addition to running zinbwave, I set

BiocParallel::register(BiocParallel::MulticoreParam(workers = nWorkers))

I choose nWorkers * nThreads = (number of cores to use) - 2. This seems to work: there is no longer a problem of exploding load averages, although some zombie processes are still generated.
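For anyone adapting this: the worker count can also be derived from the SLURM allocation instead of hard-coded. A sketch, assuming the job requests --cpus-per-task (otherwise SLURM_CPUS_PER_TASK is unset and the fallback applies):

nThreads <- as.integer(Sys.getenv("OPENBLAS_NUM_THREADS", "1"))
nCores   <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK",
                                  parallel::detectCores()))
nWorkers <- max(1L, (nCores - 2L) %/% nThreads)  # keep 2 cores of headroom
BiocParallel::register(BiocParallel::MulticoreParam(workers = nWorkers))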

Thanks again.