TheoreticalEcology / s-jSDM

Scalable joint species distribution modeling
https://cran.r-project.org/web/packages/sjSDM/index.html
GNU General Public License v3.0

How to improve GPU utilisation - how to parallelise over GPUs #74

Open MaximilianPi opened 3 years ago

MaximilianPi commented 3 years ago

Question from a user:

I ran the code in a container as a script processing job on AWS and it worked. I used an AWS EC2 instance with 4 NVIDIA Tesla V100 GPUs. However, when I checked the GPU utilisation, the job only used about 30% of the GPU capacity (the range is 0-400% for 4 GPUs) and only about 10% of the GPU memory. A lot of the available resources were not used by sjSDM. Is there a way to maximise/optimise the GPU usage within sjSDM? Thanks.

Job description: Tuning sjSDM-DNN via CV

Answer: GPU utilisation is limited by the model size and/or the dataset (number of species). Model specifications that increase utilisation usually also change the model itself and can therefore have side effects.

Batch size is the only 'neutral' model argument that allows you to increase GPU utilisation without changing the model itself.
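For illustration, a minimal sketch of increasing the batch size, reusing the simulated X and Y from the toy example below (and assuming the batch size is passed via the step_size argument, which may differ between sjSDM versions, see ?sjSDM):

# sketch: a larger batch size sends bigger chunks to the GPU per optimisation step
# (step_size is assumed to be the batch-size argument here, check ?sjSDM)
m = sjSDM(Y, X, step_size = 200L, device = "gpu")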

However, if you have a job where you have to run multiple sjSDM models (e.g. hyper-parameter tuning) and each model requires only a fraction of the GPU's resources, we can parallelise over the GPU, i.e. we run several sjSDM models simultaneously on one GPU.

Toy example: Tuning learning rate

library(sjSDM)

# simulate a small community dataset
com = simulate_SDM()
X = com$env_weights
Y = com$response

# candidate learning rates
lr = seq(0.001, 0.5, length.out = 10L)

# fit one model per learning rate, sequentially, and store the log-likelihoods
ll = numeric(10L)
for(i in 1:10) {
  m = sjSDM(Y, X, learning_rate = lr[i], device = "gpu")
  ll[i] = logLik(m)
}
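The vector ll then holds one log-likelihood per candidate learning rate and can be inspected directly, e.g.:

# pair each candidate learning rate with its log-likelihood
data.frame(lr = lr, logLik = ll)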

n models on 1 GPU

Task: training n sjSDMs. Hardware: 1 GPU.

We parallelise the loop by replacing it with 'parSapply' (foreach or future could also be used), so that each CPU worker runs one sjSDM model on our GPU. The number of CPU cores should be chosen to maximise GPU utilisation: e.g. if one model consumes about 20% of the GPU's resources (which can be checked via nvtop), no more than 4-5 CPU cores should be used (of course, this also depends on how many CPU cores are available):

# start 4 worker processes (here: one model uses roughly 20-25% of the GPU)
cl = parallel::makeCluster(4L)

# make the data and the sjSDM package available on every worker
parallel::clusterExport(cl, list("X", "Y"))
parallel::clusterEvalQ(cl, {library(sjSDM)})

# each worker fits its models on the shared GPU
ll = parallel::parSapply(cl, lr, function(l) {
  m = sjSDM(Y, X, learning_rate = l, device = "gpu")
  return(logLik(m))
})
parallel::stopCluster(cl)
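As mentioned above, foreach or future would work just as well; a minimal sketch with the future.apply package (assuming it is installed; globals such as X and Y and the sjSDM package are picked up by the workers automatically):

library(future.apply)

# 4 background R sessions, all sending their models to the same GPU
future::plan(future::multisession, workers = 4L)

ll = future_sapply(lr, function(l) {
  m = sjSDM(Y, X, learning_rate = l, device = "gpu")
  logLik(m)
})

# back to sequential processing
future::plan(future::sequential)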

n models on j GPUs

Task: training n sjSDMs. Hardware: j GPUs.

With more than one GPU, it would be nice to distribute the job among the available GPUs (which the sjSDM_cv function does automatically). The system numbers the GPUs starting at 0, and we can use these numbers to address the individual GPUs. If you have 3 GPUs, they are numbered 0-2 (so the only requirement is to know the total number of GPUs).
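If you do not know the total number of GPUs on the machine, you can for example query the NVIDIA driver from within R (a sketch assuming nvidia-smi is available; nvidia-smi -L prints one line per device):

# one output line per GPU; the device ids are then 0 to (n_gpus - 1)
gpus = system("nvidia-smi -L", intern = TRUE)
n_gpus = length(gpus)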

Let's assume two GPUs and that we want to run 5 jobs on each GPU, which means we need 10 CPU workers. To distribute the jobs equally over the GPUs, we i) give each CPU worker a unique ID, ii) create a simple which-worker-uses-which-GPU look-up table, and iii) use the worker ID and the look-up table to set the device in each iteration:

# 10 workers: 5 per GPU
cl = parallel::makeCluster(10L)

# unique ID for each worker (hostname + process id)
nodes = unlist(parallel::clusterEvalQ(cl, paste(Sys.info()[['nodename']], Sys.getpid(), sep='-')))

# look-up table: each worker gets a GPU; 0:1 is recycled,
# so the workers alternate between GPU 0 and GPU 1
which_gpu = cbind(nodes, 0:1)

parallel::clusterExport(cl, list("X", "Y", "which_gpu"))
parallel::clusterEvalQ(cl, {library(sjSDM)})

ll = parallel::parSapply(cl, lr, function(l) {
  # Who am I:
  myself = paste(Sys.info()[['nodename']], Sys.getpid(), sep='-')

  # get my device from the look-up table:
  device = as.integer(which_gpu[which_gpu[,1] == myself, 2])

  m = sjSDM(Y, X, learning_rate = l, device = device)
  ll = logLik(m)

  # free memory before the worker starts the next model
  rm(m)
  gc()
  return(ll)
})
parallel::stopCluster(cl)

For large jobs (many iterations) it is better to explicitly clean up the environment after each iteration, as done above with rm(m) and gc().
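If you reuse this pattern often, the per-iteration work can be wrapped in a small helper so the cleanup is never forgotten (a sketch; fit_one is a hypothetical helper, not part of sjSDM):

# hypothetical helper: fit one model, extract the log-likelihood, then clean up
fit_one = function(l, device) {
  m = sjSDM(Y, X, learning_rate = l, device = device)
  ll = logLik(m)
  rm(m)
  gc()  # explicitly trigger garbage collection on the worker
  return(ll)
}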