astro-informatics / rcGAN

A Regularized Conditional GAN for Posterior Sampling in Image Recovery Problems

Multi-GPU training #1

Closed tobias-liaudat closed 12 months ago

tobias-liaudat commented 1 year ago

This issue is to track progress on multi-GPU training on SLURM clusters.

MatthijsMars commented 1 year ago

To make the Lightning module work on multiple GPUs (and on multiple nodes) when using the SLURM workload manager, we need to set up the SLURM job script carefully. An example of how to do this can be found here https://pytorch-lightning.readthedocs.io/en/1.2.10/clouds/slurm.html.

In particular, if we want to run on 4 GPUs on one node, we need to request 4 GPUs as well as 4 tasks per node (since Lightning creates 1 task per GPU):

#SBATCH --gres=gpu:4          # n_gpus
#SBATCH --ntasks-per-node=4   # ntasks needs to be same as n_gpus
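
For reference, a full job script might look something like the sketch below. The job name, partition-style resource values, environment name, and the train.py entry point are illustrative assumptions, not taken from this repo; the key lines are the matching --gres and --ntasks-per-node, and launching the training via srun so SLURM creates one task per GPU.

#!/bin/bash
#SBATCH --job-name=rcgan-train   # illustrative job name
#SBATCH --nodes=1
#SBATCH --gres=gpu:4             # n_gpus
#SBATCH --ntasks-per-node=4      # ntasks needs to be same as n_gpus
#SBATCH --cpus-per-task=8        # adjust to your cluster
#SBATCH --time=24:00:00          # adjust to your cluster

# activate your environment (name is an assumption)
source activate rcgan-env

# srun launches one task per GPU; Lightning picks up the SLURM environment
srun python train.py --config config_file.yml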

Additionally, I have created a script, find_batch_size.py, that finds the largest batch_size you can run per GPU. This depends on the VRAM available on the GPU and can therefore vary across machines/nodes. Usage is:

python find_batch_size.py --config [config_file.yml]
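
As a rough illustration of the idea (not necessarily how find_batch_size.py is implemented), one can keep doubling the per-GPU batch size until a forward/backward pass runs out of memory and report the last size that fit. The toy convolutional model below stands in for the actual rcGAN networks.

import torch
import torch.nn as nn


def largest_batch_size(make_model, input_shape, device="cuda", start=1, limit=1024):
    """Return the largest batch size whose forward+backward pass fits in VRAM."""
    best = None
    bs = start
    while bs <= limit:
        model = make_model().to(device)
        try:
            x = torch.randn(bs, *input_shape, device=device)
            model(x).sum().backward()  # include the backward pass: gradients also use VRAM
            best = bs
            bs *= 2
        except RuntimeError as err:    # CUDA out-of-memory surfaces as a RuntimeError
            if "out of memory" not in str(err):
                raise
            break
        finally:
            del model
            torch.cuda.empty_cache()
    return best


if __name__ == "__main__":
    toy = lambda: nn.Sequential(nn.Conv2d(1, 64, 3, padding=1),
                                nn.Conv2d(64, 1, 3, padding=1))
    print("largest batch size:", largest_batch_size(toy, input_shape=(1, 256, 256)))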

Finally, to support larger effective batch sizes we can accumulate gradients over multiple batches. To enable this and set the amount of accumulation, add to your config file:

batch_size: 8               # batch_size per GPU (because of DDP)
accumulate_grad_batches: 2  # updates model after 2 batches per GPU
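
For context, this is roughly how those config values map onto a Lightning Trainer. This is a minimal sketch, not the repo's actual training script: the exact Trainer argument names depend on the Lightning version (2.x style shown), and the config file name is a placeholder.

import yaml
import pytorch_lightning as pl

with open("config_file.yml") as f:
    cfg = yaml.safe_load(f)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                     # matches --gres=gpu:4 / --ntasks-per-node=4
    num_nodes=1,
    strategy="ddp",                # DistributedDataParallel: one process per GPU
    accumulate_grad_batches=cfg["accumulate_grad_batches"],
)
# DataLoaders should use cfg["batch_size"] as the per-GPU batch size, e.g.
# trainer.fit(model, datamodule=dm)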

When using the DistributedDataParallel (DDP) training strategy, the model is replicated exactly on each GPU and each copy sees only a part of the data during the epoch. After processing 1 batch on each of the GPUs, the gradients from all GPUs are averaged and the models are updated. If we use gradient accumulation, the gradients are instead averaged over several such steps. The effective batch size is therefore n_gpus × batch_size × accumulate_grad_batches; with the settings above (4 GPUs, batch_size: 8, accumulate_grad_batches: 2) this gives 4 × 8 × 2 = 64.