Closed: prrw closed this issue 1 week ago
I think it's important to explain that GPUs (and AI accelerators) are massively parallel, but they aren't entirely parallel. The operations the GPU executes are still chunked across its many, many threads, and because the work is chunked there is a degree of serialisation.
This is important because it makes it clearer why doubling the workload will likely double the processing time: once one training run is already keeping the GPU busy, a second run's work largely has to wait its turn.
I would suggest that if you really want to run 2 experiments at once, you run one on GPU 0 and the other on GPU 1, keeping the same global batch size. Running 2 experiments that each use 2 GPUs, as you've done, is slightly slower.
This is because in multi-GPU training the GPUs have to communicate with each other to aggregate the gradients. That aggregation is only needed on multi-GPU runs, and it can be a little slow, especially if you don't have a fast GPU interconnect like NVLink available.
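As a minimal sketch, assuming your two cards show up as devices 0 and 1, that your Docker version supports per-device --gpus selection, and that train.py accepts --gpus=1 (I'm only extrapolating from the --gpus=2 in your command), it could look like the following; keep the global batch size matched to your 2-GPU run via whatever batch-size option train.py exposes:

# Container A, restricted to GPU 0
docker run --gpus '"device=0"' -it --rm -v $(pwd):/scratch --user $(id -u):$(id -g) --workdir=/scratch -e HOME=/scratch sg2ada:latest bash
python train.py --outdir=training_runs --data=datasets/dataset.zip --kimg=5000 --gpus=1

# Container B, restricted to GPU 1
docker run --gpus '"device=1"' -it --rm -v $(pwd):/scratch --user $(id -u):$(id -g) --workdir=/scratch -e HOME=/scratch sg2ada:latest bash
python train.py --outdir=training_runs --data=datasets/dataset.zip --kimg=5000 --gpus=1

Each container then only sees its own GPU, so the two runs can't contend for the same device and there is no gradient aggregation at all.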
There may be other reasons why your training is going slower. For example, with 2 experiments running, the CPU may no longer be able to produce data fast enough to feed the GPUs. I think this is unlikely, as image loading isn't very heavy. You can usually tell if this is the case by running watch -n 0.1 nvidia-smi and seeing the GPU utilisation drop frequently.
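If you'd rather have a record you can scroll back through than a live view, nvidia-smi can also print one sample per interval; a minimal sketch (query fields as listed by nvidia-smi --help-query-gpu):

nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used --format=csv -l 1

Frequent dips towards 0 % utilisation while both containers are training would point at the data pipeline rather than the GPUs themselves.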
Hey @thorinf, thanks for the detailed reply. I think the first reason you mentioned is correct. In fact, I wanted to try out exactly the solution you proposed to see if the GPU is the bottleneck. I ran nvidia-smi and did not find any significant differences, so I will try the training with single GPUs.
Distributing across many GPUs does have an associated cost, but across just 2 GPUs it probably won't be that much. Since you are planning to run two training runs anyway, I'd expect a small speedup from switching each of them to a single GPU, but not a big one.
Describe the bug
I am using the Docker container to train. Since I have GPU and RAM resources to spare even after starting the first training run, I want to run a second training instance in parallel. So I start another instance of the Docker container and launch training there as well. As soon as the training starts in the second instance, training slows down in both containers, to roughly half speed, which means there is no advantage to training in parallel. Upon stopping one of the Docker containers, the training speed goes back to normal.
To Reproduce
Start training in 2 separate Docker containers on Linux:

docker run --gpus all -it --rm -v $(pwd):/scratch --user $(id -u):$(id -g) --workdir=/scratch -e HOME=/scratch sg2ada:latest bash
python train.py --outdir=training_runs --data=datasets/dataset.zip --kimg=5000 --gpus=2
Same command again:
docker run --gpus all -it --rm -v $(pwd):/scratch --user $(id -u):$(id -g) --workdir=/scratch -e HOME=/scratch sg2ada:latest bash
python train.py --outdir=training_runs --data=datasets/dataset.zip --kimg=5000 --gpus=2
Expected behavior
Considering that my system has resources available for both runs, I expect both training processes to run at the same speed, but that is not the case.
Is this Docker related? How can I train in parallel without slowing down training?