I have faced a similar problem overlapping NCCL send with PyTorch training. The backward pass gets slower if I overlap the NCCL send with training using Python multi-threading. Have you figured out a solution to this problem yet?
I have seen exactly the same as @yanc11 and drawn the same conclusion (resource competition). See below the computational portion of the backward pass carried out on one vs. two GPUs in parallel. When training in parallel on two GPUs, the backward computation time increases from 230 ms to 330 ms.
[Screenshot: 1 GPU]
[Screenshot: 2 GPUs (PyTorch DistributedDataParallel with NCCL AllReduce)]
It's not really surprising to me, as NCCL uses a few SMs, plus some of the PCIe bandwidth and CPU bandwidth (if you don't have NVLink), so that could indeed slow down the rest of the compute workload. Besides, it seems you have a lot of NCCL calls, whereas when running NCCL in a non-overlapped manner you would presumably have a single big NCCL allreduce call at the end, which would probably run faster than the sum of all the small operations.
So overlapping is always a tricky balance between how much to aggregate and how much impact it has on the rest of the workload. Sometimes it can be better than not overlapping (with the right tuning of the operation size), sometimes not. In general, I would not try to overlap NCCL operations with the backward pass, since it's a time-consuming process and the performance gain is uncertain.
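For illustration, here is a minimal PyTorch sketch of the two strategies (a sketch only: the model is a placeholder, bucket_cap_mb is DDP's aggregation knob, and a process group is assumed to already be launched with one process per GPU, e.g. via torchrun):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # e.g. launched with torchrun
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
model = torch.nn.Linear(1024, 1024).cuda()       # placeholder model

# Overlapped: DDP buckets gradients and launches an async NCCL allreduce per
# bucket while backward is still running; bucket_cap_mb is the "operation
# size" knob (bigger buckets = fewer, larger NCCL calls).
ddp_model = DDP(model, bucket_cap_mb=25)

# Non-overlapped: let backward finish, then issue one big allreduce over the
# flattened gradients -- no contention with backward, and a single large call.
def allreduce_after_backward(module):
    grads = [p.grad for p in module.parameters() if p.grad is not None]
    flat = torch.cat([g.reshape(-1) for g in grads])
    dist.all_reduce(flat)                         # sums gradients across ranks
    flat /= dist.get_world_size()                 # average them
    offset = 0
    for g in grads:
        g.copy_(flat[offset:offset + g.numel()].view_as(g))
        offset += g.numel()
```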
I am observing a similar problem specific to cudnn::batch_norm_backward (specifically with the NHWC kernel batchnorm_bwtr_nhwc_semiPersist).
[Screenshot: the slow kernel]
[Screenshot: fast batch_norm_backward kernels found right before the slow one, overlapping the same NCCL allreduce]
Do you think this could be a cudnn bug?
The NCHW version, bn_bw_1C11_singleread_specialized, does not exhibit this behavior.
I'm not expert enough on how cudnn works to confirm, but "persistent" kernels usually try to use the whole GPU, so if they don't account for NCCL using a part of the SMs, their performance could be significantly impacted.
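One way to test that SM-contention hypothesis (a sketch of an experiment, not something verified in this thread) would be to cap the number of channels, and hence SMs, that NCCL uses, then re-profile the persistent kernel:

```python
# Sketch: shrink NCCL's SM footprint and re-profile the persistent kernel.
# NCCL_MAX_NCHANNELS and NCCL_NTHREADS are standard NCCL environment
# variables; the values below are arbitrary examples, not recommendations,
# and they must be set before the NCCL communicator is created.
import os
os.environ["NCCL_MAX_NCHANNELS"] = "2"   # fewer channels -> fewer CUDA blocks/SMs used by NCCL
os.environ["NCCL_NTHREADS"] = "128"      # threads per NCCL channel

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # variables take effect when the NCCL communicator is created
```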
I am seeing the same issue.
I used the environment from https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5 to train ResNet-50 on multiple GPUs (with Horovod using NCCL), and found that each training step takes much longer than when training on a single GPU.
Then I profiled with Nsight Systems and found that the batch norm backward kernel, when overlapped with an NCCL allreduce kernel on a different GPU stream, was much slower than the non-overlapped instances (4-8 ms vs. 2 ms).
I also reproduced it without ResNet, i.e. just repeatedly calling batch norm backward and NCCL allreduce from two threads.
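For reference, here is roughly what such a standalone repro looks like as a PyTorch sketch (the training run above used the MXNet/Horovod container, so this is only an approximation for illustration; the exact kernel names depend on the framework and memory format):

```python
# Standalone overlap repro sketch (illustrative, not the exact script):
# one thread repeatedly runs batch norm forward+backward while another
# repeatedly issues an NCCL allreduce, so the kernels overlap on the GPU.
# Run one process per GPU, e.g.: torchrun --nproc_per_node=2 repro.py
# and profile with Nsight Systems: nsys profile --trace=cuda,nvtx ...
import os
import threading
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

# channels_last selects the NHWC cudnn batch norm path; drop it for NCHW.
bn = torch.nn.BatchNorm2d(256).cuda().to(memory_format=torch.channels_last)
x = (torch.randn(64, 256, 56, 56, device="cuda")
     .contiguous(memory_format=torch.channels_last)
     .requires_grad_())
payload = torch.randn(25_000_000, device="cuda")  # ~100 MB allreduce buffer

def bn_backward_loop(iters=200):
    for _ in range(iters):
        y = bn(x)
        y.sum().backward()
        x.grad = None
    torch.cuda.synchronize()

def allreduce_loop(iters=200):
    for _ in range(iters):
        dist.all_reduce(payload)   # PyTorch runs NCCL on its own CUDA stream
    torch.cuda.synchronize()

threads = [threading.Thread(target=bn_backward_loop),
           threading.Thread(target=allreduce_loop)]
for t in threads:
    t.start()
for t in threads:
    t.join()
dist.destroy_process_group()
```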
My question is: why does computation such as batch norm backward get much slower when overlapped with NCCL allreduce? What kind of resources are they competing for? And what can I do to avoid it?