I'm not sure what the request for NCCL is.
We do not support having more than one NCCL rank per GPU because it could hang (CUDA gives no guarantee that both NCCL kernels will run concurrently). Also, the whole topology detection/search system relies on the notion that we need to go through each rank, with one rank per GPU. So it would be a lot of work to support that case even if it were guaranteed to work.
One solution could be to first launch a reduction local to each GPU then reduce with NCCL across the GPUs.
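A minimal sketch of what that two-level scheme might look like with torch.distributed; the rank layout (consecutive ranks sharing a GPU) and the `procs_per_gpu` parameter are assumptions for illustration, not an existing API:

```python
import torch.distributed as dist

# Two-level reduction: gloo within each GPU (since NCCL forbids more than
# one rank per GPU), then NCCL across GPUs via one "leader" rank per GPU.
# Assumed layout: world_size = n_gpus * procs_per_gpu, with ranks
# g*procs_per_gpu .. (g+1)*procs_per_gpu - 1 sharing GPU g.

def make_groups(n_gpus, procs_per_gpu):
    # new_group() is collective: every rank must create every group.
    local = [dist.new_group(list(range(g * procs_per_gpu,
                                       (g + 1) * procs_per_gpu)),
                            backend="gloo")
             for g in range(n_gpus)]
    leaders = [g * procs_per_gpu for g in range(n_gpus)]
    cross = dist.new_group(leaders, backend="nccl")
    return local, cross, leaders

def two_level_all_reduce(tensor, local, cross, leaders, procs_per_gpu):
    rank = dist.get_rank()
    gpu = rank // procs_per_gpu
    dist.all_reduce(tensor, group=local[gpu])     # 1) sum within the GPU
    if rank in leaders:
        dist.all_reduce(tensor, group=cross)      # 2) sum across GPUs (NCCL)
    dist.broadcast(tensor, src=leaders[gpu],      # 3) fan the result back
                   group=local[gpu])
    return tensor
```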
I think using MIG with NCCL might be a great idea, since we could have more global ranks and better distributed scalability. Unfortunately, according to this NVIDIA blog, MIG sub-instances don't support NVLink.
Can we do distributed data parallel on an A100 GPU? I also hit this problem:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729096996/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Although I use the newest PyTorch version, there is also a problem with .cuda():
RuntimeError: CUDA error: operation would make the legacy stream depend on a capturing blocking stream
This error happens when I run code like model.cuda(args.gpu).
MIG works only with the gloo backend for DDP. 😢 It seems that NCCL does not support it.
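For reference, a minimal sketch of the gloo workaround that does run on MIG; the MIG UUID is a placeholder for whatever `nvidia-smi -L` reports:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Pin this process to one MIG compute instance *before* any CUDA call;
# the UUID is a placeholder -- take the real one from `nvidia-smi -L`.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-GPU-xxxxxxxx/1/0"

# RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT come from the launcher.
dist.init_process_group(backend="gloo")  # "nccl" fails on MIG instances

model = torch.nn.Linear(512, 512).cuda()  # the MIG instance appears as cuda:0
ddp_model = DDP(model, device_ids=[0])
```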
Is there any update on NCCL with MIG devices, or on using MIG in distributed training in PyTorch?
Setup

- pytorch:1.7.0-cuda11.0-cudnn8-devel container derivative
- docker, nvidia-docker, GPU drivers

Motivation
Current top-of-the-line GPUs are becoming too powerful and large to train compact networks with 100% utilization. It is desirable, though, to be able to parallelize the workload not only across 4 whole GPUs, but across 8-12 "sub-GPUs" (compute instances) without additional investment in hardware.
With CUDA 11, MIG mostly does not work for multi-GPU training (even though I could use a different CUDA_VISIBLE_DEVICES variable for each process, NCCL refused to run as well), and the A100 mostly remains under-utilized for compact networks, which sucks. See some more details here:
Current behavior
Trying to use NCCL with MIG, or trying to run 2 DDP instances on one GPU, produces an exception (with new drivers and CUDA; with older ones it kind of worked, but was slow):
Config looks something like this:
Trying to use DDP with MIG with a setup like the one below without NCCL is just very slow.
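A rough sketch of such a setup, spawning one gloo worker per MIG compute instance (the UUIDs and the rendezvous address are placeholders):

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

# Placeholder MIG instance UUIDs, one per worker (from `nvidia-smi -L`).
MIG_DEVICES = [
    "MIG-GPU-aaaaaaaa/1/0",
    "MIG-GPU-aaaaaaaa/2/0",
    "MIG-GPU-bbbbbbbb/1/0",
    "MIG-GPU-bbbbbbbb/2/0",
]

def worker(rank):
    # Must be set before the child process touches CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = MIG_DEVICES[rank]
    dist.init_process_group(
        backend="gloo",                       # "nccl" raises invalid usage here
        init_method="tcp://127.0.0.1:23456",  # placeholder rendezvous address
        rank=rank,
        world_size=len(MIG_DEVICES),
    )
    # ... build the model, wrap it in DDP, train ...

if __name__ == "__main__":
    mp.spawn(worker, nprocs=len(MIG_DEVICES))
```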
Desired Behavior
You can use NCCL to train multi-node networks spanning several devices with more than one process per GPU: two processes per physical GPU, one process per MIG compute instance, etc.
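Illustratively, the desired pattern would be plain NCCL initialization with a world size larger than the physical GPU count; the sketch below is what one would like to be able to write, not something that works today:

```python
import os
import torch
import torch.distributed as dist

# RANK / MASTER_ADDR / MASTER_PORT assumed set by the launcher.
rank = int(os.environ["RANK"])

# Desired: e.g. 8 NCCL ranks on 4 physical GPUs (two per GPU, or one
# per MIG compute instance). Today this hangs or raises invalid usage.
dist.init_process_group(backend="nccl", world_size=8, rank=rank)
torch.cuda.set_device(rank % 4)  # two ranks share each physical GPU
```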
Other Solutions / Hacks and Why They Are Bad