NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Feature request - using 2 GPU workers on one large GPU (A100, 3090) using DDP in PyTorch #431

Open snakers4 opened 3 years ago

snakers4 commented 3 years ago

Setup

Motivation

Current top-of-the-line GPUs are becoming too powerful to reach 100% utilization when training compact networks. It would be desirable to parallelize workloads not just across 4 physical GPUs, but across 8 - 12 "sub-GPUs" (or compute instances), without additional investment in hardware.

With CUDA 11, MIG mostly does not work for multi-GPU training (even when I set a different CUDA_VISIBLE_DEVICES value for each process, NCCL refused to run), so for compact networks the A100 mostly remains under-utilized, which sucks.



Current behavior

Trying to use NCCL with MIG, or trying to run 2 DDP processes on one GPU, produces an exception with recent drivers and CUDA (with older ones it kind of worked, but was slow):

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729096996/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

Config looks something like this:

ddp:
  enabled: True
  world_size: 4
  dist_url: 'tcp://127.0.0.1:1550'
  dist_backend: 'nccl'
  devices: [0, 0, 1, 1]
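
For reference, this is roughly how such a config maps onto a DDP launch; the worker function and DEVICES mapping below are an illustrative sketch, not the original code. With two ranks mapped to the same physical GPU, NCCL communicator creation (typically triggered by the first collective) fails with the "invalid usage" error above:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

DEVICES = [0, 0, 1, 1]  # rank -> GPU index; two ranks share each physical GPU

def worker(rank):
    torch.cuda.set_device(DEVICES[rank])
    dist.init_process_group(
        backend='nccl',
        init_method='tcp://127.0.0.1:1550',
        world_size=len(DEVICES),
        rank=rank,
    )
    # NCCL rejects two ranks on one device, so this collective raises
    # "NCCL error ... invalid usage".
    dist.all_reduce(torch.ones(1, device='cuda'))
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(worker, nprocs=len(DEVICES))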

Trying to use DDP with MIG in a setup like the one below, without NCCL, is just very slow.

ddp:
  enabled: True
  world_size: 3
  dist_url: 'tcp://127.0.0.1:1550'
  dist_backend: 'nccl'
  devices: [0, 0, 0]
  mig_devices: ['MIG-GPU-e6aa67a2-1b88-ad15-5bb8-ac9b1228d86f/3/0',
                'MIG-GPU-e6aa67a2-1b88-ad15-5bb8-ac9b1228d86f/5/0',
                'MIG-GPU-e6aa67a2-1b88-ad15-5bb8-ac9b1228d86f/6/0']
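
For context, the MIG setup above amounts to something like the sketch below (the worker function is illustrative, not the original code): each spawned process pins itself to one MIG compute instance via CUDA_VISIBLE_DEVICES before its first CUDA call, and falls back to the gloo backend because NCCL refuses to initialize on MIG instances:

import os
import torch.multiprocessing as mp

MIG_DEVICES = ['MIG-GPU-e6aa67a2-1b88-ad15-5bb8-ac9b1228d86f/3/0',
               'MIG-GPU-e6aa67a2-1b88-ad15-5bb8-ac9b1228d86f/5/0',
               'MIG-GPU-e6aa67a2-1b88-ad15-5bb8-ac9b1228d86f/6/0']

def worker(rank):
    # Must be set before CUDA is initialized in this process, so that
    # 'cuda:0' inside the process is the chosen MIG compute instance.
    os.environ['CUDA_VISIBLE_DEVICES'] = MIG_DEVICES[rank]
    import torch.distributed as dist
    dist.init_process_group(
        backend='gloo',  # 'nccl' fails with "invalid usage" on MIG
        init_method='tcp://127.0.0.1:1550',
        world_size=len(MIG_DEVICES),
        rank=rank,
    )
    # ... build the model on 'cuda:0' and wrap it in DistributedDataParallel ...
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(worker, nprocs=len(MIG_DEVICES))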

Desired Behavior

It should be possible to use NCCL to train multi-node networks spanning several devices with more than one process per GPU, i.e.:

etc

Other Solutions / Hacks and Why They Are Bad

sjeaugey commented 3 years ago

I'm not sure what the request for NCCL is.

We do not support having more than one NCCL rank per GPU because it could hang (CUDA gives no guarantee that both NCCL kernels will run concurrently). Also, the whole topology detection/search system relies on there being exactly one rank per GPU, so it would be a lot of work to support that case even if it were guaranteed to work.

One solution could be to first run a reduction locally on each GPU, then reduce across GPUs with NCCL.
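
As a minimal sketch of that idea, assuming one process (and one NCCL rank) per GPU that hosts two model replicas, with grads_a and grads_b as hypothetical per-replica gradient lists:

import torch.distributed as dist

def hierarchical_allreduce(grads_a, grads_b):
    # Step 1: the intra-GPU reduction is a plain add, no NCCL involved.
    # Step 2: the inter-GPU reduction uses the usual one-rank-per-GPU NCCL path.
    for ga, gb in zip(grads_a, grads_b):
        ga.add_(gb)                          # local reduction on this GPU
        dist.all_reduce(ga)                  # NCCL all-reduce across GPUs
        ga.div_(2 * dist.get_world_size())   # average over all replicas
        gb.copy_(ga)                         # share the result with replica b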

xutianming commented 3 years ago

I think using MIG with NCCL might be a great idea, since we could have more global ranks and better distributed scalability. Unfortunately, according to this NVIDIA blog, MIG sub-instances don't support NVLink.

ljz756245026 commented 3 years ago

Can we do distributed data parallel on an A100 GPU? I also hit this problem:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729096996/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

Although I use the newest PyTorch version, there is also a .cuda() problem:

RuntimeError: CUDA error: operation would make the legacy stream depend on a capturing blocking stream

Such a problem happens when I run code like model.cuda(args.gpu).

Hzzone commented 3 years ago

MIG works only with the gloo backend for DDP. 😢 It seems that NCCL does not support it.

AidenDurrant commented 2 years ago

Is there any update on NCCL with MIG devices, or on using MIG for distributed training in PyTorch?