NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Computation overlapped with NCCL gets much slower #338

Open yanc11 opened 4 years ago

yanc11 commented 4 years ago

I used the environment from https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5 to train ResNet-50 on multiple GPUs (with Horovod using NCCL), and found that the duration of each training step is much longer than when training on a single GPU.

Then I profiled with Nsight Systems, and found that the batch norm backward kernel, when overlapped with the NCCL allreduce kernel on a different GPU stream, was much slower than the non-overlapped ones (4~8 ms vs 2 ms), like this: [Nsight Systems timeline screenshot]
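
(For reference, a trace like this can be captured from the command line with something like nsys profile --trace=cuda,nvtx -o report <training command>; the output name and traced domains are just example choices.)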

I also reproduced it without ResNet, by only calling batch norm backward and NCCL allreduce in two threads, like this:

import mxnet as mx
import mxnet.ndarray as nd
import numpy as np
from mxnet.gluon import nn
from mxnet import autograd as ag
import horovod.mxnet as hvd
import _thread

def build_model():
    # Stack of fused BatchNorm+ReLU layers in NHWC layout (act_type='relu' comes
    # from the NVIDIA MXNet container linked above).
    net = nn.HybridSequential()
    for i in range(20):
        net.add(nn.BatchNorm(axis=3, momentum=0.9, epsilon=1e-5, act_type='relu'))
    net.hybridize(static_shape=True, static_alloc=True)
    net.cast('float16')
    return net

def nccl_allreduce():
    # Keep an NCCL allreduce of a 25M-element fp16 tensor running in a loop
    # (roughly the size of ResNet-50's gradients).
    param_num = 25 * 1024 * 1024  # 25M
    data = np.random.uniform(-1, 1, [param_num])
    m = nd.array(data, dtype='float16', ctx=mx.gpu(hvd.local_rank()))
    while True:
        reduced = hvd.allreduce(m)
        reduced.wait_to_read()

def bn():
    # Repeatedly run the batch norm forward and backward passes on the same GPU.
    model = build_model()
    model.initialize(mx.init.Initializer(), mx.gpu(hvd.local_rank()))
    while True:
        data2 = np.random.uniform(-1, 1, [192, 112, 112, 4])  # NHWC
        x = nd.array(data2, dtype='float16', ctx=mx.gpu(hvd.local_rank()))
        x.attach_grad()
        with ag.record():
            z = model(x)
        dx = ag.grad(z, [x])

if __name__ == "__main__":
    hvd.init()
    # Run the allreduce loop in a background thread so it overlaps with the BN loop.
    _thread.start_new_thread(nccl_allreduce, ())
    bn()
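
(Assuming the script above is saved as, say, repro.py, a two-GPU run would be launched with Horovod's launcher, e.g. horovodrun -np 2 python repro.py, so that each process drives one GPU via hvd.local_rank().)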

My question is: why does computation such as batch norm get much slower when it is overlapped with an NCCL allreduce? What kind of resources are they competing for, and what can I do to avoid it?

gudiandian commented 4 years ago

I have faced a similar problem when overlapping NCCL send with PyTorch training: the backward pass gets slower if I overlap them using Python multi-threading. Have you figured out a solution for this problem yet?

david-macleod commented 4 years ago

I have seen exactly the same as @yanc11 and drawn the same conclusion (resource competition). See below the computational portion of the backward pass carried out on one vs. two GPUs in parallel. When training in parallel on two GPUs, the backward computation time increases from 230 ms to 330 ms.

1 GPU: [screenshot]

2 GPUs (PyTorch DistributedDataParallel with NCCL AllReduce): [screenshot]

sjeaugey commented 4 years ago

It's not really surprising to me: NCCL uses a few SMs, and also some of the PCIe and CPU bandwidth (if you don't have NVLink), so it can indeed slow down the rest of the compute workload. Besides, it seems you have a lot of NCCL calls, whereas when running NCCL in a non-overlapped manner you would presumably have a single big NCCL allreduce call at the end, which would probably run faster than the sum of all the small operations.

So overlapping is always a tricky balance between how much to aggregate and how much impact it has on the rest. Sometimes it can be better than not overlapping (with the right tuning of operation size), sometimes not. So in general, I would not try to overlap NCCL operations with the backward pass, since it's a time-consuming process and the performance gain is uncertain.
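
As a rough illustration of the knobs involved (the values below are illustrative assumptions, not tuned recommendations), one can cap the number of channels, and therefore SMs, that NCCL uses, and let Horovod fuse more gradients into each allreduce. Setting these in the launcher's shell environment is the more robust option; a minimal in-process sketch:

import os

# Cap how many channels (and therefore SMs) NCCL may use for its kernels.
os.environ.setdefault("NCCL_MIN_NCHANNELS", "2")
os.environ.setdefault("NCCL_MAX_NCHANNELS", "4")

# Let Horovod batch more gradients into each NCCL call instead of issuing many small ones.
os.environ.setdefault("HOROVOD_FUSION_THRESHOLD", str(128 * 1024 * 1024))  # bytes
os.environ.setdefault("HOROVOD_CYCLE_TIME", "5")  # milliseconds between fusion cycles

import horovod.mxnet as hvd
hvd.init()  # the variables must be set before Horovod/NCCL initialize

Whether this helps is workload-dependent: fewer NCCL channels mean less contention with compute kernels, but also lower allreduce bandwidth.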

zhengwy888 commented 2 years ago

I am observing a similar problem specific to cudnn::batch_norm_backward (specifically with the NHWC kernel batchnorm_bwtr_nhwc_semiPersist).

Slow kernel: [screenshot]

Fast batch_norm_backward kernels found right before the slow one, overlapping the same NCCL allreduce: [screenshot]

Do you think this could be a cudnn bug?

The NCHW version bn_bw_1C11_singleread_specialized does not exhibit this behavior.

sjeaugey commented 2 years ago

I'm not expert enough on how cudnn works to confirm, but "persistent" kernels usually try to use the whole GPU, so if they don't account for NCCL using a part of the SMs, their performance could be significantly impacted.

aazzolini commented 1 year ago

I am seeing the same issue.