christopherhesse opened this issue 3 years ago
I am not sure why your program needs to create communicators num_ranks times. The initialization time is expected to grow quadratically in that case: each communicator's init cost itself grows with the number of ranks, and you repeat it num_ranks times.
Sorry, the actual jobs we run create a different set of communicators. We need to create more like 7 communicators of different sizes, my point was that it seems to be pretty slow relative to other NCCL operations and we spend a decent amount of startup time just waiting for NCCL communicators to be created.
Here are the numbers for 7 communicators:
num_ranks=8
CREATED 7 comms in 10.403s
num_ranks=32
CREATED 7 comms in 17.701s
Indeed that looks better than the previous case. Thanks. In order to see if there is still room for improvement, can you share some information about the machines?
nvidia-smi topo -m
or, to get more detailed topology information, dump a topology file from NCCL by setting NCCL_TOPO_DUMP_FILE=topo.xml (if you are using NVIDIA's DGX products, we already have the topology and there is no need to dump it). Could you also set NCCL_DEBUG=INFO and share the log file? We can check if there is anything abnormal in the initialization process. Thanks!
Our jobs use a large number of machines, but it looks like even the single node case is slow, so seems easiest to focus on that for the moment.
Here's a single machine run of the following script:
import os
os.environ["NCCL_TOPO_DUMP_FILE"] = "topo.xml"
os.environ["NCCL_DEBUG"] = "INFO"
import torch
import time
from mpi4py.MPI import COMM_WORLD as pycomm
from cupy.cuda import nccl
rank = pycomm.rank
world_size = pycomm.size
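# pin each rank to a single GPU; device indices below are then relative to CUDA_VISIBLE_DEVICES, hence set_device(0)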
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % 8)
torch.cuda.set_device(0)
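# rank 0 generates one NCCL unique id per communicator and broadcasts the list to every rank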
ids = pycomm.bcast([nccl.get_unique_id() for _ in range(world_size+1)] if rank == 0 else None, root=0)
# create a single communicator in case there is some warmup time
nccl.NcclCommunicator(world_size, ids.pop(0), rank)
pycomm.Barrier()
start = time.time()
comms = []
for i in range(7):
    comm = nccl.NcclCommunicator(world_size, ids[i], rank)
    comms.append(comm)
pycomm.Barrier()
duration = time.time() - start
print(f"CREATED {len(comms)} comms in {duration:0.3f}s", flush=True)
I switched it to make 1 communicator first in case there was some warmup for the first communicator. It still takes about 1 second per communicator within a single machine.
That was with mpiexec -n 8, and these are 8-GPU machines.
I timed the graph search on your system's topology (using the provided XML, thanks) and I see it takes ~100 ms for intra-node and ~300 ms for inter-node. This is surely a part of the 1s total, but doesn't explain everything.
Did you try to run the benchmark without setting NCCL_DEBUG=INFO? That can make init much slower.
Thanks for looking into it!
Still 6.9 seconds even with NCCL_DEBUG not set. Here's an Nsight Systems trace of one of the inits:
The first 400ms appears to be lots of tiny calls to things like nvmlDeviceGetNvLinkRemotePciInfo, nvmlDeviceGetNvLinkCapability, and nvmlDeviceGetNvLinkState. The two slow ioctls are in ibv_reg_mr_iova.
Then there's a gap of 200ms which could be the graph search you're talking about (this is intra-node with 8 procs so not sure why it's 200ms instead of 100ms).
Then there's a 530ms block with a couple of slow accept calls (7 of the top 10 events by duration are accept calls, for a total of 200ms), with some time in ioctl, maybe waiting for cudaIpcOpenMemHandle to complete.
Thanks for the feedback. At this point, I'd say that the time you see seems normal. We never really tried to optimize the init performance, just tried to keep it reasonable as we scale to large numbers of ranks.
Now if applications start to create many NCCL communicators, I agree it would be good to see if we can optimize the process a bit, to a few 100 ms rather than seconds.
On a different cluster (with a different machine type), I got the following time for a single machine:
CREATED 7 comms in 31.316s
For 16 machines:
CREATED 7 comms in 41.674s
Do you think that is worth more investigation? mpiexec just running "x = 1" across 16 machines seems to take < 200ms, though I'm not creating any new MPI groups or anything (so mpiexec startup time seems to be a relatively low percentage of overall startup time).
If each NCCL init takes 5-6 seconds, indeed, that's a lot more than it should. It would be good to figure out what takes so much time; there might be something wrong either with the setup or in NCCL. In any case, we'd want to know about it.
Things you might want to check are that persistence mode is enabled and that the out-of-band (OOB) network used by NCCL is fast (controlled by NCCL_SOCKET_IFNAME).
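For illustration, a minimal sketch of wiring those two checks into a benchmark script like the one above; the interface name is a placeholder, not something taken from this thread:

import os
import subprocess

# Out-of-band (bootstrap) interface used by NCCL; "eth0" is a placeholder,
# substitute a fast NIC on your nodes. Must be set before the first
# communicator is created.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

# Same information as `nvidia-smi -q | grep Persist`, one line per GPU.
persist = subprocess.run(
    ["nvidia-smi", "--query-gpu=persistence_mode", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(persist, flush=True)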
nvidia-smi -q | grep Persist
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
The OOB network should be fairly fast, though does that still matter in the single node case? The single node case on this cluster is still taking > 30s for 7 communicators.
That's indeed way slower than it should be. If you can profile the init time as you did above it would be very helpful to understand what part takes that much time.
Nothing stands out to me; it just looks like a bunch of tiny ops (mostly ioctl):
The "maybe graph search" section is also 600ms instead of 200ms, so it's like everything is just 3+ times slower. This is different hardware though, so maybe the different configuration accounts for the change.
We are also experiencing long initialisation times in the XGBoost project. E.g., in our case a DGX system with 8 V100 GPUs takes 4 seconds to initialise. While not critical for larger jobs, NCCL by itself significantly increases the latency of our application when smaller jobs are launched.
Just wanted to add my 'vote' that I think this issue is important.
I'd also like to report what seems to be a similar issue: we see very high startup times for PyTorch jobs.
A trivial reproducer is this (assume an 8-GPU H100 system, with all 8 ranks running on 1 machine):
import time
import torch.distributed as dist

dist.init_process_group(
    backend='nccl',
    rank=get_process_rank(),  # 8 procs with rank in range 0 - 7
    world_size=8,
    init_method="file:///tmp/rvd_file",  # using ip/port yields similar timings
)
t1 = time.time()
dist.barrier()
t2 = time.time()
print(t2 - t1)
On our system this takes ~18 seconds. Setting NCCL_MAX_NCHANNELS=1 reduces the penalty to 7.5s to pass the barrier.
We encourage you to retry with NCCL 2.22, as optimizing the init time was one of the focus areas for that release. Not only should the init time be shorter, but NCCL_DEBUG=INFO will also provide fine-grained timing info that should help explain what's going on.
Thanks for the quick response @kiskra-nvidia! Just confirming that 2.22 does improve things. Timings now look like this; would you mind confirming whether this is in line with current best expectations?
Time to pass the barrier (single machine, 8 x H100) is reduced from the ~18 seconds reported above; the new timings are recorded below.
Yeah, the 2.5s for comm init looks healthy. You can still try cutting down on the time of the first barrier via other tricks (you already use NCCL_MAX_NCHANNELS; does NCCL_NVLS_ENABLE=0 have any effect?) but keep in mind that, depending on your workload, that could speed up the initialization but slow down the communication...
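For illustration, a minimal sketch of applying those environment tweaks; the values are just the ones mentioned in this thread, and whether they are worth the potential communication slowdown depends on the workload:

import os

# Both variables must be in the environment before the first NCCL communicator
# is created, i.e. before init_process_group() / the first barrier() in the
# reproducer above.
os.environ["NCCL_MAX_NCHANNELS"] = "1"  # fewer channels: faster init, possibly lower bandwidth
os.environ["NCCL_NVLS_ENABLE"] = "0"    # skip NVLS (NVLink SHARP) setup

# ...then run the same init_process_group() / dist.barrier() reproducer as above.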
thanks again @kiskra-nvidia
For the record, with NCCL 2.22:
NCCL_DEBUG-reported init timing, no tweaks: 2.44 s
Same test, which does an init_process_group() and passes one barrier(), no tweaks: 12.244 s
Noted re: your comment on the possible performance impact; that's understood.
Creating a new communicator seems surprisingly slow and is a substantial portion of job startup time for our jobs.
Here's a short script illustrating the issue using cupy's NCCL bindings. With 8 ranks, this takes 16.3s, and with 32 ranks it takes 130 seconds. The actual jobs we run create a different set of communicators than this (this is just the same configuration num_ranks times), but hopefully this illustrates the issue.
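As a rough illustration of that pattern (not the original script): num_ranks communicators, each spanning all ranks, using the same mpi4py/cupy setup as the later script in this thread and assuming 8 GPUs per node:

import os
import time
from mpi4py.MPI import COMM_WORLD as pycomm

rank = pycomm.rank
world_size = pycomm.size
# pin each rank to one GPU before any CUDA context exists (assumes 8 GPUs per node)
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % 8)

from cupy.cuda import nccl  # imported after CUDA_VISIBLE_DEVICES is set

# rank 0 generates one NCCL unique id per communicator and broadcasts them
ids = pycomm.bcast(
    [nccl.get_unique_id() for _ in range(world_size)] if rank == 0 else None, root=0)

pycomm.Barrier()
start = time.time()
# num_ranks communicators, each covering all ranks -- the pattern described above
comms = [nccl.NcclCommunicator(world_size, ids[i], rank) for i in range(world_size)]
pycomm.Barrier()
print(f"CREATED {len(comms)} comms in {time.time() - start:0.3f}s", flush=True)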
We are using CUDA 11.3, V100 GPUs, and Infiniband.
Any idea why this operation is so slow? Is it inherent to how NCCL works?