christopherhesse opened this issue 3 years ago
I am not sure why your program needs to create communicators num_ranks times. The initialization time is expected to grow quadratically in that case: each communicator's init cost itself grows with the number of ranks, and you repeat it num_ranks times.
Sorry, the actual jobs we run create a different set of communicators. We need to create more like 7 communicators of different sizes, my point was that it seems to be pretty slow relative to other NCCL operations and we spend a decent amount of startup time just waiting for NCCL communicators to be created.
Here are the numbers for 7 communicators:
num_ranks=8
CREATED 7 comms in 10.403s
num_ranks=32
CREATED 7 comms in 17.701s
Indeed that looks better than the previous case. Thanks. In order to see if there is still room for improvement, can you share some information about the machines?
nvidia-smi topo -m
or, to get more detailed topology information, dump a topology file from NCCL by setting NCCL_TOPO_DUMP_FILE=topo.xml (if you are using NVIDIA's DGX products, we already have the topology and there is no need to dump it). Could you also set NCCL_DEBUG=INFO and share the log file? We can check if there is anything abnormal in the initialization process. Thanks!
Our jobs use a large number of machines, but it looks like even the single node case is slow, so seems easiest to focus on that for the moment.
Here's a single machine run of the following script:
import os
os.environ["NCCL_TOPO_DUMP_FILE"] = "topo.xml"
os.environ["NCCL_DEBUG"] = "INFO"
import torch
import time
from mpi4py.MPI import COMM_WORLD as pycomm
from cupy.cuda import nccl
rank = pycomm.rank
world_size = pycomm.size
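# pin each rank to a single GPU; device indices below are then relative to CUDA_VISIBLE_DEVICES, hence set_device(0)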
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % 8)
torch.cuda.set_device(0)
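# rank 0 generates one NCCL unique id per communicator and broadcasts the list to every rank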
ids = pycomm.bcast([nccl.get_unique_id() for _ in range(world_size+1)] if rank == 0 else None, root=0)
# create a single communicator in case there is some warmup time
nccl.NcclCommunicator(world_size, ids.pop(0), rank)
pycomm.Barrier()
start = time.time()
comms = []
for i in range(7):
    comm = nccl.NcclCommunicator(world_size, ids[i], rank)
    comms.append(comm)
pycomm.Barrier()
duration = time.time() - start
print(f"CREATED {len(comms)} comms in {duration:0.3f}s", flush=True)
I switched it to make 1 communicator first in case there was some warmup for the first communicator. It still takes about 1 second per communicator within a single machine.
That was with mpiexec -n 8, and these are 8-GPU machines.
I timed the graph search on your system's topology (using the provided XML, thanks) and I see it takes ~100 ms for intra-node and ~300 ms for inter-node. This is surely a part of the 1s total, but doesn't explain everything.
Did you try to run the benchmark without setting NCCL_DEBUG=INFO? That can make init much slower.
Thanks for looking into it!
Still 6.9 seconds even with NCCL_DEBUG not set. Here's an Nsight Systems trace of one of the inits:
The first 400ms appears to be lots of tiny calls to things like nvmlDeviceGetNvLinkRemotePciInfo, nvmlDeviceGetNvLinkCapability, and nvmlDeviceGetNvLinkState. The two slow ioctls are in ibv_reg_mr_iova.
Then there's a gap of 200ms which could be the graph search you're talking about (this is intra-node with 8 procs so not sure why it's 200ms instead of 100ms).
Then there's a 530ms block with a couple of slow accept calls (7 of the top 10 events by duration are accept calls, for a total of 200ms), with some time in ioctl, maybe waiting for cudaIpcOpenMemHandle to complete.
Thanks for the feedback. At this point, I'd say that the time you see seems normal. We never really tried to optimize the init performance, just tried to keep it reasonable as we scale to large numbers of ranks.
Now if applications start to create many NCCL communicators, I agree it would be good to see if we can optimize the process a bit, to a few 100 ms rather than seconds.
On a different cluster (with a different machine type), I got the following time for a single machine:
CREATED 7 comms in 31.316s
For 16 machines:
CREATED 7 comms in 41.674s
Do you think that is worth more investigation? mpiexec just running "x = 1" across 16 machines seems to take < 200ms, though I'm not creating any new MPI groups or anything (so mpiexec startup time seems to be a relatively low percentage of overall startup time).
If each NCCL init takes 5-6 seconds, indeed, that's a lot more than it should. It would be good to figure out what takes so much time; there might be something wrong either with the setup or in NCCL. In any case, we'd want to know about it.
Things you might want to check are that persistence mode is enabled and that the out-of-band (OOB) network used by NCCL is fast (controlled by NCCL_SOCKET_IFNAME).
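For illustration, a minimal sketch of wiring those two checks into a benchmark script like the one above; the interface name is a placeholder, not something taken from this thread:

import os
import subprocess

# Out-of-band (bootstrap) interface used by NCCL; "eth0" is a placeholder,
# substitute a fast NIC on your nodes. Must be set before the first
# communicator is created.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

# Same information as `nvidia-smi -q | grep Persist`, one line per GPU.
persist = subprocess.run(
    ["nvidia-smi", "--query-gpu=persistence_mode", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(persist, flush=True)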
nvidia-smi -q | grep Persist
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
The OOB network should be fairly fast, though does that still matter in the single node case? The single node case on this cluster is still taking > 30s for 7 communicators.
That's indeed way slower than it should be. If you can profile the init time as you did above it would be very helpful to understand what part takes that much time.
Nothing stands out to me; it just looks like a bunch of tiny ops (mostly ioctl):
The "maybe graph search" section is also 600ms instead of 200ms, so it's like everything is just 3+ times slower. This is different hardware though, so maybe the different configuration accounts for the change.
We are also experiencing long initialisation times in the XGBoost project. E.g., in our case a DGX system with 8 V100 GPUs takes 4 seconds to initialise. While not critical for larger jobs, NCCL by itself significantly increases the latency of our application when smaller jobs are launched.
Just wanted to add my 'vote' that I think this issue is important.
I'd also like to report what seems to be a similar issue: we see very high startup times for PyTorch jobs.
A trivial reproducer is this (assume an 8-GPU H100 system, with all 8 ranks running on 1 machine):
import time
import torch.distributed as dist

dist.init_process_group(
    backend='nccl',
    rank=get_process_rank(),  # 8 procs with rank in range 0 - 7
    world_size=8,
    init_method="file:///tmp/rvd_file",  # using ip/port yields similar timings
)
t1 = time.time()
dist.barrier()
t2 = time.time()
print(t2 - t1)
On our system this takes ~18 seconds. Setting NCCL_MAX_NCHANNELS=1 reduces the penalty to 7.5s to pass the barrier.
We encourage you to retry with NCCL 2.22, as optimizing the init time was one of the focus areas for that release. Not only should the init time be shorter, but NCCL_DEBUG=INFO will also provide fine-grained timing info that should help explain what's going on.
Thanks for the quick response @kiskra-nvidia! Just confirming that 2.22 does improve things. Timings now look like this; would you mind confirming whether this is in line with current best expectations?
Time to pass the barrier (single machine, 8 x H100) is reduced from the ~18 seconds reported above; the new timings are recorded below.
Yeah, the 2.5s for comm init looks healthy. You can still try cutting down on the time of the first barrier via other tricks (you already use NCCL_MAX_NCHANNELS; does NCCL_NVLS_ENABLE=0 have any effect?) but keep in mind that, depending on your workload, that could speed up the initialization but slow down the communication...
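For illustration, a minimal sketch of applying those environment tweaks; the values are just the ones mentioned in this thread, and whether they are worth the potential communication slowdown depends on the workload:

import os

# Both variables must be in the environment before the first NCCL communicator
# is created, i.e. before init_process_group() / the first barrier() in the
# reproducer above.
os.environ["NCCL_MAX_NCHANNELS"] = "1"  # fewer channels: faster init, possibly lower bandwidth
os.environ["NCCL_NVLS_ENABLE"] = "0"    # skip NVLS (NVLink SHARP) setup

# ...then run the same init_process_group() / dist.barrier() reproducer as above.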
thanks again @kiskra-nvidia
For the record, with NCCL 2.22:
NCCL_DEBUG-reported init timing, no tweaks: 2.44 s
Same test, which does an init_process_group() and passes one barrier(), no tweaks: 12.244 s
Noted re: your comment on the possible performance impact; that's understood.
Creating a new communicator seems surprisingly slow and is a substantial portion of job startup time for our jobs.
Here's a short script illustrating the issue using cupy's NCCL bindings. With 8 ranks, this takes 16.3s, and with 32 ranks it takes 130 seconds. The actual jobs we run create a different set of communicators than this (this is just the same configuration num_ranks times), but hopefully this illustrates the issue.
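As a rough illustration of that pattern (not the original script): num_ranks communicators, each spanning all ranks, using the same mpi4py/cupy setup as the later script in this thread and assuming 8 GPUs per node:

import os
import time
from mpi4py.MPI import COMM_WORLD as pycomm

rank = pycomm.rank
world_size = pycomm.size
# pin each rank to one GPU before any CUDA context exists (assumes 8 GPUs per node)
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % 8)

from cupy.cuda import nccl  # imported after CUDA_VISIBLE_DEVICES is set

# rank 0 generates one NCCL unique id per communicator and broadcasts them
ids = pycomm.bcast(
    [nccl.get_unique_id() for _ in range(world_size)] if rank == 0 else None, root=0)

pycomm.Barrier()
start = time.time()
# num_ranks communicators, each covering all ranks -- the pattern described above
comms = [nccl.NcclCommunicator(world_size, ids[i], rank) for i in range(world_size)]
pycomm.Barrier()
print(f"CREATED {len(comms)} comms in {time.time() - start:0.3f}s", flush=True)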
We are using CUDA 11.3, V100 GPUs, and Infiniband.
Any idea why this operation is so slow? Is it inherent to how NCCL works?