NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Pack bootstrapAllGather's [2] #1156

Open kwen2501 opened 8 months ago

kwen2501 commented 8 months ago

The bootstrapAllGather call at https://github.com/NVIDIA/nccl/blob/b6d7438d3145a619f924dbbca6c96db21fab716e/src/init.cc#L617 sits inside a for loop: https://github.com/NVIDIA/nccl/blob/b6d7438d3145a619f924dbbca6c96db21fab716e/src/init.cc#L599

To reduce the number of bootstrapAllGather calls at scale, would it be better to break out of the for loop at bootstrapAllGather, pack the data from all iterations into a single bootstrapAllGather, and re-enter the for loop afterwards?
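
A minimal sketch of the packing idea (this is not NCCL code; `allgather` below is a hypothetical stand-in for bootstrapAllGather, assumed to exchange a fixed number of bytes per rank, and the payload fields are placeholders):

```c
#include <string.h>

/* Placeholder for the per-iteration payload that the loop currently exchanges
 * with one bootstrapAllGather per iteration (fields are illustrative only). */
typedef struct { int ringPrev, ringNext, pad[2]; } Payload;

/* Hypothetical wrapper standing in for bootstrapAllGather: gathers
 * bytesPerRank bytes from every rank into buf (buf holds nranks slots). */
extern int allgather(void* buf, int bytesPerRank);

/* Pack every iteration's payload into one slab and exchange it once.
 * `global` must have room for nranks * iters entries. */
int packedAllGather(const Payload* local, Payload* global, int iters, int rank) {
  int bytesPerRank = iters * (int)sizeof(Payload);
  memcpy((char*)global + rank * bytesPerRank, local, bytesPerRank);
  return allgather(global, bytesPerRank);   /* one call instead of `iters` calls */
}
```

The for loop would then be re-entered and read its per-iteration data from the packed result instead of issuing its own allgather.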

sjeaugey commented 8 months ago

It would be interesting to benchmark bootstrapAllGather at scale and see what order of magnitude we should expect depending on the number of ranks. That way we'd have a good understanding of potential issues at scale (or not). Do you anticipate (or even experience) that at scale, the bootstrap allgather time grows so much that it becomes a significant portion of the init time?
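
Since bootstrapAllGather is not callable from user code, a rough proxy would be to time ncclCommInitRank per rank at different scales; a minimal sketch, assuming the usual out-of-band broadcast of the ncclUniqueId is already in place:

```c
#include <nccl.h>
#include <stdio.h>
#include <time.h>

static double nowSec(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Times the whole init; the bootstrap allgathers are only one component of
 * it, so this gives an upper bound on their contribution. */
ncclResult_t timedCommInit(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  double t0 = nowSec();
  ncclResult_t res = ncclCommInitRank(comm, nranks, id, rank);
  printf("rank %d/%d: ncclCommInitRank took %.3f s (result %d)\n",
         rank, nranks, nowSec() - t0, (int)res);
  return res;
}
```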

kwen2501 commented 8 months ago

Thanks @sjeaugey. We are seeing a 2.5-minute init time at large scale (e.g. beyond some thousands of GPUs). We would like to see if there is a way to drive this down so that the infrastructure's "idle" time is as small as possible. This "idling" may occur at the initial training launch or when training is restarted (e.g. to recover from a GPU failure, the rate of which is non-negligible at large scale).

The 2.5-min init time is from the default init path, so indeed neither #1155 nor #1156 would help with it. I am just checking those in, in case a user triggers ncclCommSplit (#1155) or SHARP (#1156).

That said, it would be much appreciated if the init time of the default path could be minimized (that is the more important case). One thought: could the user give NCCL a guarantee (e.g. via an env var) that all GPUs' detection results will be homogeneous, so that the following code could be based on the local detection result alone? https://github.com/NVIDIA/nccl/blob/b6d7438d3145a619f924dbbca6c96db21fab716e/src/init.cc#L1027-L1034
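
Purely to illustrate that idea (the env var name and struct fields below are invented for the sketch, not existing NCCL knobs): if the user asserted homogeneity, the post-allgather combination step could trust the local detection result instead of merging results from all ranks.

```c
#include <stdlib.h>

/* Illustrative subset of the per-rank detection result carried by
 * allGather3Data (field names are placeholders, not the actual struct). */
struct GraphInfo { int nChannels; int sameChannels; };

void combineGraphInfo(struct GraphInfo* out, const struct GraphInfo* all,
                      int nranks, int rank) {
  if (getenv("NCCL_ASSUME_HOMOGENEOUS_TOPO") != NULL) {  /* hypothetical knob */
    *out = all[rank];  /* user guarantees every rank detected the same thing */
    return;
  }
  /* Otherwise combine across ranks, in the spirit of the existing loop over
   * allGather3Data (keeping the most conservative value per field). */
  *out = all[0];
  for (int r = 1; r < nranks; r++) {
    if (all[r].nChannels < out->nChannels) out->nChannels = all[r].nChannels;
    if (all[r].sameChannels < out->sameChannels) out->sameChannels = all[r].sameChannels;
  }
}
```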

Of course, the bootstrapAllGather of allGather3Data includes not only graphInfo but also topoRanks, so we cannot simply drop this allgather. It may seem like topoRanks could be all-gathered by the node heads only, but that would require an inter-node bootstrap comm. I am not sure it is worth it, since creating the inter-node comm also takes time.

Just wild thoughts. The team may have better ideas :) Thanks!

sjeaugey commented 8 months ago

Ok, thanks for the feedback. 2.5 minutes indeed seems large to me, and it would be good to have some sort of timing statistics throughout ncclCommInit to find out what is taking time and how much it grows as we scale.

Concretely, the main reasons for time spent in ncclCommInit that I can think of are:

So I'd really be curious to confirm what phase is growing in time as we scale. There is probably something I'm missing.

kwen2501 commented 8 months ago

Thanks @sjeaugey for the breakdown. We will try to gather detailed timing statistics (using TRACE=1) and report back.
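
For what it's worth, a sketch of how we plan to capture those statistics (this assumes NCCL is built with `make TRACE=1` so that the TRACE debug level actually emits output; the env vars are existing NCCL knobs, but the helper itself is just illustrative and must run before the first NCCL call):

```c
#include <stdio.h>
#include <stdlib.h>

/* Route verbose init-phase logging to one file per rank. */
void enableInitTracing(int rank) {
  char fname[64];
  snprintf(fname, sizeof(fname), "nccl-init-trace.rank%d.log", rank);
  setenv("NCCL_DEBUG", "TRACE", 1);        /* most verbose level */
  setenv("NCCL_DEBUG_SUBSYS", "INIT", 1);  /* restrict output to the init path */
  setenv("NCCL_DEBUG_FILE", fname, 1);     /* per-rank log file */
}
```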

Meanwhile, FYI, we are making the nonblocking API mode the default in PyTorch: https://github.com/pytorch/pytorch/issues/117749

There is one thing I'd like to confirm: in nonblocking mode, is it valid to issue the first collective (e.g. ncclAllReduce) while ncclCommInitRankConfig is still in progress, i.e. before ncclCommGetAsyncError has reported ncclSuccess?

If there are other things we should be aware of, we would appreciate your guidance too. Thanks!

sjeaugey commented 8 months ago

I'm not sure it is valid to call ncclAllReduce on a communicator which has not yet been initialized. In other words, I think that you are supposed to call ncclCommGetAsyncError until it returns ncclSuccess, then you are allowed to call ncclAllReduce. Otherwise, you would not be able to know which operation has failed with a given error (if NCCL ends up returning an error).

CC @KaimingOuyang in case I'm missing something.
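
For reference, a minimal sketch of that handshake (assuming the communicator was created nonblocking, i.e. with a ncclConfig_t whose blocking field is 0):

```c
#include <nccl.h>

/* Poll until the communicator either finishes initializing or fails.
 * Only after this returns ncclSuccess is it safe to issue ncclAllReduce. */
ncclResult_t waitForCommReady(ncclComm_t comm) {
  ncclResult_t state;
  do {
    ncclResult_t ret = ncclCommGetAsyncError(comm, &state);
    if (ret != ncclSuccess) return ret;
    /* a real application could yield, sleep, or check an abort flag here */
  } while (state == ncclInProgress);
  return state;  /* ncclSuccess, or the error that ended initialization */
}
```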

KaimingOuyang commented 8 months ago

Yes. You have to wait until the comm is fully initialized; otherwise ncclAllReduce would just fail.

kwen2501 commented 8 months ago

Hmm, interesting. I thought that in nonblocking mode, NCCL errors are allowed to be "sticky" (like CUDA errors), which would mean ncclAllReduce can return errors originating from ncclCommInitRankConfig.

I can add a manual ncclCommGetAsyncError loop before the first collective call. But in general this may not be enough, since one cannot always know which NCCL call will be the first on a communicator.

I am curious what would make a better user interface. An alternative to a user-side check is for ncclAllReduce to perform the ncclCommGetAsyncError check internally: if ncclAllReduce has a side thread, it could block there, while the main thread returns ncclInProgress or whatever error has been observed so far.

The goal here is to allow the main thread to go ahead.
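
A user-side wrapper can approximate that behavior today; the sketch below only illustrates the proposed semantics (returning ncclInProgress instead of blocking), it is not how ncclAllReduce behaves currently:

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Issue the allreduce only once the communicator is ready; otherwise hand
 * ncclInProgress (or a sticky init error) back to the caller so the main
 * thread can keep going and retry later. */
ncclResult_t allReduceOrDefer(const void* sendbuff, void* recvbuff, size_t count,
                              ncclDataType_t dtype, ncclRedOp_t op,
                              ncclComm_t comm, cudaStream_t stream) {
  ncclResult_t state;
  ncclResult_t ret = ncclCommGetAsyncError(comm, &state);
  if (ret != ncclSuccess) return ret;
  if (state == ncclInProgress) return ncclInProgress;  /* init still running */
  if (state != ncclSuccess) return state;              /* surface sticky init error */
  return ncclAllReduce(sendbuff, recvbuff, count, dtype, op, comm, stream);
}
```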