NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Pack bootstrapAllGather's [2] #1156

Open kwen2501 opened 8 months ago

kwen2501 commented 8 months ago

The bootstrapAllGather call at https://github.com/NVIDIA/nccl/blob/b6d7438d3145a619f924dbbca6c96db21fab716e/src/init.cc#L617 sits inside a for loop: https://github.com/NVIDIA/nccl/blob/b6d7438d3145a619f924dbbca6c96db21fab716e/src/init.cc#L599

To reduce the number of bootstrapAllGather calls at scale, would it be better to break out of the for loop at bootstrapAllGather, pack the data from all iterations into a single bootstrapAllGather, and re-enter the for loop afterwards?
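
A minimal sketch of the packing idea (this is not NCCL code; `allgather` below is a hypothetical stand-in for bootstrapAllGather, assumed to exchange a fixed number of bytes per rank, and the payload fields are placeholders):

```c
#include <string.h>

/* Placeholder for the per-iteration payload that the loop currently exchanges
 * with one bootstrapAllGather per iteration (fields are illustrative only). */
typedef struct { int ringPrev, ringNext, pad[2]; } Payload;

/* Hypothetical wrapper standing in for bootstrapAllGather: gathers
 * bytesPerRank bytes from every rank into buf (buf holds nranks slots). */
extern int allgather(void* buf, int bytesPerRank);

/* Pack every iteration's payload into one slab and exchange it once.
 * `global` must have room for nranks * iters entries. */
int packedAllGather(const Payload* local, Payload* global, int iters, int rank) {
  int bytesPerRank = iters * (int)sizeof(Payload);
  memcpy((char*)global + rank * bytesPerRank, local, bytesPerRank);
  return allgather(global, bytesPerRank);   /* one call instead of `iters` calls */
}
```

The for loop would then be re-entered and read its per-iteration data from the packed result instead of issuing its own allgather.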

sjeaugey commented 8 months ago

It would be interesting to benchmark bootstrapAllGather at scale and see what order of magnitude we should expect depending on the number of ranks. That way we'd have a good understanding of potential issues at scale (or not). Do you anticipate (or even experience) that at scale, the bootstrap allgather time grows so much that it becomes a significant portion of the init time?
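
Since bootstrapAllGather is not callable from user code, a rough proxy would be to time ncclCommInitRank per rank at different scales; a minimal sketch, assuming the usual out-of-band broadcast of the ncclUniqueId is already in place:

```c
#include <nccl.h>
#include <stdio.h>
#include <time.h>

static double nowSec(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Times the whole init; the bootstrap allgathers are only one component of
 * it, so this gives an upper bound on their contribution. */
ncclResult_t timedCommInit(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  double t0 = nowSec();
  ncclResult_t res = ncclCommInitRank(comm, nranks, id, rank);
  printf("rank %d/%d: ncclCommInitRank took %.3f s (result %d)\n",
         rank, nranks, nowSec() - t0, (int)res);
  return res;
}
```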

kwen2501 commented 8 months ago

Thanks @sjeaugey. We are seeing a 2.5-minute init time at large scale (e.g. beyond some thousands of GPUs). We would like to see if there is a way to drive this down so that the infrastructure's "idle" time is as small as possible. This "idling" may occur at the initial training launch or when training is restarted (e.g. to recover from a GPU failure, the rate of which is non-negligible at large scale).

The 2.5-min init time is from the default init path, so indeed neither #1155 nor #1156 would help with it. I am just checking those in, in case a user triggers ncclCommSplit (#1155) or SHARP (#1156).

That said, it would be much appreciated if the init time of the default path could be minimized (that is the more important case). One thought: could the user give NCCL a guarantee (e.g. via an env var) that all GPUs' detection results will be homogeneous, so that the following code could be based on the local detection result alone? https://github.com/NVIDIA/nccl/blob/b6d7438d3145a619f924dbbca6c96db21fab716e/src/init.cc#L1027-L1034
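
Purely to illustrate that idea (the env var name and struct fields below are invented for the sketch, not existing NCCL knobs): if the user asserted homogeneity, the post-allgather combination step could trust the local detection result instead of merging results from all ranks.

```c
#include <stdlib.h>

/* Illustrative subset of the per-rank detection result carried by
 * allGather3Data (field names are placeholders, not the actual struct). */
struct GraphInfo { int nChannels; int sameChannels; };

void combineGraphInfo(struct GraphInfo* out, const struct GraphInfo* all,
                      int nranks, int rank) {
  if (getenv("NCCL_ASSUME_HOMOGENEOUS_TOPO") != NULL) {  /* hypothetical knob */
    *out = all[rank];  /* user guarantees every rank detected the same thing */
    return;
  }
  /* Otherwise combine across ranks, in the spirit of the existing loop over
   * allGather3Data (keeping the most conservative value per field). */
  *out = all[0];
  for (int r = 1; r < nranks; r++) {
    if (all[r].nChannels < out->nChannels) out->nChannels = all[r].nChannels;
    if (all[r].sameChannels < out->sameChannels) out->sameChannels = all[r].sameChannels;
  }
}
```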

Of course, the bootstrapAllGather of allGather3Data includes not only graphInfo but also topoRanks, so we cannot simply drop this allgather. It may seem like topoRanks could be all-gathered by the node heads only, but that would require an inter-node bootstrap comm. I am not sure it is worth it, since creating the inter-node comm also takes time.

Just wild thoughts. The team may have better ideas :) Thanks!

sjeaugey commented 8 months ago

Ok, thanks for the feedback. 2.5 minutes indeed seems large to me, and it would be good to have some sort of timing statistics throughout ncclCommInit to find out what is taking time and how much it grows as we scale.

Concretely, the main reasons for time spent in ncclCommInit that I can think of are:

So I'd really be curious to confirm what phase is growing in time as we scale. There is probably something I'm missing.

kwen2501 commented 8 months ago

Thanks @sjeaugey for the breakdown. We will try to gather detailed timing statistics (using TRACE=1) and report back.
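
For what it's worth, a sketch of how we plan to capture those statistics (this assumes NCCL is built with `make TRACE=1` so that the TRACE debug level actually emits output; the env vars are existing NCCL knobs, but the helper itself is just illustrative and must run before the first NCCL call):

```c
#include <stdio.h>
#include <stdlib.h>

/* Route verbose init-phase logging to one file per rank. */
void enableInitTracing(int rank) {
  char fname[64];
  snprintf(fname, sizeof(fname), "nccl-init-trace.rank%d.log", rank);
  setenv("NCCL_DEBUG", "TRACE", 1);        /* most verbose level */
  setenv("NCCL_DEBUG_SUBSYS", "INIT", 1);  /* restrict output to the init path */
  setenv("NCCL_DEBUG_FILE", fname, 1);     /* per-rank log file */
}
```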

Meanwhile, FYI, we are making the nonblocking API mode the default in PyTorch: https://github.com/pytorch/pytorch/issues/117749

There is one thing I'd like to confirm: in nonblocking mode, is it valid to issue the first collective (e.g. ncclAllReduce) while ncclCommInitRankConfig is still in progress, i.e. before ncclCommGetAsyncError has reported ncclSuccess?

If there are other things we should be aware of, we would appreciate your guidance too. Thanks!

sjeaugey commented 8 months ago

I'm not sure it is valid to call ncclAllReduce on a communicator which has not yet been initialized. In other words, I think that you are supposed to call ncclCommGetAsyncError until it returns ncclSuccess, then you are allowed to call ncclAllReduce. Otherwise, you would not be able to know which operation has failed with a given error (if NCCL ends up returning an error).

CC @KaimingOuyang in case I'm missing something.
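
For reference, a minimal sketch of that handshake (assuming the communicator was created nonblocking, i.e. with a ncclConfig_t whose blocking field is 0):

```c
#include <nccl.h>

/* Poll until the communicator either finishes initializing or fails.
 * Only after this returns ncclSuccess is it safe to issue ncclAllReduce. */
ncclResult_t waitForCommReady(ncclComm_t comm) {
  ncclResult_t state;
  do {
    ncclResult_t ret = ncclCommGetAsyncError(comm, &state);
    if (ret != ncclSuccess) return ret;
    /* a real application could yield, sleep, or check an abort flag here */
  } while (state == ncclInProgress);
  return state;  /* ncclSuccess, or the error that ended initialization */
}
```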

KaimingOuyang commented 8 months ago

Yes. You have to wait until the comm is fully initialized; otherwise ncclAllReduce would just fail.

kwen2501 commented 8 months ago

Hmm, interesting. I thought that in nonblocking mode, NCCL errors are allowed to be "sticky" (like CUDA errors), which would mean ncclAllReduce can return errors originating from ncclCommInitRankConfig.

I can add a manual ncclCommGetAsyncError loop before the first collective call. But in general this may not be enough, since one cannot always know which NCCL call will be the first on a communicator.

I am curious what would make a better user interface. An alternative to a user-side check is for ncclAllReduce to perform the ncclCommGetAsyncError check internally: if ncclAllReduce has a side thread, it could block there, while the main thread returns ncclInProgress or whatever error has been observed so far.

The goal here is to allow the main thread to go ahead.
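
A user-side wrapper can approximate that behavior today; the sketch below only illustrates the proposed semantics (returning ncclInProgress instead of blocking), it is not how ncclAllReduce behaves currently:

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Issue the allreduce only once the communicator is ready; otherwise hand
 * ncclInProgress (or a sticky init error) back to the caller so the main
 * thread can keep going and retry later. */
ncclResult_t allReduceOrDefer(const void* sendbuff, void* recvbuff, size_t count,
                              ncclDataType_t dtype, ncclRedOp_t op,
                              ncclComm_t comm, cudaStream_t stream) {
  ncclResult_t state;
  ncclResult_t ret = ncclCommGetAsyncError(comm, &state);
  if (ret != ncclSuccess) return ret;
  if (state == ncclInProgress) return ncclInProgress;  /* init still running */
  if (state != ncclSuccess) return state;              /* surface sticky init error */
  return ncclAllReduce(sendbuff, recvbuff, count, dtype, op, comm, stream);
}
```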