kwen2501 opened 8 months ago
It would be interesting to benchmark bootstrapAllGather at scale and see what order of magnitude we should expect depending on the number of ranks. That way we'd have a good understanding of potential issues at scale (or not). Do you anticipate (or even experience) that at scale, the bootstrap allgather time grows so much that it becomes a significant portion of the init time?
Thanks @sjeaugey. We are experiencing 2.5-minute init times at large scale (e.g. above certain K GPUs). We would like to see if there is a way to drive this down so that the infrastructure's "idle" time becomes as small as possible. This "idling" may occur at initial training launch or on restart of training (e.g. to recover from a GPU failure, the rate of which is non-negligible at large scale).
The 2.5-min init time is from the default init path. So indeed neither #1155 nor #1156 would help. Just mentioning those in case a user would trigger ncclCommSplit (#1155) or SHARP (#1156).
But it would be much appreciated if the init time of the default path could be minimized (more important). One thought: can the user provide a guarantee to NCCL (via an env var, for example) that topology detection results are homogeneous across all GPUs, so that the following code can rely on the local detection result alone? https://github.com/NVIDIA/nccl/blob/b6d7438d3145a619f924dbbca6c96db21fab716e/src/init.cc#L1027-L1034
Of course, the bootstrapAllgather of allGather3Data not only includes graphInfo but also topoRanks, so we cannot simply delete this allgather. It may seem like topoRanks could be all-gathered by the head ranks of each node only? But that would require an inter-node bootstrap comm, and I'm not sure it is worth it, since creating the inter-node comm also takes time.
Just wild thoughts. The team may have better ideas :) Thanks!
Ok, thanks for the feedback. 2.5 minutes seems indeed large to me, and it would be good to have some sort of timing statistics throughout ncclCommInit to find out what is taking time and how much it grows as we scale.
Concretely, the main reasons for time spent in ncclCommInit I can think of are:
So I'd really be curious to confirm what phase is growing in time as we scale. There is probably something I'm missing.
Thanks @sjeaugey for the breakdown.
We will try to gather detailed timing statistics (using TRACE=1) and report back.
Meanwhile, FYI, we are making the nonblocking API mode the default in PyTorch: https://github.com/pytorch/pytorch/issues/117749
There is one thing I'd like to confirm: we do not plan to wrap ncclCommInitRankConfig with a timeout-check loop. Instead, we will wrap the first collective. Is it a correct understanding that the first collective would keep returning ncclInProgress if the comm isn't ready at that point? (We would like to allow the CPU to go ahead while not losing fault tolerance.) If there are other things we should be aware of, we would appreciate your guidance too. Thanks!
I'm not sure it is valid to call ncclAllReduce on a communicator which has not yet been initialized. In other words, I think you are supposed to call ncclCommGetAsyncError until it returns ncclSuccess, and only then are you allowed to call ncclAllReduce. Otherwise, you would not be able to know which operation has failed with a given error (if NCCL ends up returning an error).
CC @KaimingOuyang in case I'm missing something.
Yes. You have to wait until the comm is fully initialized; otherwise ncclAllreduce would just fail.
Hmm, interesting. I thought that in nonblocking mode, NCCL errors are allowed to be "sticky" (like CUDA), which would mean ncclAllreduce can return errors from ncclCommInitRankConfig.
I can add a manual ncclCommGetAsyncError loop before the first collective call. But in general, this may not be enough, as no one knows what the first NCCL call would be.
I am curious what makes a better UI. An alternative to a user-side check would be for ncclAllreduce to perform the ncclCommGetAsyncError check internally. If ncclAllreduce has a side thread, it can block on that, while the main thread's call would keep returning ncclInProgress or any error observed so far.
The goal here is to allow the main thread to go ahead.
https://github.com/NVIDIA/nccl/blob/b6d7438d3145a619f924dbbca6c96db21fab716e/src/init.cc#L617 is called in a for loop: https://github.com/NVIDIA/nccl/blob/b6d7438d3145a619f924dbbca6c96db21fab716e/src/init.cc#L599
To reduce the number of bootstrapAllGather calls at scale, would it be better to break out of the for loop at the bootstrapAllGather point, pack the data from all loop iterations into a single bootstrapAllGather, and re-enter the for loop afterwards?