Open rohany opened 2 months ago
@rohany Thanks for the info! We will change to use concurrent tasks. To questsions:
checkCUDA(cudaDeviceSynchronize())
after runtime->concurrent_task_barrier(ctx)
?Why you added a checkCUDA(cudaDeviceSynchronize()) after runtime->concurrent_task_barrier(ctx)?
I think that this is unnecessary, i had added it a while ago and never ended up removing it. The ground truth for what to do is Mike's example here: https://gitlab.com/StanfordLegion/legion/-/blob/master/examples/nccl/nccl_legion_demo.cu?ref_type=heads.
Were you able to resolve the hanging issue after the changes?
yes, i was able to scale to 32 GPUs without hangs.
Flexflow is currently protecting all tasks with NCCL with an execution fence, which is correct, but unnecessary. Concurrent tasks do not need a fence, but must register themselves as concurrent, and use the concurrent barrier before and after the allreduce. This enables some overlap of the NCCL operations with other GPU operations without violating NCCL rules.
See some of these commits for details: https://github.com/rohany/FlexFlow/commit/cc3f0e00134b0a39c9f667b4f98f0f1250d10e89 https://github.com/rohany/FlexFlow/commit/ed08e74a4c3c285d09b8bf3a74fcba6728f4c05a https://github.com/rohany/FlexFlow/commit/2c1ceb2b34f2d3813a767d4c279f59e28b7e6e56