FlexFlow should correctly use concurrent tasks

rohany commented 2 months ago

Flexflow is currently protecting all tasks with NCCL with an execution fence, which is correct, but unnecessary. Concurrent tasks do not need a fence, but must register themselves as concurrent, and use the concurrent barrier before and after the allreduce. This enables some overlap of the NCCL operations with other GPU operations without violating NCCL rules.

See some of these commits for details: https://github.com/rohany/FlexFlow/commit/cc3f0e00134b0a39c9f667b4f98f0f1250d10e89 https://github.com/rohany/FlexFlow/commit/ed08e74a4c3c285d09b8bf3a74fcba6728f4c05a https://github.com/rohany/FlexFlow/commit/2c1ceb2b34f2d3813a767d4c279f59e28b7e6e56

jiazhihao commented 2 months ago

@rohany Thanks for the info! We will change to use concurrent tasks. To questsions:

Why you added a checkCUDA(cudaDeviceSynchronize()) after runtime->concurrent_task_barrier(ctx)?
Were you able to resolve the hanging issue after the changes?

rohany commented 2 months ago

Why you added a checkCUDA(cudaDeviceSynchronize()) after runtime->concurrent_task_barrier(ctx)?

I think that this is unnecessary, i had added it a while ago and never ended up removing it. The ground truth for what to do is Mike's example here: https://gitlab.com/StanfordLegion/legion/-/blob/master/examples/nccl/nccl_legion_demo.cu?ref_type=heads.

Were you able to resolve the hanging issue after the changes?

yes, i was able to scale to 32 GPUs without hangs.

flexflow / FlexFlow

FlexFlow should correctly use concurrent tasks #1378