ROCm / rccl-tests

RCCL Performance Benchmark Tests

Multi-GPU Support with External Pinning #42

Open frobnitzem opened 1 year ago

frobnitzem commented 1 year ago

In my HPC environment, srun pins MPI ranks to specific cores and GPUs (by setting ROCR_VISIBLE_DEVICES). However, this conflicts with rccl-tests, which tries to select GPUs manually based on the MPI rank.

I have fixed this in my own build (https://github.com/frobnitzem/rccl-tests/commit/5b347ee66b2e86f1ed2e9affc13d8d562cada1d0) by always applying the step gpuid = gpuid % args->localNumDevices, regardless of whether args->enable_multiranks is true.
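For readers who want to see why the unconditional modulo helps, here is a minimal sketch of the remapping. It is not the rccl-tests source: the helper name wrap_gpu_id, its parameters, and the way the raw index is built from the local rank, thread, and GPU slot are assumptions made for illustration. Only the final step, gpuid % localNumDevices, comes from the description above.

```cpp
#include <cassert>

// Hypothetical helper (not part of rccl-tests): derive a device index from
// the rank layout, then wrap it onto the devices this process can see.
// The raw index formula below is an assumption made for illustration.
int wrap_gpu_id(int local_rank, int n_threads, int n_gpus,
                int thread, int gpu, int local_num_devices) {
    assert(local_num_devices > 0);  // at least one device must be visible

    // Rank-derived index: valid only if every process sees every GPU.
    int gpuid = local_rank * n_threads * n_gpus + thread * n_gpus + gpu;

    // Always wrap, so the index stays valid when an external launcher
    // (e.g. srun setting ROCR_VISIBLE_DEVICES) exposes fewer devices.
    return gpuid % local_num_devices;
}
```

With one thread and one GPU per rank, a rank with local index 3 that has been pinned to a single visible device gets wrap_gpu_id(3, 1, 1, 0, 0, 1) == 0, the only valid index in that process. When all eight devices of a node are visible, the same rank keeps index 3, so unpinned runs behave as before.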

I suggest adopting this change and reverting the update in https://github.com/ROCmSoftwarePlatform/rccl-tests/commit/d16d1fb16b2abe1c1c88464097e6f1d8070d1116, which instead throws an error in this case.

huanrwan-amd commented 21 hours ago

Hi @frobnitzem, sorry for the late reply, and thank you for your contribution. If possible, could you raise a PR for this change? It would be useful as an auto-correction. For changes to the rccl-tests code, we would like some input from the team. Here is the list of reviewers: https://github.com/orgs/ROCm/teams/rccl-reviewers