Open frobnitzem opened 1 year ago
Hi @frobnitzem, sorry for the late reply, and thank you for your contribution. If possible, could you raise a PR for this change? It would be useful for an auto-correction. For changes to the rccl-tests code, we would like some input from the team. Here is the list of reviewers: https://github.com/orgs/ROCm/teams/rccl-reviewers
In my HPC environment, srun pins MPI ranks to specific cores and GPUs (by setting ROCR_VISIBLE_DEVICES). However, this conflicts with rccl-tests, which tries to select GPUs manually based on the MPI rank.
I have fixed this in my own build (https://github.com/frobnitzem/rccl-tests/commit/5b347ee66b2e86f1ed2e9affc13d8d562cada1d0) by always running the step `gpuid = gpuid % args->localNumDevices`, regardless of whether `args->enable_multiranks` is true or not.

I suggest adopting this change and reverting the update https://github.com/ROCmSoftwarePlatform/rccl-tests/commit/d16d1fb16b2abe1c1c88464097e6f1d8070d1116, which throws an error in this case instead.
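
For reviewers, here is a minimal sketch of the proposed behavior. It is not the actual rccl-tests code: the `Args` struct and the `wrapGpuId` helper are hypothetical stand-ins, and only the field names `localNumDevices` and `enable_multiranks` come from the description above. It assumes `localNumDevices` holds the number of devices visible to the process and that the computed `gpuid` is non-negative.

```cpp
#include <cassert>

// Hypothetical stand-in for the test arguments (assumption, not the real struct).
struct Args {
  int localNumDevices;     // number of devices visible to this process
  bool enable_multiranks;  // intentionally not consulted by wrapGpuId
};

// Always wrap the rank-derived index into the range of visible devices,
// instead of doing so only when enable_multiranks is set.
int wrapGpuId(int gpuid, const Args* args) {
  assert(gpuid >= 0);
  assert(args->localNumDevices > 0);
  return gpuid % args->localNumDevices;
}
```

Under these assumptions, when srun restricts each rank to a single device through ROCR_VISIBLE_DEVICES, `localNumDevices` is 1 and every rank resolves to device 0, which is the device srun assigned to it. When all devices are visible and the index is already in range, the result is unchanged.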