argonne-lcf / GettingStarted

Collection of small examples for running on ALCF resources

Fix GPU affinity assignment due to node topology. #13

Closed by ye-luo 1 year ago

ye-luo commented 1 year ago

This place was referenced in a few pages.

felker commented 1 year ago

I was actually completely unaware of this device affinity ordering until stumbling upon your PR, @ye-luo. I need to rerun my single-node performance tests on Polaris now, since it might make a significant difference in my applications. I suspect many other users are also still using

gpu=$((${PMI_LOCAL_RANK} % ${num_gpus}))

in their scripts. Even during the summer pre-production period, many people were using this ordering, e.g. @zippylab. When was the mix-up discovered? How unconventional is the current CPU core and GPU device affinity ordering? It seems strange to me to ask the user to manually flip the MPI local ranks on a node.
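
For illustration, here is a minimal sketch contrasting the round-robin mapping quoted above with a "flipped" mapping of the kind the comment alludes to. The function names, the `NUM_GPUS` variable, and the default of 4 GPUs per node are assumptions made for this example and are not taken from the PR itself. The correct ordering for a given system should be checked against its node topology.

```shell
#!/bin/bash
# Sketch only: naive_gpu, flipped_gpu, and NUM_GPUS are hypothetical names,
# and the 4-GPU default is an assumption for illustration.

# Round-robin mapping quoted above: local rank 0 -> GPU 0, 1 -> GPU 1, ...
naive_gpu() {
  local local_rank=$1 num_gpus=$2
  echo $(( local_rank % num_gpus ))
}

# Flipped mapping: local rank 0 -> highest-numbered GPU, counting downward.
flipped_gpu() {
  local local_rank=$1 num_gpus=$2
  echo $(( num_gpus - 1 - local_rank % num_gpus ))
}

num_gpus=${NUM_GPUS:-4}            # assumed GPUs per node
local_rank=${PMI_LOCAL_RANK:-0}    # set by the MPI launcher; 0 if unset

gpu=$(flipped_gpu "$local_rank" "$num_gpus")
export CUDA_VISIBLE_DEVICES=$gpu   # restrict this rank to one device
echo "local rank ${local_rank} -> GPU ${gpu}"
```

In a wrapper script of this shape, the final step would typically be `exec "$@"` so that each MPI rank launches the application with its own device visible. That line is omitted here so the sketch can be run on its own.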