PDoakORNL opened this issue 3 months ago
GPU binding of MPI processes is managed by the MPI launcher (mpirun/mpiexec); ctest is not aware of MPI capabilities. Due to the variety of MPI libraries and subtle differences among machine configurations, we don't handle GPU affinity. I have not seen a clean way to make MPI and ctest interact regarding GPU affinity.
We do have some control over logical processor binding via the PROCESSORS and PROCESSOR_AFFINITY test properties.
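As a minimal sketch of where those properties go (the test and executable names below are hypothetical, used only for illustration):

```cmake
# Hypothetical test name and executable, for illustration only.
add_test(NAME example_unit_test COMMAND example_unit_test_exe)

# Tell ctest this test uses 4 logical processors and ask it to pin the test
# to the processors it allocates (PROCESSOR_AFFINITY requires CMake >= 3.12).
set_tests_properties(example_unit_test PROPERTIES
                     PROCESSORS 4
                     PROCESSOR_AFFINITY ON)
```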
You may customize MPIRUN options to drive multiple GPUs, although this applies to every test which uses MPI.
I know of ways to let ctest assign tests to different GPUs, but I don't know a simple way to make ctest and MPI cooperate so that some jobs get more GPUs and others fewer.
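For reference, a hedged sketch of how MPI tests are commonly registered through the FindMPI variables, so that launcher options supplied at configure time (for example a GPU-binding wrapper in MPIEXEC_PREFLAGS) apply to every MPI test; the test and target names are assumptions, not this project's actual tests:

```cmake
find_package(MPI REQUIRED)

# MPIEXEC_EXECUTABLE, MPIEXEC_NUMPROC_FLAG, MPIEXEC_PREFLAGS and
# MPIEXEC_POSTFLAGS come from FindMPI; anything placed in MPIEXEC_PREFLAGS
# is inserted before the executable for every MPI test registered this way.
add_test(NAME example_mpi_test_np4
         COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 4
                 ${MPIEXEC_PREFLAGS} $<TARGET_FILE:example_mpi_exe>
                 ${MPIEXEC_POSTFLAGS})
```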
CUDA_VISIBLE_DEVICES should not be an issue. Each test can only drive one GPU no matter how many GPUs are exposed, unless your mpirun is smart enough to adjust CUDA_VISIBLE_DEVICES. When GPU features are on, we only run one test at a time to prevent over-subscription. I believe the issue you saw here is due to not using -j, so all the MPI processes run on a single core. Please try -j16.
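For context, one common way to serialize GPU tests while still running CPU-only tests in parallel is ctest's RESOURCE_LOCK test property; this is only an illustration with hypothetical test names, not necessarily how this project enforces it:

```cmake
# Tests that share a RESOURCE_LOCK name are never run concurrently,
# even under "ctest -j16", so the GPU is not over-subscribed while
# unrelated CPU-only tests still run in parallel.
set_tests_properties(example_gpu_test_1 example_gpu_test_2 PROPERTIES
                     RESOURCE_LOCK gpu)
```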
On our multi-GPU machines the resource locking in cmake/ctest appears to work successfully; e.g., in the nightlies we never use both cards on our dual MI210 machine.
It would be very beneficial to revisit the GPU resource locking so that we could use multi-GPU machines fully and correctly. Other projects solve this with some extra scripting so that cmake knows the correct number of GPUs, and visibility is then set via appropriate environment variables, etc.
=> Most likely something is "special" in your current software setup. (?)
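As a hedged illustration of that "extra scripting" approach, CTest's built-in resource allocation (a resource-spec JSON passed with ctest --resource-spec-file, plus the RESOURCE_GROUPS test property) can describe the machine's GPUs and hand each test an allocation; the file name and test name below are assumptions, not anything this project currently ships:

```cmake
# Hypothetical test name, for illustration only. Requesting one slot of the
# "gpus" resource lets ctest schedule at most <number of GPUs> such tests
# concurrently, given a resource spec JSON describing the GPUs
# (run with: ctest --resource-spec-file gpu_resources.json -j16).
set_tests_properties(example_gpu_test PROPERTIES
                     RESOURCE_GROUPS "gpus:1")

# At run time ctest exports CTEST_RESOURCE_GROUP_COUNT, CTEST_RESOURCE_GROUP_0
# and CTEST_RESOURCE_GROUP_0_GPUS (id:<n>,slots:1); a small launcher wrapper
# could translate that id into CUDA_VISIBLE_DEVICES for the test.
```

The spec file itself could be generated at configure time by probing the machine, which matches the "extra scripting" other projects reportedly use.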
Describe the bug
Don't export CUDA_VISIBLE_DEVICES before you run ctest. I ran ctest on a 16-GPU DGX server. It seems to run one copy of each test per GPU, at least once it gets to the new-driver MPI tests. It is unclear whether they interfere with each other, i.e. in terms of pass/fail. test_new_driver_mpi-rx seems to have another issue that makes it take many, many times longer to complete a test launched by ctest than run on its own with mpiexec on my machine.
This is pretty obnoxious default behavior; high-GPU-count nodes and workstations are pretty commonplace at this point. Is there anything we can do to have a default behavior that isn't quite so toxic? Perhaps pick a sane test launch setup at configure time and print a notice explaining what to do if some other setup is desired? I could see using all the GPUs, but why run an instance of the test per GPU?
Exploring this problem further I tried