QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

Toxic Ctest behavior on multi gpu systems #5130

Open PDoakORNL opened 3 months ago

PDoakORNL commented 3 months ago

**Describe the bug**
Don't export a CUDA_VISIBLE_DEVICES before you run ctest. I ran ctest on a 16-GPU DGX server, and it seems to run one copy of each test per GPU, at least once it gets to the new drivers MPI tests. It is unclear whether they interfere with each other, i.e. in terms of pass/fail counts. test_new_driver_mpi-rx also seems to have another issue that makes it take many times longer to complete when launched by ctest than when run on its own with mpiexec on my machine.

This is pretty obnoxious default behavior; high-GPU-count nodes and workstations are commonplace at this point. Is there anything we can do to have a default that isn't quite so toxic? Perhaps pick a sane test launch setup at configure time and print a notice explaining what to do if some other setup is desired. I could see using all the GPUs, but why run an instance of the test per GPU?
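For illustration, a minimal configure-time sketch of that suggestion, assuming nvidia-smi is available; this is not existing QMCPACK CMake, and the QMC_DETECTED_NUM_GPUS variable name is made up:

```cmake
# Hypothetical configure-time sketch (not existing QMCPACK CMake): count the
# GPUs once and tell the user what the test defaults will be.
find_program(NVIDIA_SMI_EXECUTABLE nvidia-smi)
if(NVIDIA_SMI_EXECUTABLE)
  execute_process(COMMAND ${NVIDIA_SMI_EXECUTABLE} --list-gpus
                  OUTPUT_VARIABLE _gpu_list
                  OUTPUT_STRIP_TRAILING_WHITESPACE)
  # Each line of --list-gpus starts with "GPU <index>:".
  string(REGEX MATCHALL "GPU [0-9]+:" _gpu_lines "${_gpu_list}")
  list(LENGTH _gpu_lines QMC_DETECTED_NUM_GPUS)   # made-up variable name
  message(STATUS "Detected ${QMC_DETECTED_NUM_GPUS} GPU(s). Export "
                 "CUDA_VISIBLE_DEVICES before running ctest to restrict the set.")
endif()
```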

Exploring this problem further, I tried:


```
export CUDA_VISIBLE_DEVICES=9,10,11,12
ctest -R 'unit.*' -V
```
And this is what I saw while the r2 test was running:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    9   N/A  N/A   1536602      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|    9   N/A  N/A   1536603      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   10   N/A  N/A   1536602      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   10   N/A  N/A   1536603      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|   11   N/A  N/A   1536602      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   11   N/A  N/A   1536603      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   12   N/A  N/A   1536602      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   12   N/A  N/A   1536603      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
+-----------------------------------------------------------------------------------------+
and during the r3 test
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    9   N/A  N/A   1537618      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|    9   N/A  N/A   1537619      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|    9   N/A  N/A   1537620      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   10   N/A  N/A   1537618      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   10   N/A  N/A   1537619      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|   10   N/A  N/A   1537620      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   11   N/A  N/A   1537618      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   11   N/A  N/A   1537619      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   11   N/A  N/A   1537620      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|   12   N/A  N/A   1537618      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   12   N/A  N/A   1537619      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   12   N/A  N/A   1537620      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
+-----------------------------------------------------------------------------------------+
...
etc.

with
```
export CUDA_VISIBLE_DEVICES=1
```
during the r4 test:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    9   N/A  N/A   1543180      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|    9   N/A  N/A   1543181      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|    9   N/A  N/A   1543182      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|    9   N/A  N/A   1543183      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
+-----------------------------------------------------------------------------------------+

This is related to #3879 I think.

**To Reproduce**
Steps to reproduce the behavior:
Don't export a restricting CUDA_VISIBLE_DEVICES when many GPUs are available.

**Expected behavior**
Even if many GPUs are available, at the most aggressive don't use more GPUs than you have ranks.

**System:**
 - sdgx-server.sns.gov

**Additional context**
spack openmpi 4.2.1
clang 18
ye-luo commented 3 months ago

GPU binding of MPI processes is managed by the MPI launcher (mpirun/mpiexec); ctest is not aware of MPI at all. Due to the variety of MPI libraries and subtle differences among machine configurations, we don't handle GPU affinity. I have not seen a clean way to make MPI and ctest interact regarding GPU affinity.

We do have some CPU logical-processor binding control via the PROCESSORS and PROCESSOR_AFFINITY test properties.
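For reference, a minimal sketch of how those two built-in CTest properties are typically set; the test name and counts below are placeholders, not QMCPACK's actual values:

```cmake
# Placeholder test name and counts, purely to illustrate the two properties.
set_tests_properties(unit_test_new_drivers_mpi PROPERTIES
                     PROCESSORS 3           # reserve 3 logical processors
                     PROCESSOR_AFFINITY ON) # pin the test to the reserved CPUs
# With these set, "ctest -j 16" schedules tests so that the sum of reserved
# PROCESSORS never exceeds the parallel level.
```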

You may customize the MPIRUN options to drive multiple GPUs, although this is applied to every test which uses MPI.

I know ways to allow ctest to assign tests to different GPUs, but I don't know a simple way to make ctest and MPI cooperate so that some jobs are assigned more GPUs and some fewer.
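One such way is CTest's resource-allocation feature (CMake 3.16+). A hedged sketch, with a made-up spec file name and a hypothetical test name, of how it could look:

```cmake
# Sketch of CTest resource allocation; file and test names are illustrative
# only. First, describe the machine's GPUs in a JSON resource spec file
# (this could be generated at configure time once the GPU count is known):
file(WRITE ${CMAKE_BINARY_DIR}/gpu_resources.json [[
{ "version": { "major": 1, "minor": 0 },
  "local": [ { "gpus": [ { "id": "0", "slots": 1 },
                         { "id": "1", "slots": 1 } ] } ] }
]])
# Then declare how many GPUs a test needs; ctest will not oversubscribe them.
set_property(TEST unit_test_new_drivers_mpi
             PROPERTY RESOURCE_GROUPS "2,gpus:1")  # two groups of one GPU each
# Run with:  ctest --resource-spec-file gpu_resources.json -j 16
# Note ctest only does the bookkeeping; the test (or a wrapper) still has to
# read the CTEST_RESOURCE_GROUP_* environment variables and set
# CUDA_VISIBLE_DEVICES itself.
```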

ye-luo commented 3 months ago

CUDA_VISIBLE_DEVICES should not be an issue. Each test can only drive one GPU no matter how many GPUs are exposed, unless your mpirun is smart and adjusts CUDA_VISIBLE_DEVICES. When GPU features are on, we only run one test at a time to prevent over-subscription. I believe the issue you saw here is due to not using -j, which leaves all the MPI processes running on a single core. So please try -j16.

prckent commented 3 months ago

On our multi-GPU machines the resource locking in cmake/ctest appears to work successfully, e.g. in the nightlies we never use both cards on our dual-MI210 machine.

It would be very beneficial to revisit the GPU resource locking so that we could use multi-GPU machines fully and correctly. Other projects solve this with some extra scripting so that cmake knows the correct number of GPUs, and visibility is then set via the appropriate environment variables, etc.
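A sketch of that kind of extra scripting, assuming a resource-group setup like the one sketched above: a hypothetical wrapper (gpu_test_wrapper.cmake, run via cmake -P as the test COMMAND) that translates ctest's resource-group environment variables into CUDA_VISIBLE_DEVICES; TEST_COMMAND is a made-up variable passed in with -D, not part of QMCPACK:

```cmake
# gpu_test_wrapper.cmake -- hypothetical wrapper, run as the test COMMAND via
# "cmake -DTEST_COMMAND=... -P gpu_test_wrapper.cmake". It converts the
# resource groups ctest hands out into CUDA_VISIBLE_DEVICES and then launches
# the real test command.
set(_gpu_ids "")
if(DEFINED ENV{CTEST_RESOURCE_GROUP_COUNT})
  math(EXPR _last_group "$ENV{CTEST_RESOURCE_GROUP_COUNT} - 1")
  foreach(_g RANGE ${_last_group})
    # ctest exports e.g. CTEST_RESOURCE_GROUP_0_GPUS=id:3,slots:1
    string(REGEX MATCH "id:([0-9]+)" _unused "$ENV{CTEST_RESOURCE_GROUP_${_g}_GPUS}")
    list(APPEND _gpu_ids "${CMAKE_MATCH_1}")
  endforeach()
  list(JOIN _gpu_ids "," _visible)
  set(ENV{CUDA_VISIBLE_DEVICES} "${_visible}")
endif()
# Launch the actual test command (e.g. an mpiexec line) with the adjusted env.
execute_process(COMMAND ${TEST_COMMAND} RESULT_VARIABLE _rc)
if(NOT _rc EQUAL 0)
  message(FATAL_ERROR "wrapped test exited with status ${_rc}")
endif()
```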

=> Most likely something is "special" in your current software setup. (?)