XENONnT / utilix

Package for XENON members to easily interface with RunDB, Midway batch queue, maybe Rucio (others?).

Add more partitions and update unit test #116

Closed · yuema137 closed this 2 months ago

yuema137 commented 2 months ago

As @shenyangshi mentioned in #115, there are more possible partitions than we previously listed in utilix.batchq, and the actually available ones, bigmem2 and gpu2, are added here (see Yue's response in #115).

In addition, the unit test for batchq has been updated to ensure that all of the partitions listed here are actually supported. I have tested on midway2, midway3, and dali, and all tests passed.
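(For illustration only, not the PR's actual test: a parametrized check along these lines would submit a trivial job to every listed partition; the submit_job helper and its partition/qos arguments are assumptions about utilix.batchq.)

```python
import pytest
from utilix import batchq

# bigmem2 and gpu2 are the newly added partitions; extend this list
# with the previously supported ones.
PARTITIONS = ['bigmem2', 'gpu2']

@pytest.mark.parametrize('partition', PARTITIONS)
def test_submit_job(partition):
    # Assumed helper: submit a trivial job and let any SLURM error surface.
    batchq.submit_job('echo test', partition=partition, qos=partition)
```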

shenyangshi commented 2 months ago

Thanks Yue, this is really helpful!

The PR generally looks good, but the GPU case can be tricky: in sbatch we need to explicitly request generic resources like GPUs with #SBATCH --gres=gpu:1 and initialize CUDA with module load cuda in the batch script as well, like in env_starter; simple_slurm can handle this input too.
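(A minimal sketch of that wiring with simple_slurm; the job name, output file, and payload script are placeholders, and it assumes the cluster exposes a cuda module.)

```python
from simple_slurm import Slurm

# Request one GPU through the generic-resources mechanism,
# equivalent to the "#SBATCH --gres=gpu:1" header.
slurm = Slurm(
    job_name='gpu_test',      # placeholder name
    partition='gpu2',
    gres='gpu:1',
    output='gpu_test.out',
)

# Initialize CUDA inside the batch script before the payload runs.
slurm.add_cmd('module load cuda')

# Placeholder payload; replace with the real workload.
slurm.sbatch('python my_gpu_script.py')
```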

I tried starting a Jupyter notebook with a GPU on gpu2 using the env_starter branch: I can start a job with 28 CPUs, but no GPU access is given. Maybe we are no longer allocated midway GPUs? If that's the case, we don't need to implement my GPU gres comment and can merge the PR directly.

yuema137 commented 2 months ago

@shenyangshi So if I understand correctly, we haven't successfully used a GPU even on the gpu2 partition?

shenyangshi commented 2 months ago

I think originally we could access gpu2 (see the Slack message from Andrii), but now I'm not sure; I haven't successfully used it myself.

yuema137 commented 2 months ago

@shenyangshi I just tried with sbatch directly and the GPU worked, so something is probably missing in env_starter. I will fix it and also add the GPU header here.
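(For reference, a sketch of that kind of direct check: submit nvidia-smi with the GPU header and read the job's output file; the names here are placeholders.)

```python
from simple_slurm import Slurm

# Submit nvidia-smi on gpu2 with an explicit GPU request; if the job
# output lists a device, the allocation actually grants GPU access.
slurm = Slurm(
    job_name='check_gpu',
    partition='gpu2',
    gres='gpu:1',
    output='check_gpu.out',   # inspect this file after the job runs
)
slurm.add_cmd('module load cuda')
slurm.sbatch('nvidia-smi')
```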

shenyangshi commented 2 months ago

Sounds good, thanks

yuema137 commented 2 months ago

@shenyangshi After diving into this, I realized that it's not trivial to set up the GPUs for utilix and env_starter. The reason is that on fried rice the NVIDIA driver is much newer, and so is the CUDA version (12.4):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN V                 Off |   00000000:86:00.0 Off |                  N/A |
| 28%   38C    P8             26W /  250W |    1992MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA TITAN V                 Off |   00000000:AF:00.0 Off |                  N/A |
| 28%   36C    P8             25W /  250W |    2152MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Therefore, tensorflow was upgraded to 2.15 to be compatible with the fried rice GPUs. However, on midway the NVIDIA driver is quite outdated (very likely no longer maintained) and doesn't meet the requirements of tensorflow 2.15. So if we really want this resource, we need additional work to reconcile the driver and tensorflow versions.
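(As a quick sanity check, this sketch prints the CUDA version a tensorflow build was compiled against and whether the runtime can see any GPU; the cuda_version build-info key is only present in GPU-enabled wheels.)

```python
import tensorflow as tf

# TF 2.15 wheels are built against CUDA 12.x, which in turn requires a
# sufficiently new NVIDIA driver; an outdated driver leaves the GPU list empty.
print(tf.__version__)
print(tf.sysconfig.get_build_info().get('cuda_version'))
print(tf.config.list_physical_devices('GPU'))
```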

shenyangshi commented 2 months ago

Thanks Yue for the hard work and detailed check! I totally agree we can use it as a CPU-only node now.