Thanks Yue, this is really helpful!
The PR generally looks good, but GPUs can be tricky: in sbatch we need to explicitly request generic resources like a GPU with #SBATCH --gres=gpu:1, and also initialize CUDA with module load cuda in the batch script, like in env_starter. simple_slurm can handle this input as well.
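For reference, a minimal GPU batch script along those lines could look like the sketch below (the partition name, resource numbers, and job name are illustrative, not the actual env_starter settings):

```bash
#!/bin/bash
#SBATCH --job-name=gpu_test          # illustrative job name
#SBATCH --partition=gpu2             # GPU partition discussed in this thread
#SBATCH --gres=gpu:1                 # explicitly request one GPU
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

# load the CUDA module before running anything that needs the GPU
module load cuda

# confirm the GPU is actually visible inside the job
nvidia-smi
```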
I tried starting a jupyter notebook with a GPU on gpu2 using the env_starter branch: I can start a job with 28 CPUs, but no GPU access is given. Maybe we are no longer allocated midway GPUs? If that's the case, we don't need to implement my GPU gres comment and can merge the PR directly.
@shenyangshi So if I understand correctly, we haven't successfully used a GPU even on the gpu2 partition?
I think we could originally access gpu2 (see the Slack message from Andrii), but now I'm not sure; I haven't successfully used it.
@shenyangshi I just tried with sbatch directly and the GPU worked, so there is probably something missing in env_starter. I will fix it and also add the GPU header here.
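Roughly, a one-line check of this kind (exact options may differ from what I actually ran) is enough to see whether a GPU gets allocated:

```bash
# submit a throwaway job that just asks for a GPU and prints what it sees
sbatch --partition=gpu2 --gres=gpu:1 --wrap="module load cuda; nvidia-smi"
```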
Sounds good, thanks
@shenyangshi After diving into this, I realized that it's not trivial to set up the GPUs for utilix and env_starter. The reason is that on fried rice the NVIDIA driver is much newer, and so is the CUDA version (12.4):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA TITAN V Off | 00000000:86:00.0 Off | N/A |
| 28% 38C P8 26W / 250W | 1992MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA TITAN V Off | 00000000:AF:00.0 Off | N/A |
| 28% 36C P8 25W / 250W | 2152MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Therefore, the tensorflow version was upgraded to 2.15 to be compatible with the fried rice GPUs. However, on midway the NVIDIA driver is quite outdated (very likely not maintained anymore) and doesn't meet the requirements of tensorflow 2.15, as shown in the nvidia-smi output below. So, if we really want this GPU resource, we would need to maintain a separate environment with an older tensorflow/CUDA stack just for midway; otherwise, I suggest we simply treat the gpu2 partition as a CPU-only resource. What do you think?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:09:00.0 Off | 0 |
| N/A 32C P8 27W / 149W | 0MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
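As a quick sanity check (not part of the PR itself), one can verify whether the installed tensorflow build actually sees a GPU in a given environment; with the old CUDA 11.2 driver on midway this is expected to come back empty for tensorflow 2.15:

```bash
# prints the list of GPUs visible to tensorflow; an empty list means no usable GPU
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```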
Thanks Yue for the hard work and detailed check! I totally agree we can use it as a CPU-only node now.
As @shenyangshi mentioned in #115, there are more possible partitions than what we previously listed in utilix.batchq, and the actually available ones, bigmem2 and gpu2, are added here (see Yue's response in #115). Besides, the unit test for batchq has also been updated to ensure that all of the partitions listed here are actually supported. I have tested on midway2, midway3, and dali, and all the tests passed.
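For anyone who wants to double-check on a login node, the listed partitions can also be queried directly with sinfo (the partition names below are just the two added in this PR, used as an illustration):

```bash
# show whether the newly listed partitions actually exist on this cluster
# %P = partition name, %a = availability, %D = number of nodes
sinfo --partition=bigmem2,gpu2 --noheader --format="%P %a %D"
```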