breuleux / voir

Modular general instrumentation for Python

voir not detecting GPUs correctly under cgroups restrictions #10

Open ppomorsk opened 8 months ago

ppomorsk commented 8 months ago

On our clusters (Narval at Calcul Quebec, for example) a job can be submitted that is assigned only one of the multiple GPUs on a compute node. Using cgroups, we restrict the job so that only that single GPU is visible to nvidia-smi, and we naturally set CUDA_VISIBLE_DEVICES=0 in the job environment by default.

However, if the GPU's Minor Number does not happen to be zero, voir will not detect it. If CUDA_VISIBLE_DEVICES is unset, or if it is set to the GPU's Minor Number or UUID, then voir is able to detect the GPU.

It would be useful if voir were able to detect the GPU in such cases even when CUDA_VISIBLE_DEVICES=0 is set.

Here is the output from a single GPU interactive job launched using salloc.

[ppomorsk@ng20601 ~]$ echo $CUDA_VISIBLE_DEVICES
0
[ppomorsk@ng20601 ~]$ python -c 'from voir.instruments.gpu import get_gpu_info; print(get_gpu_info()["gpus"].values())'
dict_values([])
[ppomorsk@ng20601 ~]$ unset CUDA_VISIBLE_DEVICES
[ppomorsk@ng20601 ~]$ python -c 'from voir.instruments.gpu import get_gpu_info; print(get_gpu_info()["gpus"].values())'
dict_values([{'device': '2', 'product': 'NVIDIA A100-SXM4-40GB', 'memory': {'used': 625.25, 'total': 40960.0}, 'utilization': {'compute': 0, 'memory': 0.015264892578125}, 'temperature': 30, 'power': 52.977, 'selection_variable': 'CUDA_VISIBLE_DEVICES'}])
[ppomorsk@ng20601 ~]$ nvidia-smi -q | grep UUID
    GPU UUID : GPU-97364847-2375-f0fe-7958-5c43a02d95ad
[ppomorsk@ng20601 ~]$ export CUDA_VISIBLE_DEVICES=GPU-97364847-2375-f0fe-7958-5c43a02d95ad
[ppomorsk@ng20601 ~]$ python -c 'from voir.instruments.gpu import get_gpu_info; print(get_gpu_info()["gpus"].values())'
dict_values([{'device': '2', 'product': 'NVIDIA A100-SXM4-40GB', 'memory': {'used': 625.25, 'total': 40960.0}, 'utilization': {'compute': 0, 'memory': 0.015264892578125}, 'temperature': 30, 'power': 52.977, 'selection_variable': 'CUDA_VISIBLE_DEVICES'}])
[ppomorsk@ng20601 ~]$ nvidia-smi -q | grep Minor
    Minor Number : 2
[ppomorsk@ng20601 ~]$ export CUDA_VISIBLE_DEVICES=2
[ppomorsk@ng20601 ~]$ python -c 'from voir.instruments.gpu import get_gpu_info; print(get_gpu_info()["gpus"].values())'
dict_values([{'device': '2', 'product': 'NVIDIA A100-SXM4-40GB', 'memory': {'used': 625.25, 'total': 40960.0}, 'utilization': {'compute': 0, 'memory': 0.015264892578125}, 'temperature': 30, 'power': 52.916, 'selection_variable': 'CUDA_VISIBLE_DEVICES'}])
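To illustrate the mismatch the transcript shows, here is a minimal sketch that queries the devices through pynvml (this is only an illustration, not necessarily the code path voir takes). Inside the cgroup, NVML enumerates exactly one device, but its minor number is 2, so any filter that compares the literal value of CUDA_VISIBLE_DEVICES ("0") against minor numbers discards it:

# Sketch only: why matching CUDA_VISIBLE_DEVICES against minor numbers fails
# inside a cgroup that exposes a single, non-zero GPU.
import os
import pynvml

pynvml.nvmlInit()
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")

for i in range(pynvml.nvmlDeviceGetCount()):          # in the cgroup: count == 1
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    minor = pynvml.nvmlDeviceGetMinorNumber(handle)    # 2 on this node
    uuid = pynvml.nvmlDeviceGetUUID(handle)
    matched_by_minor = str(minor) in visible           # False -> GPU "not detected"
    matched_by_index = str(i) in visible               # True  -> what the job expects
    print(i, minor, uuid, matched_by_minor, matched_by_index)

pynvml.nvmlShutdown()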

mboisson commented 8 months ago

For what it's worth, nvidia-smi --id=0 works correctly within the job's cgroup even when the assigned GPU is not GPU 0 on the host. Conversely, nvidia-smi --id=1 correctly fails when only one GPU is assigned to the cgroup, regardless of which GPU that is (only --id=0 is actually available).
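A rough sketch of that behaviour from Python, calling nvidia-smi by enumeration index through subprocess (the particular query fields chosen here are just for illustration):

# Sketch: inside the cgroup, enumeration index 0 is the assigned GPU,
# whatever its minor number is on the host.
import subprocess

ok = subprocess.run(
    ["nvidia-smi", "--id=0", "--query-gpu=index,name,uuid", "--format=csv,noheader"],
    capture_output=True, text=True,
)
print(ok.returncode, ok.stdout.strip())      # 0: lists the single assigned GPU

fail = subprocess.run(
    ["nvidia-smi", "--id=1", "--query-gpu=index,name,uuid", "--format=csv,noheader"],
    capture_output=True, text=True,
)
print(fail.returncode, fail.stderr.strip())  # non-zero: no device with index 1 in this cgroup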

Delaunay commented 1 month ago

This should have been fixed in https://github.com/breuleux/voir/commit/405c8a59d3622dfd94079c2dc06388c5fd3bd3fd; using the minor number was the problem.
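For context, the gist of that kind of fix, as a sketch only (this is a hypothetical helper, not the code in the commit): treat integer entries of CUDA_VISIBLE_DEVICES as enumeration indices and UUID entries as UUIDs, and never compare against the device's minor number.

# Sketch of index/UUID-based GPU selection (hypothetical, for illustration).
import os
import pynvml

def visible_gpus():
    pynvml.nvmlInit()
    try:
        raw = os.environ.get("CUDA_VISIBLE_DEVICES")
        selection = None if raw is None else [x.strip() for x in raw.split(",") if x.strip()]
        gpus = {}
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            uuid = pynvml.nvmlDeviceGetUUID(handle)
            uuid = uuid.decode() if isinstance(uuid, bytes) else uuid
            # Select by enumeration index or UUID, not by minor number.
            if selection is None or str(i) in selection or uuid in selection:
                name = pynvml.nvmlDeviceGetName(handle)
                gpus[i] = {
                    "product": name.decode() if isinstance(name, bytes) else name,
                    "uuid": uuid,
                }
        return gpus
    finally:
        pynvml.nvmlShutdown()

print(visible_gpus())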