NordicHPC / sonar

Tool to profile usage of HPC resources by regularly probing processes.
GNU General Public License v3.0

Sonar must deal with nvidia-smi being available on non-gpu nodes #188

Closed lars-t-hansen closed 1 month ago

lars-t-hansen commented 1 month ago

Clearly related to #87, and even more so to #44.

On the new Betzy OS, the same image is used on every node, so nvidia-smi is available everywhere. This causes Sonar to report gpufail=1 on all the non-GPU nodes, because it is able to run nvidia-smi but can't parse the output. The output is this:

[...@b1101.BETZY ~]$ nvidia-smi 
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

To do better, we can't use the presence of nvidia-smi as a flag that GPUs are available, but must look for drivers or devices. There's /sys/module/nvidia_uvm which might indicate presence of an NVIDIA GPU and /sys/module/amdgpu which might indicate presence of an AMD GPU, but there may be something better.
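
A minimal sketch of that idea, in Python for illustration only (it is not Sonar's code). The function name `detect_gpu_vendors` and the vendor-to-module mapping are assumptions made for the example; the two directory names are the ones mentioned above, and whether they are the best indicators is still open.

```python
from pathlib import Path

# Kernel module directory names from the comment above. The mapping to
# vendor labels is illustrative, not something Sonar defines.
VENDOR_MODULES = {
    "nvidia": ("nvidia_uvm",),
    "amd": ("amdgpu",),
}


def detect_gpu_vendors(sys_module_dir="/sys/module"):
    """Return a sorted list of vendors whose module directory exists.

    The directory is a parameter so the check can be exercised against a
    fake tree instead of the real /sys/module.
    """
    root = Path(sys_module_dir)
    found = []
    for vendor, modules in VENDOR_MODULES.items():
        # A loaded kernel module shows up as a directory under /sys/module.
        if any((root / name).is_dir() for name in modules):
            found.append(vendor)
    return sorted(found)


if __name__ == "__main__":
    print(detect_gpu_vendors())
```

The point of the sketch is that the decision depends only on what the kernel has loaded, not on which binaries happen to be installed in the image.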


lars-t-hansen commented 1 month ago

/sys/module/nvidia is another one.

lars-t-hansen commented 1 month ago

Or /proc/driver, for NVIDIA at least: https://download.nvidia.com/XFree86/Linux-x86_64/435.17/README/procinterface.html
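
As a rough sketch of how the /proc interface could be used, assuming the layout described in the linked README (a `gpus` subdirectory under /proc/driver/nvidia with one directory per device). That layout has not been checked on the nodes discussed here, and `nvidia_proc_gpus` is a hypothetical helper name.

```python
from pathlib import Path


def nvidia_proc_gpus(proc_driver_dir="/proc/driver/nvidia"):
    """Return the names of per-device entries under <dir>/gpus.

    Returns an empty list when the directory is missing, which is what we
    would expect on a node where the NVIDIA driver is not loaded.
    """
    gpus_dir = Path(proc_driver_dir) / "gpus"
    if not gpus_dir.is_dir():
        return []
    # Assumption: each device appears as one subdirectory.
    return sorted(p.name for p in gpus_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    print(nvidia_proc_gpus())
```

An empty result here would mean "no NVIDIA devices visible to the driver", which is a different condition from "nvidia-smi ran and its output could not be parsed".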

lars-t-hansen commented 1 month ago

So far so good:

Fox gpu-1: has /proc/driver/nvidia and /sys/module/nvidia.
Fox c1-10: does not have those files.
Betzy gpu node: has /sys/module/nvidia.
Betzy compute node: does not have /sys/module/nvidia.
ML4: has /sys/module/amdgpu and not /sys/module/nvidia.
ML1: has /sys/module/nvidia.

I'm trying to find out about Lumi, it should (ideally) be like ML4. The idea isn't that we're trying to run Sonar on Lumi, but that there's a reliable pattern, undocumented though it may be.

lars-t-hansen commented 1 month ago

Marcin confirms that Lumi is like ML4.

Probably this is not the last word, but it'll do for now. We'll add GPU detection for #44 and that will also fix this bug.
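
To illustrate why detection also fixes this bug, here is a hedged Python sketch of the gating logic; it is not Sonar's implementation. `nvidia_status` and the `run_smi` callable are hypothetical names: `run_smi` stands in for "run nvidia-smi and parse its output" and is assumed to raise `OSError` or `ValueError` when that fails.

```python
import os


def nvidia_status(run_smi, module_dir="/sys/module/nvidia"):
    """Return (gpus, gpufail) for the NVIDIA side of a node.

    run_smi is a hypothetical callable that returns a list of GPU records
    or raises OSError/ValueError if nvidia-smi cannot be run or parsed.
    """
    if not os.path.isdir(module_dir):
        # No driver loaded: the node simply has no NVIDIA GPUs, so this
        # must not be reported as a failure even if nvidia-smi is installed.
        return [], 0
    try:
        return run_smi(), 0
    except (OSError, ValueError):
        # The driver is present but the probe failed: a genuine gpufail.
        return [], 1
```

With this shape, a non-GPU node that happens to have nvidia-smi in its image yields `([], 0)`, and gpufail=1 is reserved for nodes where a GPU is expected but the probe breaks.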

lars-t-hansen commented 1 month ago

Need to deploy to Betzy but that's handled outside this bug.