Closed lars-t-hansen closed 1 month ago
/sys/module/nvidia is another one.
Or /proc/driver, for NVIDIA at least: https://download.nvidia.com/XFree86/Linux-x86_64/435.17/README/procinterface.html
So far so good:
- Fox gpu-1: has /proc/driver/nvidia and /sys/module/nvidia
- Fox c1-10: does not have those files
- Betzy gpu node: has /sys/module/nvidia
- Betzy compute node: does not have /sys/module/nvidia
- ML4: has /sys/module/amdgpu and not /sys/module/nvidia
- ML1: has /sys/module/nvidia
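For repeating this survey on other nodes, here is a minimal sketch in Python (not Sonar code; the function name `survey_markers` and the `root` parameter are made up for illustration). It only reports whether each of the three marker paths mentioned above exists. The `root` parameter is there so the check can be pointed at a scratch directory for testing.

```python
from pathlib import Path

# Marker paths taken from the survey in this thread, relative to the
# filesystem root.
MARKERS = ("proc/driver/nvidia", "sys/module/nvidia", "sys/module/amdgpu")


def survey_markers(root="/"):
    """Return a dict mapping each marker path to True if it exists under root."""
    base = Path(root)
    return {"/" + marker: (base / marker).exists() for marker in MARKERS}


if __name__ == "__main__":
    # Print one line per marker for the node this runs on.
    for path, present in survey_markers().items():
        print(f"{path}: {'present' if present else 'absent'}")
```

Running it on a GPU node and on a compute node should reproduce the kind of list given above, assuming the same driver layout.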
I'm trying to find out about Lumi, it should (ideally) be like ML4. The idea isn't that we're trying to run Sonar on Lumi, but that there's a reliable pattern, undocumented though it may be.
Marcin confirms that Lumi is like ML4.
Probably this is not the last word, but it'll do for now. We'll add gpu detection for #44 and that will also fix this bug.
Need to deploy to Betzy but that's handled outside this bug.
Related to #87 clearly and even more to #44.
On the new Betzy OS, the same image is used everywhere, so nvidia-smi is available everywhere. This causes Sonar to report gpufail=1 on all the non-GPU nodes, because it is able to run nvidia-smi but can't parse the output. The output is this:
To do better, we can't use the presence of nvidia-smi as a flag that GPUs are available, but must look for drivers or devices. There's /sys/module/nvidia_uvm which might indicate presence of an NVIDIA GPU and /sys/module/amdgpu which might indicate presence of an AMD GPU, but there may be something better.
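A rough sketch of that idea in Python, for discussion only (this is not how Sonar is implemented, and `detect_gpu_vendor` and its `sys_module` parameter are hypothetical names). It assumes that a loaded kernel module shows up as a directory under /sys/module and treats that as the signal, rather than whether nvidia-smi is on the PATH.

```python
from pathlib import Path
from typing import Optional


def detect_gpu_vendor(sys_module: str = "/sys/module") -> Optional[str]:
    """Guess the GPU vendor from loaded kernel modules, or None if no match.

    Assumption: a loaded driver appears as a directory under sys_module.
    The module names are the ones mentioned in this thread.
    """
    base = Path(sys_module)
    if any((base / name).is_dir() for name in ("nvidia", "nvidia_uvm")):
        return "nvidia"
    if (base / "amdgpu").is_dir():
        return "amd"
    return None


def should_run_nvidia_smi(sys_module: str = "/sys/module") -> bool:
    """Only probe with nvidia-smi when an NVIDIA driver module is loaded."""
    return detect_gpu_vendor(sys_module) == "nvidia"
```

With a check like this, a node without any of those modules would skip the nvidia-smi probe entirely instead of running it and failing to parse the output.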
Dependencies: