ComputeCanada / puppet-magic_castle

Puppet Environment repo for Magic Castle - https://github.com/ComputeCanada/magic_castle
MIT License

Investigate nvidia_drm preventing gpu reset when applying mig profile #387

Open cmd-ntrf opened 1 month ago

cmd-ntrf commented 1 month ago

mig-parted apply returns the following error in some circumstances:

time="2024-09-30T19:49:46Z" level=error msg="\nThe following GPUs could not be reset:\n  GPU 00000000:00:06.0: In use by another client\n\n1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.\n"

There are no other services or processes that use the GPU, but calling sudo modprobe -r nvidia_drm first allows mig-parted to run afterwards. Given that DRM stands for Direct Rendering Manager, I am not sure we need this kernel module.
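To help confirm whether nvidia_drm is the module holding the GPU, here is a minimal sketch that reads /proc/modules and reports each module's use count and the modules that depend on it. The helper names (parse_proc_modules, drm_unload_command) are hypothetical and not part of puppet-magic_castle or mig-parted; the script only builds the modprobe command and does not run it.

```python
from pathlib import Path


def parse_proc_modules(text):
    """Map module name -> (use count, list of modules that depend on it)."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, _size, refcount, users = fields[:4]
        # The fourth field is "-" when nothing depends on the module,
        # otherwise a comma-terminated list such as "nvidia_drm,".
        holders = [] if users == "-" else [u for u in users.split(",") if u]
        modules[name] = (int(refcount), holders)
    return modules


def drm_unload_command(modules, module="nvidia_drm"):
    """Return the modprobe command to run before mig-parted, or None if the
    module is not loaded (hypothetical helper, for illustration only)."""
    if module not in modules:
        return None
    return ["modprobe", "-r", module]


proc_modules = Path("/proc/modules")
if proc_modules.exists():  # Linux only
    loaded = parse_proc_modules(proc_modules.read_text())
    print("nvidia_drm:", loaded.get("nvidia_drm"))
    print("unload with:", drm_unload_command(loaded))
```

If nvidia_drm shows up with a non-zero use count, something still holds it and modprobe -r would likely fail, which would be worth noting here.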

bartoldeman commented 1 month ago

As far as I know, DRM and the /dev/dri/cardX devices are used by EGL, and hence by VirtualGL, if you want to run something on a compute node that renders via the GPU.