RonanQuigley opened this issue 3 months ago (status: Open)
I haven't read your issue in detail, but maybe this will help: https://docs.google.com/document/d/1H-ddA11laPQf_1olwXRjEDbzNihxprjPr74pZ4Vdf2M/edit
Furthermore, the presence of the nvidia.com/mps.capable=true label triggers the creation of a daemonset to manage the MPS control daemon.
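As a sanity check, something like the following should show whether that label was applied and whether the daemonset was created (the grep pattern is just a guess at the daemonset's name, which varies by release):

```sh
# Did the plugin/GFD label the node as MPS-capable?
kubectl get nodes -L nvidia.com/mps.capable

# Was a daemonset created for the MPS control daemon?
kubectl get daemonsets -A | grep -i mps
```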
Thanks, I did read that doc before posting the issue. The problem is that the daemonset creation it describes never happens.
So I don't know why, but if I reboot the offending machines after enabling MPS via the config map, then the MPS control daemon pods start up.
It'd be good to get to the bottom of why this is, as it took me hours to figure out, and others might be hitting the same problem. Any ideas on what I can look at?
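In case it helps anyone else hitting this: a less drastic workaround than a full reboot might be restarting the device plugin pods so they re-read the config map. This is a sketch using my release and namespace names, so adjust as needed:

```sh
# Restart the device plugin daemonset so it re-reads the config map,
# instead of rebooting the whole node (daemonset name assumed from
# a release named "nvdp" of the nvidia-device-plugin chart)
kubectl -n nvidia-device-plugin rollout restart daemonset nvdp-nvidia-device-plugin

# Then watch for the MPS control daemon pods to appear
kubectl get pods -A -w | grep -i mps
```

I haven't verified whether this is sufficient, or whether the reboot is doing something the restart doesn't.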
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
1. Quick Debug Information
2. Issue or feature description
I'm struggling to understand how to enable MPS with the provided README. I'm using the nvidia device plugin helm chart, version 0.15.0 (not the gpu-operator chart).
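For reference, this is roughly what I'm doing, based on my reading of the README's sharing config (the replica count is illustrative, and the release/namespace names are mine):

```sh
# Plugin config enabling MPS sharing, per the README's sharing section
# (replicas value is illustrative)
cat <<'EOF' > /tmp/mps-config.yaml
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
EOF

# Deploy/upgrade the chart pointing at that config, following the
# README's single-config-file example, if I'm reading it right
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --version 0.15.0 \
  --set-file config.map.config=/tmp/mps-config.yaml
```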
Am I supposed to do something else after enabling MPS via the config map? I've also tried going onto the relevant GPU worker node and starting MPS manually with
nvidia-cuda-mps-control -d
but that made no difference.

Logs from the `nvidia-device-plugin-ctr` container in the `nvidia-device-plugin` pod:

Additional information that might help better understand your environment and reproduce the bug:
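For anyone reproducing, the logs and environment details were collected with roughly the following (the pod name is a placeholder; substitute your own):

```sh
# Logs from the plugin container referenced above
kubectl -n nvidia-device-plugin logs <device-plugin-pod> -c nvidia-device-plugin-ctr

# Environment details
kubectl get nodes -o wide
helm list -n nvidia-device-plugin
```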