K3S - Failed to start plugin: error waiting for MPS daemon

FrsECM commented 3 months ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

OS/Version: Ubuntu 20.04.6 LTS
Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K3S Rancher

2. Issue or feature description

I have installed and configured GPU operator v23.9.2

I updated k8s-device-plugin to the v0.15.0 by editing the yaml template.

...
devicePlugin:
enabled: true
repository: nvcr.io/nvidia
image: k8s-device-plugin
version: "v0.15.0"
imagePullPolicy: IfNotPresent
env: 
  - name: MPS_ROOT
    value: "/run/nvidia/mps"
...

I created a "sharing" config map, including MPS and timeslicing config, in order to switch from one to the other :

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-sharing-config
  namespace: gpu-operator
data:
    a6000-ts-6: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
            - name: nvidia.com/gpu
              replicas: 6
    a6000-mps-4: |-
        version: v1
        sharing:
          mps:
            resources:
            - name: nvidia.com/gpu
              replicas: 8

There is absolutely no issue with the time slicing :

kubectl label node mitcv01 nvidia.com/device-plugin.config=a6000-ts-4 --overwrite
kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset
# Logs : 
# I0516 13:17:50.681437      30 main.go:279] Retrieving plugins.
# I0516 13:17:50.682823      30 factory.go:104] Detected NVML platform: found NVML library
# I0516 13:17:50.682909      30 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
# I0516 13:17:50.724323      30 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
# I0516 13:17:50.725710      30 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
# I0516 13:17:50.729389      30 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

But if i want to use MPS, i have this issue :

kubectl label node mitcv01 nvidia.com/device-plugin.config=a6000-mps-4 --overwrite
kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset

# Logs : 
# I0516 13:19:07.992446      31 main.go:279] Retrieving plugins.
# I0516 13:19:07.993340      31 factory.go:104] Detected NVML platform: found NVML library
# I0516 13:19:07.993402      31 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
# E0516 13:19:08.046087      31 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: # failed to send command to MPS daemon: exit status 1
# I0516 13:19:08.046116      31 main.go:208] Failed to start one or more plugins. Retrying in 30s...

Can you help me to figure out what i did wrong ? Thanks,

3. Information to attach (optional if deemed irrelevant)

Common error checking:

[ ] The output of nvidia-smi -a on your host
[x] The k8s-device-plugin container logs
[ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

[x] Docker version from docker version

(base) ➜  $✗ nvidia-docker --version
Docker version 25.0.3, build 4debf41

[x] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'

```bash ||/ Name Version Architecture Description +++-====================================-===================-============-========================================================= un libgldispatch0-nvidia (no description available) ii libnvidia-cfg1-550:amd64 550.54.15-0ubuntu1 amd64 NVIDIA binary OpenGL/GLX configuration library un libnvidia-cfg1-any (no description available) un libnvidia-common (no description available) ii libnvidia-common-530 550.54.15-0ubuntu1 all Transitional package for libnvidia-common-550 ii libnvidia-common-550 550.54.15-0ubuntu1 all Shared files used by the NVIDIA libraries un libnvidia-compute (no description available) rc libnvidia-compute-515:amd64 515.105.01-0ubuntu1 amd64 NVIDIA libcompute package ii libnvidia-compute-530:amd64 550.54.15-0ubuntu1 amd64 Transitional package for libnvidia-compute-550 ii libnvidia-compute-550:amd64 550.54.15-0ubuntu1 amd64 NVIDIA libcompute package ii libnvidia-container-tools 1.15.0-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.15.0-1 amd64 NVIDIA container runtime library un libnvidia-decode (no description available) ii libnvidia-decode-530:amd64 550.54.15-0ubuntu1 amd64 Transitional package for libnvidia-decode-550 ii libnvidia-decode-550:amd64 550.54.15-0ubuntu1 amd64 NVIDIA Video Decoding runtime libraries un libnvidia-encode (no description available) ii libnvidia-encode-530:amd64 550.54.15-0ubuntu1 amd64 Transitional package for libnvidia-encode-550 ii libnvidia-encode-550:amd64 550.54.15-0ubuntu1 amd64 NVENC Video Encoding runtime library un libnvidia-extra (no description available) ii libnvidia-extra-550:amd64 550.54.15-0ubuntu1 amd64 Extra libraries for the NVIDIA driver un libnvidia-fbc1 (no description available) ii libnvidia-fbc1-530:amd64 550.54.15-0ubuntu1 amd64 Transitional package for libnvidia-fbc1-550 ii libnvidia-fbc1-550:amd64 550.54.15-0ubuntu1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library un libnvidia-gl (no description available) ii libnvidia-gl-530:amd64 550.54.15-0ubuntu1 amd64 Transitional package for libnvidia-gl-550 ii libnvidia-gl-550:amd64 550.54.15-0ubuntu1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD un libnvidia-ml.so.1 (no description available) un libnvidia-ml1 (no description available) un nvidia-384 (no description available) un nvidia-390 (no description available) un nvidia-common (no description available) un nvidia-compute-utils (no description available) rc nvidia-compute-utils-515 515.105.01-0ubuntu1 amd64 NVIDIA compute utilities ii nvidia-compute-utils-530:amd64 550.54.15-0ubuntu1 amd64 Transitional package for nvidia-compute-utils-550 ii nvidia-compute-utils-550 550.54.15-0ubuntu1 amd64 NVIDIA compute utilities un nvidia-container-runtime (no description available) un nvidia-container-runtime-hook (no description available) ii nvidia-container-toolkit 1.15.0-1 amd64 NVIDIA Container toolkit ii nvidia-container-toolkit-base 1.15.0-1 amd64 NVIDIA Container Toolkit Base rc nvidia-dkms-515 515.105.01-0ubuntu1 amd64 NVIDIA DKMS package ii nvidia-dkms-530 550.54.15-0ubuntu1 amd64 Transitional package for nvidia-dkms-550 ii nvidia-dkms-550 550.54.15-0ubuntu1 amd64 NVIDIA DKMS package un nvidia-dkms-kernel (no description available) un nvidia-docker (no description available) ii nvidia-docker2 2.13.0-1 all nvidia-docker CLI wrapper ii nvidia-driver-530 550.54.15-0ubuntu1 amd64 Transitional package for nvidia-driver-550 ii nvidia-driver-550 550.54.15-0ubuntu1 amd64 NVIDIA driver metapackage un nvidia-driver-binary (no description available) un nvidia-fabricmanager (no description available) rc nvidia-fabricmanager-515 515.105.01-1 amd64 Fabric Manager for NVSwitch based systems. ii nvidia-fabricmanager-530 530.30.02-1 amd64 Fabric Manager for NVSwitch based systems. rc nvidia-fabricmanager-550 550.54.15-1 amd64 Fabric Manager for NVSwitch based systems. ii nvidia-firmware-550-550.54.15 550.54.15-0ubuntu1 amd64 Firmware files used by the kernel module un nvidia-firmware-550-server-550.54.15 (no description available) un nvidia-kernel-common (no description available) rc nvidia-kernel-common-515 515.105.01-0ubuntu1 amd64 Shared files used with the kernel module ii nvidia-kernel-common-530:amd64 550.54.15-0ubuntu1 amd64 Transitional package for nvidia-kernel-common-550 ii nvidia-kernel-common-550 550.54.15-0ubuntu1 amd64 Shared files used with the kernel module un nvidia-kernel-open-515 (no description available) un nvidia-kernel-open-530 (no description available) un nvidia-kernel-source (no description available) un nvidia-kernel-source-515 (no description available) ii nvidia-kernel-source-530 550.54.15-0ubuntu1 amd64 Transitional package for nvidia-kernel-source-550 ii nvidia-kernel-source-550 550.54.15-0ubuntu1 amd64 NVIDIA kernel source package un nvidia-legacy-304xx-vdpau-driver (no description available) un nvidia-legacy-340xx-vdpau-driver (no description available) ii nvidia-modprobe 550.54.15-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files un nvidia-opencl-icd (no description available) un nvidia-persistenced (no description available) ii nvidia-prime 0.8.16~0.20.04.2 all Tools to enable NVIDIA's Prime ii nvidia-settings 550.54.15-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver un nvidia-settings-binary (no description available) un nvidia-smi (no description available) un nvidia-utils (no description available) ii nvidia-utils-530:amd64 550.54.15-0ubuntu1 amd64 Transitional package for nvidia-utils-550 ii nvidia-utils-550 550.54.15-0ubuntu1 amd64 NVIDIA driver support binaries un nvidia-vdpau-driver (no description available) ii xserver-xorg-video-nvidia-530:amd64 550.54.15-0ubuntu1 amd64 Transitional package for xserver-xorg-video-nvidia-550 ii xserver-xorg-video-nvidia-550 550.54.15-0ubuntu1 amd64 NVIDIA binary Xorg driver ```

[x] NVIDIA container library version from nvidia-container-cli -V

nvidia-container-cli -V
cli-version: 1.15.0
lib-version: 1.15.0
build date: 2024-04-15T13:36+00:00
build revision: 6c8f1df7fd32cea3280cf2a2c6e931c9b3132465
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64

elezar commented 3 months ago

Which driver version are you using?

Does the log of the mps-control-daemon-ctr show any additional output?

elezar commented 3 months ago

Also to clarify. Is the device plugin deployed using the GPU operator or using the standalone helm chart?

FrsECM commented 3 months ago

Which driver version are you using?

Does the log of the mps-control-daemon-ctr show any additional output?

I'am using version 550 of the driver. I don't have mps-control-daemon-ctr, maybe the problem is here ! Do you have a template to install it without helm ?

At the beggining i was using the plugin deployed with GPU operator (v23.9.2) but i manually overrided the yaml in order to target k8s-device-plugin v0.15.0 instead of v0.14.

FrsECM commented 3 months ago

I did the installation of the control daemon as an "extra", it's now up and running. I did it with this template :

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvdp-nvidia-device-plugin-mps-control-daemon
  namespace: gpu-operator
  labels:
    helm.sh/chart: nvidia-device-plugin-0.15.0
    app.kubernetes.io/name: nvidia-device-plugin
    app.kubernetes.io/instance: nvdp
    app.kubernetes.io/version: "0.15.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-device-plugin
      app.kubernetes.io/instance: nvdp
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-device-plugin
        app.kubernetes.io/instance: nvdp
      annotations:
        {}
    spec:
      priorityClassName: system-node-critical
      securityContext:
        {}
      initContainers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        name: mps-control-daemon-mounts
        command: [mps-control-daemon, mount-shm]
        securityContext:
          privileged: true
        volumeMounts:
        - name: mps-root
          mountPath: /mps
          mountPropagation: Bidirectional
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
          imagePullPolicy: IfNotPresent
          name: mps-control-daemon-ctr
          command: [mps-control-daemon]
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          - name: NVIDIA_MIG_MONITOR_DEVICES
            value: all
          - name: NVIDIA_VISIBLE_DEVICES
            value: all
          - name: NVIDIA_DRIVER_CAPABILITIES
            value: compute,utility
          securityContext:
            privileged: true
          volumeMounts:
          - name: mps-shm
            mountPath: /dev/shm
          - name: mps-root
            mountPath: /mps
      volumes:
      - name: mps-root
        hostPath:
          path: /run/nvidia/mps
          type: DirectoryOrCreate
      - name: mps-shm
        hostPath:
          path: /run/nvidia/mps/shm
      nodeSelector:
        # We only deploy this pod if the following sharing label is applied.
        nvidia.com/mps.capable: "true"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: feature.node.kubernetes.io/pci-10de.present
                operator: In
                values:
                - "true"
            - matchExpressions:
              - key: feature.node.kubernetes.io/cpu-model.vendor_id
                operator: In
                values:
                - NVIDIA
            - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                - "true"
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists

I also modified the node labels in order to say that mps is enabled :

kubectl label node mitcv01 nvidia.com/mps.capable="true" --overwrite

The daemon starts, but it says that a "strategy" is missing :

How can i update this strategy and setup mps-control to use the same configmap as device plugin ?

kubectl patch clusterpolicy/cluster-policy \
   -n gpu-operator --type merge \
   -p '{"spec": {"devicePlugin": {"config": {"name": "nvidia-sharing-config"}}}}'

elezar commented 3 months ago

You need to supply the same config map / name as for the device plugin. There is also a sidecar that ensures the config is up to date in the same way that the device plugin / gfd does.

Is there a reason that you don't skip the installation of the device plugin in the operator and deploy that using helm? See for example: https://docs.google.com/document/d/1H-ddA11laPQf_1olwXRjEDbzNihxprjPr74pZ4Vdf2M/edit#heading=h.9odbb6smrel8

FrsECM commented 3 months ago

Great document, thanks a lot !

It was because of a lack of knowledge about how to pass the configuration to the plugin. It seems it works now thanks to your very helpfull document !

Finally i did :

helm install --dry-run gpu-operator --wait -n gpu-operator --create-namespace \
nvidia/gpu-operator --version v23.9.2 \
--set nfd.enabled=false \
--set devicePlugin.enabled=false \
--set gfd.enabled=false \
--set toolkit.enabled=false > nvidia-gpu-operator.yaml

Then for installing MPS :

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.15.0 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set config.default=nvidia-sharing \
    --set-file config.map.nvidia-sharing=config/nvidia/config/dp-mps-6.yaml

Thanks again for your help.

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

NVIDIA / k8s-device-plugin