NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Failed to send command to MPS daemon #762

Open RonanQuigley opened 3 months ago

RonanQuigley commented 3 months ago

1. Quick Debug Information

2. Issue or feature description

I'm struggling to understand how to enable MPS from the provided README. I'm using version 0.15.0 of the nvidia-device-plugin Helm chart (not the gpu-operator chart).

Am I supposed to do something after enabling MPS via the config map? I've also tried going onto the relevant GPU worker node and starting MPS manually with nvidia-cuda-mps-control -d, but that made no difference:

[2024-06-10 15:16:40.777 Control 111377] Starting control daemon using socket /tmp/nvidia-mps/control
[2024-06-10 15:16:40.777 Control 111377] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
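For what it's worth, a quick way to confirm that this manually started daemon is responding (a sketch, using the pipe directory printed in the log above) is to pipe a query into nvidia-cuda-mps-control:

# Ask the manually started control daemon for its server list over the same
# pipe directory it printed above (empty output just means no CUDA clients
# have attached yet).
echo get_server_list | CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps nvidia-cuda-mps-control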

Logs from the nvidia-device-plugin-ctr container in the nvidia-device-plugin pod:

Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "failRequestsGreaterThanOne": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 20
        }
      ]
    }
  }
}
I0610 15:26:41.022164      39 main.go:279] Retrieving plugins.
I0610 15:26:41.022191      39 factory.go:104] Detected NVML platform: found NVML library
I0610 15:26:41.022226      39 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0610 15:26:41.076279      39 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: failed to send command to MPS daemon: exit status 1
I0610 15:26:41.076311      39 main.go:208] Failed to start one or more plugins. Retrying in 30s...
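Note that the plugin config above shows mpsRoot set to /run/nvidia/mps, so the failing health check is aimed at the chart-managed control daemon under that root, not at a daemon started manually with the default /tmp/nvidia-mps pipe directory. A hedged set of checks, assuming the chart was installed into <NAMESPACE> and the MPS control daemon components have "mps" in their names:

# On the GPU worker node: has anything populated the pipe directory the plugin uses?
ls -l /run/nvidia/mps

# In the cluster: did the chart create the MPS control daemon daemonset and pods?
kubectl get daemonsets,pods -n <NAMESPACE> | grep -i mps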
# values.yaml
nodeSelector:
  nvidia.com/gpu: "true"

gfd:
  enabled: true
  nameOverride: gpu-feature-discovery
  namespaceOverride: <NAMESPACE>
  nodeSelector:
    nvidia.com/gpu: "true"

nfd:
  master:
    nodeSelector:
      nvidia.com/gpu: "true"
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  worker:
    nodeSelector:
      nvidia.com/gpu: "true"

config:
  name: nvidia-device-plugin-config
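For reference, a minimal sketch of how this values file is normally applied, assuming the release name nvdp and the Helm repo URL from the project README; the chart then reads the config map named in config.name from its own namespace:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# values.yaml above points config.name at the ConfigMap defined in the next file
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace <NAMESPACE> --create-namespace \
  --version 0.15.0 \
  -f values.yaml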
# nvidia-device-plugin-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: <NAMESPACE>
data:
  config: |-
    version: v1
    sharing:
      mps:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 20
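And a sketch of applying the config map and restarting the plugin so it re-reads the sharing config; the daemonset name is assumed from a default install with release name nvdp and may differ:

kubectl apply -f nvidia-device-plugin-config.yaml
# Bounce the plugin pods so they pick up the new MPS sharing settings
# (daemonset name assumed; check with `kubectl get ds -n <NAMESPACE>`).
kubectl -n <NAMESPACE> rollout restart daemonset nvdp-nvidia-device-plugin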

Additional information that might help better understand your environment and reproduce the bug:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:BE:00.0 Off |                    0 |
| N/A   67C    P0            279W /  350W |    3809MiB /  46068MiB |     97%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
klueska commented 3 months ago

I haven't read your issue in detail, but maybe this will help: https://docs.google.com/document/d/1H-ddA11laPQf_1olwXRjEDbzNihxprjPr74pZ4Vdf2M/edit

RonanQuigley commented 3 months ago

Furthermore, the presence of the nvidia.com/mps.capable=true label triggers the creation of a daemonset to manage the MPS control daemon.

Thanks, I did read that doc before posting the issue. The problem is that this daemonset is never created.
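A hedged way to verify the condition described in that quote, assuming gpu-feature-discovery is what applies the label:

# Is the MPS capability label present on the GPU nodes?
kubectl get nodes -L nvidia.com/mps.capable

# Are the GFD pods and the MPS control daemon pods running in the chart's namespace?
kubectl get pods -n <NAMESPACE> | grep -iE 'feature-discovery|mps'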

RonanQuigley commented 3 months ago

I don't know why, but if I reboot the offending machines after enabling MPS via the config map, the MPS control daemon pods start up.

It'd be good to get to the bottom of why this is, as it took me hours to figure out and others might be hitting the same problem. Any ideas on what I can look at?
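One thing that might be worth trying instead of a full reboot (a sketch, assuming the daemonset names from a default install with release name nvdp) is restarting the GFD and device plugin daemonsets so the label and the MPS control daemon get re-evaluated:

kubectl -n <NAMESPACE> rollout restart daemonset nvdp-gpu-feature-discovery
kubectl -n <NAMESPACE> rollout restart daemonset nvdp-nvidia-device-plugin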

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.