NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes

MPS use error: Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)! #647

Open lengrongfu opened 2 weeks ago

lengrongfu commented 2 weeks ago

1. Quick Debug Information

2. Issue or feature description

I used Helm to deploy k8s-device-plugin and configured MPS, but a workload I deployed fails with an error. The mps-control-daemon pod is running.

3. Information to attach (optional if deemed irrelevant)

I use gpu-operator (Helm chart version v23.9.1) to install the GPU driver, and the driver and toolkit both installed successfully. I then used the following Helm command to install k8s-device-plugin:

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.15.0-rc.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.name=nvidia-plugin-configs \
    --set gfd.enabled=true

The content of the nvidia-plugin-configs config is:

    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 10
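
With replicas: 10 on this single-GPU node, the plugin is expected to advertise ten nvidia.com/gpu resources once it is up; a quick way to check this (the node name is a placeholder):

$ kubectl describe node <node-name> | grep nvidia.com/gpu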

The command used to deploy the workload pod is:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

The pod status is then Error, with the following error log:

Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!
[Vector addition of 50000 elements]
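
For reference, this log can be pulled from the failed container with something like:

$ kubectl logs gpu-pod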

GPU info:

root@nvidia-driver-daemonset-4p4qs:/drivers# nvidia-smi -L
GPU 0: Tesla P40 (UUID: GPU-70a7e30d-99a5-1117-8e85-759a592fb582)
lengrongfu commented 2 weeks ago

@elezar Can you help me look into this issue?

elezar commented 2 weeks ago

Could you try to update your workload to use the following container instead:

nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1

Also, is the nvidia runtime configured as your default runtime, or are you using a runtime class? If it is the latter, you would also need to specify a runtime class in your workload.
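
For reference, a runtime class would be declared in the pod spec roughly as follows (a sketch, assuming a RuntimeClass named nvidia exists in the cluster):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  runtimeClassName: nvidia # only needed when nvidia is not the default runtime
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      resources:
        limits:
          nvidia.com/gpu: 1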

lengrongfu commented 2 weeks ago

I used the nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1 image to deploy the workload, and the error still occurs:

Failed to allocate device vector A (error code CUDA-capable device(s) is/are busy or unavailable)!
[Vector addition of 50000 elements]

The nvidia runtime is configured as the default:

$ cat /etc/containerd/config.toml
disabled_plugins = []
imports = []
oom_score = 0
plugin_dir = ""
required_plugins = []
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
version = 2

[cgroup]
  path = ""

[debug]
  address = ""
  format = ""
  gid = 0
  level = ""
  uid = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
  tcp_address = ""
  tcp_tls_ca = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0

[metrics]
  address = ""
  grpc_histogram = false

[plugins]

  [plugins."io.containerd.gc.v1.scheduler"]
    deletion_threshold = 0
    mutation_threshold = 100
    pause_threshold = 0.02
    schedule_delay = "0s"
    startup_delay = "100ms"

  [plugins."io.containerd.grpc.v1.cri"]
    cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
    device_ownership_from_security_context = false
    disable_apparmor = false
    disable_cgroup = false
    disable_hugetlb_controller = true
    disable_proc_mount = false
    disable_tcp_service = true
    enable_cdi = false
    enable_selinux = false
    enable_tls_streaming = false
    enable_unprivileged_icmp = false
    enable_unprivileged_ports = false
    ignore_image_defined_volumes = false
    max_concurrent_downloads = 3
    max_container_log_line_size = 16384
    netns_mounts_under_state_dir = false
    restrict_oom_score_adj = false
    sandbox_image = "easzlab.io.local:5000/easzlab/pause:3.9"
    selinux_category_range = 1024
    stats_collect_period = 10
    stream_idle_timeout = "4h0m0s"
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    systemd_cgroup = false
    tolerate_missing_hugetlb_controller = true
    unset_seccomp_profile = ""

    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      conf_template = "/etc/cni/net.d/10-default.conf"
      max_conf_num = 1

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      ignore_rdt_not_enabled_errors = false
      no_pivot = false
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        base_runtime_spec = ""
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          base_runtime_spec = ""
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi]
          base_runtime_spec = ""
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy]
          base_runtime_spec = ""
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy"
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          base_runtime_spec = ""
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = ""
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = true

      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        base_runtime_spec = ""
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]

    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = "node"

    [plugins."io.containerd.grpc.v1.cri".registry]

      [plugins."io.containerd.grpc.v1.cri".registry.auths]

      [plugins."io.containerd.grpc.v1.cri".registry.configs]

        [plugins."io.containerd.grpc.v1.cri".registry.configs."10.6.194.8"]

          [plugins."io.containerd.grpc.v1.cri".registry.configs."10.6.194.8".tls]
            insecure_skip_verify = true

        [plugins."io.containerd.grpc.v1.cri".registry.configs."easzlab.io.local:5000"]

          [plugins."io.containerd.grpc.v1.cri".registry.configs."easzlab.io.local:5000".tls]
            insecure_skip_verify = true

        [plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.easzlab.io.local:8443"]

          [plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.easzlab.io.local:8443".tls]
            insecure_skip_verify = true

      [plugins."io.containerd.grpc.v1.cri".registry.headers]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://docker.nju.edu.cn/", "https://kuamavit.mirror.aliyuncs.com"]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."easzlab.io.local:5000"]
          endpoint = ["http://easzlab.io.local:5000"]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."gcr.io"]
          endpoint = ["https://gcr.nju.edu.cn"]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
          endpoint = ["https://ghcr.nju.edu.cn"]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."harbor.easzlab.io.local:8443"]
          endpoint = ["https://harbor.easzlab.io.local:8443"]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."k8s.gcr.io"]
          endpoint = ["https://gcr.nju.edu.cn/google-containers/"]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."nvcr.io"]
          endpoint = ["https://ngc.nju.edu.cn"]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."quay.io"]
          endpoint = ["https://quay.nju.edu.cn"]

    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""

  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"

  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"

  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"

  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false

  [plugins."io.containerd.nri.v1.nri"]
    disable = false
    disable_connections = false
    plugin_config_path = "/etc/nri/conf.d"
    plugin_path = "/opt/nri/plugins1"
    plugin_registration_timeout = "5s"
    plugin_request_timeout = "2s"
    socket_path = "/var/run/nri/nri.sock"

  [plugins."io.containerd.runtime.v1.linux"]
    no_shim = false
    runtime = "runc"
    runtime_root = ""
    shim = "containerd-shim"
    shim_debug = false

  [plugins."io.containerd.service.v1.diff-service"]
    default = ["walking"]

  [plugins."io.containerd.snapshotter.v1.aufs"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.btrfs"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.devmapper"]
    async_remove = false
    base_image_size = ""
    pool_name = ""
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.native"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.overlayfs"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.zfs"]
    root_path = ""

[proxy_plugins]

[stream_processors]

  [stream_processors."io.containerd.ocicrypt.decoder.v1.tar"]
    accepts = ["application/vnd.oci.image.layer.v1.tar+encrypted"]
    args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
    env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
    path = "ctd-decoder"
    returns = "application/vnd.oci.image.layer.v1.tar"

  [stream_processors."io.containerd.ocicrypt.decoder.v1.tar.gzip"]
    accepts = ["application/vnd.oci.image.layer.v1.tar+gzip+encrypted"]
    args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
    env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
    path = "ctd-decoder"
    returns = "application/vnd.oci.image.layer.v1.tar+gzip"

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[ttrpc]
  address = ""
  gid = 0
  uid = 0
lengrongfu commented 2 weeks ago

I found that it is possible to run an MPS program directly on the host, but in the container it reports that the device(s) is/are busy or unavailable.

elezar commented 2 weeks ago

I found that it is possible to run an MPS program directly on the host, but in the container it reports that the device(s) is/are busy or unavailable.

Could you provide more information on how you achieved this? Note that one of the key communication mechanisms between the MPS processes is the /dev/shm that we create for the containerized daemon. How are you injecting this into the container?
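
To illustrate, if this were wired up by hand (outside the device plugin) it would look something like the sketch below; the host paths and pipe directory are assumptions, not necessarily what the plugin uses on your node:

apiVersion: v1
kind: Pod
metadata:
  name: mps-shm-test # hypothetical test pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      env:
        - name: CUDA_MPS_PIPE_DIRECTORY # standard MPS variable pointing at the control daemon's pipe directory
          value: /mps/pipe
      volumeMounts:
        - name: mps-root
          mountPath: /mps # assumed in-container location for the MPS pipe directory
        - name: mps-shm
          mountPath: /dev/shm # share the daemon's shm with the workload
      resources:
        limits:
          nvidia.com/gpu: 1
  volumes:
    - name: mps-root
      hostPath:
        path: /run/nvidia/mps # assumed host location of the MPS root
    - name: mps-shm
      hostPath:
        path: /run/nvidia/mps/shm # assumed host location of the daemon's /dev/shm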

lengrongfu commented 2 weeks ago

I found that it is possible to run an MPS program directly on the host, but in the container it reports that the device(s) is/are busy or unavailable.

Could you provide more information on how you achieved this? Note that one of the key communication mechanisms between the MPS processes is the /dev/shm that we create for the containerized daemon. How are you injecting this into the container?

First thanks for the quick answer.

Here are the steps I followed to use MPS in containers:

  1. I use gpu-operator to install the driver and toolkit.
  2. I use k8s-device-plugin to deploy the MPS control daemon and the device plugin:
    helm upgrade -i nvdp nvdp/nvidia-device-plugin --version=0.15.0-rc.2 --namespace nvidia-device-plugin --create-namespace --set config.name=nvidia-plugin-configs --set gfd.enabled=true
  3. Then I deploy a workload with the following command:
    $ cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      restartPolicy: Never
      containers:
        - name: cuda-container
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    EOF

Regarding your tip that the MPS processes communicate via /dev/shm: what do I need to do about this?

lengrongfu commented 2 weeks ago

@elezar Do you need any more information?

elezar commented 1 week ago

Sorry for the delay, @lengrongfu. Since you're using the GPU Operator to install the other components of the NVIDIA Container Stack, can you confirm that it isn't also managing the device plugin? Which pods are running in the GPU Operator namespace?

Also, to rule out any issues in rc.2, could you deploy the v0.15.0 version of the device plugin that was released last week?

It would also be good to confirm that the workload container can properly access the MPS control daemon with the correct settings. Here, I would recommend updating the command to sleep 9999, then exec'ing into the container and running:

echo get_default_active_thread_percentage | mps-control-daemon

This should give 10 in your case.

lengrongfu commented 1 week ago

Thank you for your reply.

(screenshots attached)
lengrongfu commented 1 week ago

I see that mps-control-daemon has a log message saying User did not send valid credentials; will this have any impact?

Running nvidia-smi inside the nvidia-driver-daemonset pod shows that an nvidia-cuda-mps-server process is using the GPU device.

The GPU compute mode is Exclusive_Process:

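For reference, the compute mode can be queried directly, for example:

$ nvidia-smi --query-gpu=compute_mode --format=csv,noheader
Exclusive_Process
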
elezar commented 1 week ago

Could you run:

echo get_default_active_thread_percentage | nvidia-cuda-mps-control

in a workload container. For example, the following one:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      command: ["bash", "-c"]
      args: ["nvidia-smi -L; sleep 9999"]
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
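
Once that pod is running, the check can be done along the lines of:

$ kubectl exec -it gpu-pod -- bash -c 'echo get_default_active_thread_percentage | nvidia-cuda-mps-control'
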
lengrongfu commented 1 week ago

echo get_default_active_thread_percentage | mps-control-daemon

mps-control-daemon: command not found

elezar commented 1 week ago

Sorry it should be echo get_default_active_thread_percentage | nvidia-cuda-mps-control. A typo from my side.

lengrongfu commented 1 week ago

The return value is 10.0.

elezar commented 1 week ago

Just as a sanity check, could you confirm that running nvidia-smi produces the same output as in the driver container?

Looking through the configs again: since the GPU Operator is being used to configure the toolkit and the driver, I would expect the nvidiaDriverRoot for the device plugin to be set to /run/nvidia/driver and not:

    "nvidiaDriverRoot": "/",

as is shown in your config.

Could you update the device plugin deployment with --set nvidiaDriverRoot=/run/nvidia/driver?
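
Based on your original install command, that would be roughly:

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.15.0-rc.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.name=nvidia-plugin-configs \
    --set gfd.enabled=true \
    --set nvidiaDriverRoot=/run/nvidia/driver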

lengrongfu commented 1 week ago

I used Helm to update the nvidiaDriverRoot field and added a volume to the device-plugin pod; the pod then starts successfully, but gpu-pod still fails with the error.

lengrongfu commented 1 week ago

Maybe it has something to do with the Tesla P40.

robrakaric commented 3 days ago

Maybe it has something to do with the Tesla P40.

I'm running into the same issue on a GTX1070 w/ the same driver version as you. I wonder if a driver update would help.