AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

gpushare-device-plugin pod fails to start #173

Closed southquist closed 2 years ago

southquist commented 2 years ago

Hey everyone, I'm trying out the gpushare-scheduler-extender on an RKE2 cluster. I've gone through all the steps:

But I fail on the last one: getting the device-plugin pod to start.

My setup.

My containerd config.toml.

version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    enable_selinux = false
    sandbox_image = "index.docker.io/rancher/pause:3.2"
    stream_server_address = "127.0.0.1"
    stream_server_port = "10010"

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      disable_snapshot_annotations = true
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

  [plugins."io.containerd.internal.v1.opt"]
    path = "/data/rancher/rke2/agent/containerd"

I can successfully start a container directly in containerd using the ctr command, and run nvidia-smi.

# ctr -a /run/k3s/containerd/containerd.sock run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi
Mon Mar 14 07:33:43 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   29C    P0    58W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But trying to start the gpushare-device-plugin pod, a CUDA pod, or a TensorFlow pod through Kubernetes fails with the error below, no matter what command I try to run inside the pod.

Warning Failed 13s (x2 over 14s) kubelet Error: failed to create containerd task: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init
caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver error: failed to process request: unknown

I can, however, start an ordinary Ubuntu pod on the GPU node without issue. Does anyone have any ideas on what the problem might be, or where one should start troubleshooting?

RotemAmergi commented 2 years ago

@southquist Can you share how you modified the scheduler configuration on k3s? Which version are you using? And how did you deploy the cluster, with which command?

southquist commented 2 years ago

Hi @RotemEmergi, I'm following the instructions from https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md, adjusting them for RKE2 where needed. The only thing that differs is how I went about modifying the scheduler policy config (detailed below); the rest follows the how-to.

The RKE2 server supports the settings 'kube-scheduler-extra-mount:' and 'kube-scheduler-arg:' in its configuration file.

Source: https://docs.rke2.io/install/install_options/server_config/

I used these to:

  1. Mount the scheduler-policy-config.json file into the kube-scheduler pod.
  2. Use 'kube-scheduler-arg' to set '--policy-config-file' and point it to the location of scheduler-policy-config.json inside the pod.

The kube-scheduler starts up with no issues after that.
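
For reference, the relevant part of /etc/rancher/rke2/config.yaml would look roughly like this (the host path here is just an example, use wherever you keep the policy file; RKE2 extra mounts are given as "host-path:container-path"):

# /etc/rancher/rke2/config.yaml (sketch; paths are examples)
kube-scheduler-extra-mount:
  - "/etc/kubernetes/scheduler-policy-config.json:/etc/kubernetes/scheduler-policy-config.json"
kube-scheduler-arg:
  - "policy-config-file=/etc/kubernetes/scheduler-policy-config.json"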

As for the k3s version: I assume the versioning of RKE2 and k3s go hand in hand, so that would be v1.21.6.

There's no single command we use to deploy the cluster; installation is outlined in the RKE2 docs: https://docs.rke2.io/install/quickstart/

After installation, the kubeconfig can be found at /etc/rancher/rke2/rke2.yaml on the master nodes.
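
For example, to use it directly on a master node:

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
kubectl get nodes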

RotemAmergi commented 2 years ago

@southquist

I used this command line to deploy the k3s cluster:

  curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='v1.21.8+k3s2' sh -s - \
  --disable traefik \
  --node-label container-runtime=nvidia \
  --node-label nvidia.com/gpu=true \
  --kubelet-arg container-log-max-files=4 \
  --kubelet-arg container-log-max-size=15Mi \
  --kube-apiserver-arg service-node-port-range=27016-27027 \
  --flannel-backend host-gw \
  --cluster-cidr 10.244.0.0/16

I can see that we can add the --kube-scheduler-arg flag, per this doc: https://rancher.com/docs/k3s/latest/en/installation/install-options/server-config/

Looks like the same option, no?

southquist commented 2 years ago

@RotemEmergi

Okay.

Yep, that is the same option I set. You probably need to add something like this to the list:

--kube-scheduler-arg "--policy-config-file=/path/to/scheduler-policy-config.json"

And of course /path/to/scheduler-policy-config.json needs to be reachable from wherever the kube-scheduler process is running.
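
Put together with your install command above, it would look something like this, keeping /path/to/scheduler-policy-config.json as a placeholder for wherever the file actually lives on the host (as far as I know the k3s scheduler runs inside the k3s process itself, so no extra mount should be needed, the file just has to exist on the host):

  curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='v1.21.8+k3s2' sh -s - \
  --disable traefik \
  --node-label container-runtime=nvidia \
  --node-label nvidia.com/gpu=true \
  --kubelet-arg container-log-max-files=4 \
  --kubelet-arg container-log-max-size=15Mi \
  --kube-apiserver-arg service-node-port-range=27016-27027 \
  --flannel-backend host-gw \
  --cluster-cidr 10.244.0.0/16 \
  --kube-scheduler-arg "--policy-config-file=/path/to/scheduler-policy-config.json"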

southquist commented 2 years ago

After a lot of trial and error, it turned out to be the containerd configuration. This is the containerd config.toml.tmpl I settled on; now I'm able to get the device plugin pod started.

[plugins.opt]
  path = "/var/lib/rancher/rke2/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  sandbox_image = "index.docker.io/rancher/pause:3.5"

[plugins.cri.containerd]
  disable_snapshot_annotations = true
  snapshotter = "overlayfs"

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runtime.v1.linux"

[plugins.linux]
  runtime = "nvidia-container-runtime"
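
In case it helps anyone else: a quick way to confirm the plugin registered after restarting the agent is to check for the aliyun.com/gpu-mem resource on the node (the node name is a placeholder, and the namespace may differ in your setup):

systemctl restart rke2-agent
kubectl -n kube-system get pods | grep gpushare
kubectl describe node <gpu-node-name> | grep aliyun.com/gpu-mem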

jamesislebron commented 1 year ago

I have received it!!!!