@southquist can you share how you modified the scheduler configuration on k3s? Which version are you using, and which command did you use to deploy the cluster?
Hi @RotemEmergi, I'm following the instructions from https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md, adjusting them for RKE2 where needed. The only thing that differs is how I went about modifying the scheduler policy config (detailed below); the rest follows the how-to.
The RKE2 server supports the settings 'kube-scheduler-extra-mount:' and 'kube-scheduler-arg:' in its configuration file.
Source: https://docs.rke2.io/install/install_options/server_config/
I used these to mount the scheduler policy config file into the kube-scheduler pod and to point the scheduler at it.
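A minimal sketch of what that can look like in /etc/rancher/rke2/config.yaml; the policy file path here is an assumption, so adjust it to wherever you keep yours:

```yaml
# /etc/rancher/rke2/config.yaml (sketch; the path is illustrative)
# mount the policy file from the host into the kube-scheduler static pod
kube-scheduler-extra-mount:
  - "/etc/kubernetes/scheduler-policy-config.json:/etc/kubernetes/scheduler-policy-config.json"
# and point the scheduler at it
kube-scheduler-arg:
  - "policy-config-file=/etc/kubernetes/scheduler-policy-config.json"
```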
The kube-scheduler starts up with no issues after that.
As for the version of k3s: I assume RKE2 and k3s versioning go hand in hand, so that would be v1.21.6.
There's no single command we use to deploy the cluster; installation is outlined in the RKE2 docs: https://docs.rke2.io/install/quickstart/
After installation, the kubeconfig can be found at /etc/rancher/rke2/rke2.yaml on the master nodes.
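For example, from a master node (assuming kubectl is on your PATH):

```sh
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
kubectl get nodes
```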
@southquist
I used this command line to deploy the k3s cluster:

```sh
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='v1.21.8+k3s2' sh -s - \
    --disable traefik \
    --node-label container-runtime=nvidia \
    --node-label nvidia.com/gpu=true \
    --kubelet-arg container-log-max-files=4 \
    --kubelet-arg container-log-max-size=15Mi \
    --kube-apiserver-arg service-node-port-range=27016-27027 \
    --flannel-backend host-gw \
    --cluster-cidr 10.244.0.0/16
```
I can see that we can add the --kube-scheduler-arg flag, per this doc: https://rancher.com/docs/k3s/latest/en/installation/install-options/server-config/
Looks like the same thing, no?
@RotemEmergi
Okay.
Yep, that's the same option I set. You probably need to add something like this to the list:

```sh
--kube-scheduler-arg "--policy-config-file=/path/to/scheduler-policy-config.json"
```
And of course /path/to/scheduler-policy-config.json needs to be reachable from wherever the kube-scheduler process is running.
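If it helps, here is how that could look folded into your install command. The policy file path is an assumption; since k3s runs kube-scheduler inside the k3s server process, the file just needs to exist on the server node itself:

```sh
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='v1.21.8+k3s2' sh -s - \
    --disable traefik \
    --node-label container-runtime=nvidia \
    --node-label nvidia.com/gpu=true \
    --kubelet-arg container-log-max-files=4 \
    --kubelet-arg container-log-max-size=15Mi \
    --kube-apiserver-arg service-node-port-range=27016-27027 \
    --flannel-backend host-gw \
    --cluster-cidr 10.244.0.0/16 \
    --kube-scheduler-arg "policy-config-file=/etc/kubernetes/scheduler-policy-config.json"
```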
After a lot of trial and error it turned out to be the containerd configuration. This is the containerd config.toml.tmpl I settled on. Now I'm able to get the device plugin pod started.
```toml
[plugins.opt]
  path = "/var/lib/rancher/rke2/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  sandbox_image = "index.docker.io/rancher/pause:3.5"

  [plugins.cri.containerd]
    disable_snapshot_annotations = true
    snapshotter = "overlayfs"

    [plugins.cri.containerd.runtimes.runc]
      runtime_type = "io.containerd.runtime.v1.linux"

[plugins.linux]
  runtime = "nvidia-container-runtime"
```
I have received it!!!!
Hey everyone, I'm trying out the gpushare-scheduler-extender on an RKE2 cluster. I've gone through all the steps, but it fails on the last one: getting the device-plugin pod to start.
My setup.
My containerd config.toml.
I can successfully start a container directly in containerd using the ctr command, and run nvidia-smi.
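For reference, the smoke test was along these lines; the socket path and image tag here are assumptions:

```sh
ctr -a /run/k3s/containerd/containerd.sock image pull docker.io/nvidia/cuda:11.0-base
ctr -a /run/k3s/containerd/containerd.sock run --rm --gpus 0 -t \
    docker.io/nvidia/cuda:11.0-base gpu-smoke-test nvidia-smi
```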
But trying to start the gpushare-device-plugin pod, a cuda pod, or a tensor pod through Kubernetes fails with the error below, no matter what command I try to run inside the pod.
I can, however, start an ordinary Ubuntu pod on the GPU node without issue. Anyone have any ideas on what the problem might be, or where one should start troubleshooting?