Hi!
I have the same problem. After getting through the setup process, everything looked good. We wanted to start the nvidia-version-check pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-version-check
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: nvidia-version-check
      image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
      command: ["nvidia-smi"]
      resources:
        limits:
          # GiB
          aliyun.com/gpu-mem: 8
```
The following error occurred:
```
Warning  FailedScheduling  26s  default-scheduler  failed bind with extender at URL http://127.0.0.1:32766/gpushare-scheduler/bind, code 500
```
The device plugin and the schd-extender seem to be running fine:
```
$ kubectl get pods -n kube-system | grep gpushare
gpushare-device-plugin-ds-rntz8           1/1   Running   0   11m
gpushare-schd-extender-5d8d4c849f-szkqw   1/1   Running   0   41m
```
I'm using Kubernetes version 1.26.5.
It turned out to be very helpful to check the output of the gpushare-schd-extender:
```
[ warn ] 2023/06/21 18:11:36 gpushare-bind.go:36: Failed to handle pod jupyterlab-8c7544cbc-4m8pk in ns default due to error Pod "nvidia-version-check" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
  core.PodSpec{
    ... // 24 identical fields
    DNSConfig:      nil,
    ReadinessGates: nil,
-   RuntimeClassName: &"nvidia",
+   RuntimeClassName: nil,
    Overhead:           nil,
    EnableServiceLinks: &true,
    ... // 4 identical fields
  }
```
In our case the problem was that the scheduler extender (or another component) tried to change the RuntimeClassName back to nil, and the API server rejects such pod-spec mutations during binding (see the "Forbidden: pod updates may not change fields" error above). Since we did not want every pod to run in the nvidia runtime, we deliberately did not set it as the default runtime, but specified runtimeClassName only for the deployments that needed GPU access. This, however, does not seem to be compatible with the scheduler extender.
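If anyone hits the same thing: since the forbidden mutation is exactly the runtimeClassName field, one workaround is to drop that field from the GPU pods and instead make nvidia the default runtime at the container-runtime level (in containerd, `default_runtime_name = "nvidia"` under the CRI plugin section of config.toml). A minimal sketch of the pod from above without the field, assuming nvidia is then the node's default runtime:

```yaml
# Same pod as above, minus spec.runtimeClassName, so the extender's bind
# step no longer touches an immutable field. This assumes nvidia is the
# node's default runtime, which is exactly what we wanted to avoid
# cluster-wide, so treat it as a trade-off rather than a clean fix.
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-version-check
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-version-check
      image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
      command: ["nvidia-smi"]
      resources:
        limits:
          aliyun.com/gpu-mem: 8  # GiB
```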
@631068264 maybe you can also share the output of `kubectl logs -n kube-system gpushare-schd-extender-XXX`?
Finally, I followed https://github.com/kserve/kserve/issues/924 and used Volcano plus time-slicing GPUs in Kubernetes to solve GPU memory sharing instead.
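In case it helps others, time-slicing is configured through the NVIDIA device plugin's config file. A minimal sketch of such a config, assuming the plugin is deployed with time-slicing support; the replica count of 4 is an example value, not something from my setup:

```yaml
# NVIDIA k8s-device-plugin time-slicing config: each physical GPU is
# advertised as 4 schedulable nvidia.com/gpu replicas that share the
# device in time slices (no memory isolation between the replicas).
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

Note that time-slicing only multiplexes compute; unlike gpushare's aliyun.com/gpu-mem, it does not enforce per-pod GPU memory limits.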
These are my notes from the setup where it failed:
- using kubeflow 1.6.1
- k8s 1.22
- k8s-gpushare-schd-extender:1.11-d170d8a
- InferenceService error