AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Problems using a custom image with kubeflow 1.6.1 #199

Open 631068264 opened 1 year ago

631068264 commented 1 year ago

Using kubeflow 1.6.1
k8s 1.22

k8s-gpushare-schd-extender:1.11-d170d8a

InferenceService

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "firesmoke"
spec:
  predictor:
    containers:
      - name: kserve-container
        image: harbor.xxx.cn/library/model/firesmoke:v1
        env:
          - name: MODEL_NAME
            value: firesmoke
        resources:
          limits:
            aliyun.com/gpu-mem: 1

error

Status:
  Components:
    Predictor:
      Latest Created Revision:  firesmoke-predictor-default-00001
  Conditions:
    Last Transition Time:  2023-02-07T16:30:17Z
    Message:               Revision "firesmoke-predictor-default-00001" failed with message: binding rejected: failed bind with extender at URL http://127.0.0.1:32766/gpushare-scheduler/bind, code 500.
    Reason:                RevisionFailed
    Severity:              Info
    Status:                False
    Type:                  PredictorConfigurationReady
    Last Transition Time:  2023-02-07T16:30:17Z
    Message:               Configuration "firesmoke-predictor-default" does not have any ready Revision.
    Reason:                RevisionMissing
    Status:                False
    Type:                  PredictorReady
    Last Transition Time:  2023-02-07T16:30:17Z
    Message:               Configuration "firesmoke-predictor-default" does not have any ready Revision.
    Reason:                RevisionMissing
    Severity:              Info
    Status:                False
    Type:                  PredictorRouteReady
    Last Transition Time:  2023-02-07T16:30:17Z
    Message:               Configuration "firesmoke-predictor-default" does not have any ready Revision.
    Reason:                RevisionMissing
    Status:                False
    Type:                  Ready
Events:                    <none>
kubectl get pods -n kube-system |grep gpushare
gpushare-device-plugin-ds-h9h9t              1/1     Running     0          22h
gpushare-schd-extender-6774756f54-xk7sw      1/1     Running     0          22h
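
One sanity check before digging into the 500 from the bind endpoint is to confirm that the extender actually sees shareable GPU memory on the nodes. A minimal sketch, assuming the kubectl-inspect-gpushare plugin shipped with this repo is installed and <gpu-node-name> is a placeholder:

# Per-node / per-GPU memory allocation as tracked by the gpushare extender
kubectl inspect gpushare

# Confirm the extended resource is registered on the node
kubectl describe node <gpu-node-name> | grep -i "aliyun.com/gpu-mem"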
Strice91 commented 1 year ago

Hi!

I have the same problem. After getting through the setup process everything looked good. We wanted to start the nvidia-version-check pod:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-version-check
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
  - name: nvidia-version-check
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    command: ["nvidia-smi"]
    resources:
      limits:
        # GiB
        aliyun.com/gpu-mem: 8
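
For context, runtimeClassName: nvidia refers to a RuntimeClass object. A typical definition when the NVIDIA container toolkit is installed (a sketch, not copied from our cluster) looks like this, with the handler matching the runtime name registered in containerd:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the runtime name configured in containerd/CRI-O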

The following error occurred:

Warning FailedScheduling 26s default-scheduler failed bind with extender at URL http://127.0.0.1:32766/gpushare-scheduler/bind, code 500

The device plugin and the schd-extender seem to be running fine:

kubectl get pods -n kube-system |grep gpushare
gpushare-device-plugin-ds-rntz8           1/1     Running   0               11m
gpushare-schd-extender-5d8d4c849f-szkqw   1/1     Running   0               41m

I'm using kubernetes version 1.26.5

Strice91 commented 1 year ago

It turned out to be very helpful to check the output of the gpushare-schd-extender:

[  warn ] 2023/06/21 18:11:36 gpushare-bind.go:36: Failed to handle pod jupyterlab-8c7544cbc-4m8pk in ns default due to error Pod "nvidia-version-check" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
  core.PodSpec{
        ... // 24 identical fields
        DNSConfig:          nil,
        ReadinessGates:     nil,
-       RuntimeClassName:   &"nvidia",
+       RuntimeClassName:   nil,
        Overhead:           nil,
        EnableServiceLinks: &true,
        ... // 4 identical fields
  }

In our case the problem was that the scheduler (or another component) tried to change the RuntimeClassName back to nil. Since we did not want every pod to run in the nvidia runtime, we deliberately did not set it as the default runtime and only specified it for deployments that needed GPU access. But this does not seem to be compatible with the scheduler extender.
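
A possible workaround, though it is the opposite trade-off to what we wanted: drop runtimeClassName from the GPU pods entirely and make the NVIDIA runtime the node-level default, so the extender's pod update no longer touches that field. A sketch using the NVIDIA container toolkit CLI (flags as I recall them from the toolkit docs, so double-check against your version):

# Run on the GPU node: make nvidia the default containerd runtime, then restart containerd
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd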

@631068264 maybe you can also share the output of: kubectl logs -n kube-system gpushare-schd-extender-XXX

631068264 commented 1 year ago

In the end I followed https://github.com/kserve/kserve/issues/924 and used Volcano together with Time-Slicing GPUs in Kubernetes to solve GPU memory sharing.

This is my note
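
For reference, time-slicing is configured through the NVIDIA device plugin's config file. A minimal sketch (illustrative only, not the exact configuration used here); with replicas: 4 each physical GPU is advertised as four schedulable nvidia.com/gpu resources:

version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4

Note that time-slicing only divides GPU time between pods; unlike aliyun.com/gpu-mem it does not enforce per-pod GPU memory limits.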