AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Back-off restarting failed container: gpushare-device-plugin-ds-xxxxx #210

Open JiangLingJun opened 1 year ago

JiangLingJun commented 1 year ago
Environmental Info:

k3s version v1.24.8+k3s1 (https://github.com/k3s-io/k3s/commit/648004e4faeaf9e8705386342e95ec9bd211c2b8)
go version go1.18.8

Node(s) CPU architecture, OS, and Version:

node202: x86_64, Ubuntu 18.04.6 LTS, kernel version 5.4.0-136-generic
node203: x86_64, Ubuntu 18.04.6 LTS, kernel version 5.4.0-136-generic
node204: aarch64, Ubuntu 20.04.5 LTS, kernel version 5.10.104-tegra

Cluster Configuration:

1 server, 2 agents.
server: node202 (192.168.1.202)
agents: node203 (192.168.1.203) and node204 (192.168.1.204)

Describe the bug:

The gpushare-schd-extender-xxx pod runs on node202, but the gpushare-device-plugin-ds-xxx pod fails to start on node204 and ends up in CrashLoopBackOff.

Steps To Reproduce:

Follow the installation guide at https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md

Configuration Info:

daemon.json in /etc/docker on node204

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "insecure-registries": ["192.168.1.229:5000"]
}

kube-scheduler.yaml in /etc/kubernetes/manifest on node202

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --config=/etc/kubernetes/scheduler-policy-config.yaml
    - --policy-config-file=/home/u18/k3sgpushare/scheduler-policy-config.yaml
    image: k8s.gcr.io/kube-scheduler:v1.23.3
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-policy-config.yaml
      name: scheduler-policy-config
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /etc/kubernetes/scheduler-policy-config.yaml
      type: FileOrCreate
    name: scheduler-policy-config
status: {}

scheduler-policy-config.yaml in /etc/kubernetes on node202

{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
      "filterVerb": "filter",
      "bindVerb":   "bind",
      "enableHttps": false,
      "nodeCacheCapable": true,
      "managedResources": [
        {
          "name": "aliyun.com/gpu-mem",
          "ignoredByScheduler": false
        }
      ],
      "ignorable": false
    }
  ]
}

kubectl-inspect-gpushare in /usr/bin on node202

u18@node202:/usr/bin$ ls -al /usr/bin/ | grep gpushare
-rwxrw-r--  1 u18  u18     37310113 12月  7  2021 kubectl-inspect-gpushare

The default GPU device plugin has been removed

u18@node202:/usr/bin$ kubectl get pod nvidia -n=kube-system
Error from server (NotFound): pods "nvidia" not found

node204 is labeled with "gpushare=true"

u18@node202:/usr/bin$ kubectl  get node --show-labels=true | grep gpushare=true
node204   Ready    <none>                 42d   v1.24.8+k3s1   beta.kubernetes.io/arch=arm64,beta.kubernetes.io/instance-type=k3s,beta.kubernetes.io/os=linux,egress.k3s.io/cluster=true,gpushare=true,kubernetes.io/arch=arm64,kubernetes.io/hostname=node204,kubernetes.io/os=linux,node.kubernetes.io/instance-type=k3s,nodeShareGPU=true,nvidia.com/node=true
Expected behavior:

Both the gpushare-schd-extender-xxx pod and the gpushare-device-plugin-ds-xxxxx pod run normally.

Actual behavior:
u18@node202:~/k3sgpushare$ k get pod -n=kube-system -owide | grep gpushare
gpushare-schd-extender-865f956968-vvlnr   1/1     Running            0              13m   192.168.1.202   node202   <none>           <none>
gpushare-device-plugin-ds-hgslb           0/1     CrashLoopBackOff   15 (65s ago)   52m   192.168.1.204   node204   <none>           <none>
Additional context / logs:
u18@node202:~/k3sgpushare$ k describe pod gpushare-device-plugin-ds -n=kube-system
Name:         gpushare-device-plugin-ds-fr49d
Namespace:    kube-system
Priority:     0
Node:         node204/192.168.1.204
Start Time:   Thu, 25 May 2023 17:16:48 +0800
Labels:       app=gpushare
              component=gpushare-device-plugin
              controller-revision-hash=7d7d6b77dd
              name=gpushare-device-plugin-ds
              pod-template-generation=1
Annotations:  scheduler.alpha.kubernetes.io/priorityClassName: system-cluster-critical
Status:       Running
IP:           192.168.1.204
IPs:
  IP:           192.168.1.204
Controlled By:  DaemonSet/gpushare-device-plugin-ds
Containers:
  gpushare:
    Container ID:  containerd://38eca65174bae8b74878074509ab3e69359558b326a8cfb61bbdcb4f59c66a73
    Image:         registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-aff8a23
    Image ID:      registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin@sha256:76769d69f5a5b24cbe117f8ac83a0ff7409fda6108ca982c8f3b8f763e016100
    Port:          <none>
    Host Port:     <none>
    Command:
      gpushare-device-plugin-v2
      -logtostderr
      --v=5
      --memory-unit=GiB//
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 25 May 2023 17:17:04 +0800
      Finished:     Thu, 25 May 2023 17:17:04 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 25 May 2023 17:16:49 +0800
      Finished:     Thu, 25 May 2023 17:16:49 +0800
    Ready:          False
    Restart Count:  2
    Limits:
      cpu:     1
      memory:  300Mi
    Requests:
      cpu:     1
      memory:  300Mi
    Environment:
      KUBECONFIG:  /etc/kubernetes/kubelet.conf
      NODE_NAME:    (v1:spec.nodeName)
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wdc25 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  kube-api-access-wdc25:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              gpushare=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  17s               default-scheduler  Successfully assigned kube-system/gpushare-device-plugin-ds-fr49d to node204
  Normal   Pulled     2s (x3 over 17s)  kubelet            Container image "registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-aff8a23" already present on machine
  Normal   Created    2s (x3 over 17s)  kubelet            Created container gpushare
  Normal   Started    2s (x3 over 17s)  kubelet            Started container gpushare
  Warning  BackOff    1s (x3 over 16s)  kubelet            Back-off restarting failed container
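
The events above only show that the container exits with code 1 immediately after starting; the actual error would be in the container's own log, which is not included here. A minimal sketch for collecting it, assuming kubectl access from node202 and that the log of the previous container instance is still retained on the node. The helper names `crashlooping_pods` and `previous_logs` are hypothetical and are not part of gpushare or kubectl:

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pod` output on stdin and print the
# names of pods whose STATUS column (third field) is CrashLoopBackOff.
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Hypothetical helper: for every crash-looping pod in the given namespace,
# print the log of the previous (already terminated) container instance.
previous_logs() {
  ns="$1"
  kubectl get pod -n "$ns" --no-headers | crashlooping_pods |
  while read -r pod; do
    echo "== $pod =="
    kubectl logs -n "$ns" "$pod" --previous
  done
}
```

Running `previous_logs kube-system` after sourcing the snippet should print whatever the device plugin wrote to stderr before it exited, which is usually more informative than the Back-off event alone.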
fenwuyaoji commented 1 year ago

I have received your email and will review and reply as soon as possible. Thank you. Wang Xin