AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Support in EKS [Help] #145

Open pen-pal opened 3 years ago

pen-pal commented 3 years ago

Hi,

I tried to set this up in my EKS cluster, but the pods are stuck in the Pending state instead of running as expected:

gpushare-installer-5s56q                  0/1     Pending   0          10h
gpushare-installer-mxhgk                  0/1     Pending   0          10h
gpushare-installer-n9k6z                  0/1     Pending   0          10h
gpushare-schd-extender-846977f446-s9bxh   0/1     Pending   0          10h

Describing the pod gpushare-installer-5s56q:

  Warning  FailedScheduling  39s (x67 over 10h)  default-scheduler  0/7 nodes are available: 7 node(s) didn't match node selector.

Describing the pod gpushare-schd-extender-846977f446-s9bxh:

 Warning  FailedScheduling  4m21s (x67 over 10h)  default-scheduler  0/7 nodes are available: 7 node(s) didn't match node selector.   

As per the documentation, and going through the files ./templates/schd-config-job.yaml and ./templates/gpushare-extender-deployment.yaml, I need to set the label node-role.kubernetes.io/master: "" on a node to satisfy the node selector.
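For reference, I understand applying such a label would look something like this (the node name is a placeholder):

kubectl label node <node-name> node-role.kubernetes.io/master=""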

Also, the step-by-step guide https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md asks me to update the kube-scheduler configuration.

On EKS, I am not sure where or how I can apply this configuration, or on which nodes it should go.

Guidance would be much appreciated.

happy2048 commented 3 years ago

The GPU Share Scheduler Extender needs to change the scheduler configuration, and the gpushare-installer* pods are what apply that change. The scheduler is usually hosted on the master nodes, so the pods are Pending because no master node was found in this cluster. You can confirm with the following command:

$ kubectl get nodes
NAME                       STATUS   ROLES    AGE   VERSION
cn-beijing.192.168.8.44    Ready    <none>   12d   v1.16.9-aliyun.1
cn-beijing.192.168.8.45    Ready    <none>   12d   v1.16.9-aliyun.1
cn-beijing.192.168.8.46    Ready    <none>   12d   v1.16.9-aliyun.1
cn-beijing.192.168.9.159   Ready    master   12d   v1.16.9-aliyun.1
cn-beijing.192.168.9.160   Ready    master   12d   v1.16.9-aliyun.1
cn-beijing.192.168.9.161   Ready    master   12d   v1.16.9-aliyun.1

As you can see, this cluster has nodes whose role is "master". If you cannot find any master nodes, they are probably hosted in another cluster; in Alibaba Cloud we call a cluster whose master nodes are hosted elsewhere a Managed Kubernetes Cluster.

You can ask EKS support how to enable a scheduler extender configuration for the scheduler.

fernandocamargoai commented 3 years ago

@M-A-N-I-S-H-K, did you manage to solve it? I have the same issue with AKS.

awoimbee commented 3 years ago

Do the pods really need access to the master nodes? It looks like a scheduler extender should work on EKS without access to a master node; this project does it: https://github.com/marccampbell/graviton-scheduler-extender

animesh-agarwal commented 3 years ago

The scheduler can be deployed as a separate scheduler instead of modifying the default scheduler as done in https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/config/kube-scheduler.yaml#L18.

Instead of adding the config file to the master node, specify the scheduler configuration using a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpushare-schd-extender-config
  namespace: kube-system
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1alpha1
    kind: KubeSchedulerConfiguration
    algorithmSource:
      policy:
        configMap:
          namespace: kube-system
          name: gpushare-schd-extender-policy
    leaderElection:
      leaderElect: true
      lockObjectName: gpushare-schd-extender
      lockObjectNamespace: kube-system

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpushare-schd-extender-policy
  namespace: kube-system
data:
 policy.cfg : |
  {
    "kind" : "Policy",
    "apiVersion" : "v1",
    "extenders" : [{
      "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
      "filterVerb": "filter",
      "bindVerb": "bind",
      "enableHttps": false,
      "nodeCacheCapable": true,
      "managedResources": [
        {
          "name": "aliyun.com/gpu-mem",
          "ignoredByScheduler": false
        }
      ],
      "ignorable": false
    }],
    "hardPodAffinitySymmetricWeight" : 10
  }

Mount the ConfigMap as a volume and deploy the new scheduler:

spec:
  volumes:
  - name: gpushare-schd-extender-config
    configMap:
      name: gpushare-schd-extender-config
  containers:
  - name: connector
    image: gcr.io/google-containers/kube-scheduler:v1.18.0
    args:
    - kube-scheduler
    - --config=/gpushare-schd-extender/config.yaml
    volumeMounts:
    - name: gpushare-schd-extender-config
      mountPath: /gpushare-schd-extender

Finally, specify the new scheduler in the pod manifest

pod.schedulerName: gpushare-schd-extender
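For example, a minimal pod spec that targets the new scheduler could look like the sketch below (the image and the resource amount are placeholders; the schedulerName must match the name the deployed scheduler registers under):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-test              # hypothetical example pod
spec:
  schedulerName: gpushare-schd-extender
  containers:
  - name: main
    image: <your-gpu-image>         # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 2       # GPU memory resource handled by the extender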

2811299 commented 3 years ago

Hi @animesh-agarwal, thank you very much for your reply and suggestion, it helps a lot. But I still cannot get your method to work. Where should we define the ConfigMap? Is it at the path /gpushare-schd-extender/config.yaml in your example? And where exactly should we set pod.schedulerName: gpushare-schd-extender?

animesh-agarwal commented 3 years ago

@2811299
Please find below the complete manifest to add a new scheduler. Please note that I have used the cluster-admin cluster role for simplicity; you may choose to create a more specific role.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpushare-schd-extender
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpushare-schd-extender-kube-scheduler
subjects:
- kind: ServiceAccount
  name: gpushare-schd-extender
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpushare-schd-extender-as-volume-scheduler
subjects:
- kind: ServiceAccount
  name: gpushare-schd-extender
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpushare-schd-extender-config
  namespace: kube-system
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1alpha1
    kind: KubeSchedulerConfiguration
    algorithmSource:
      policy:
        configMap:
          namespace: kube-system
          name: gpushare-schd-extender-policy
    leaderElection:
      leaderElect: true
      lockObjectName: gpushare-schd-extender
      lockObjectNamespace: kube-system

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpushare-schd-extender-policy
  namespace: kube-system
data:
 policy.cfg : |
  {
    "kind" : "Policy",
    "apiVersion" : "v1",
    "extenders" : [{
      "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
      "filterVerb": "filter",
      "bindVerb": "bind",
      "enableHttps": false,
      "nodeCacheCapable": true,
      "managedResources": [
        {
          "name": "aliyun.com/gpu-mem",
          "ignoredByScheduler": false
        }
      ],
      "ignorable": false
    }],
    "hardPodAffinitySymmetricWeight" : 10
  }

---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: gpushare-schd-extender
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
        app: gpushare
        component: gpushare-schd-extender
  template:
    metadata:
      labels:
        app: gpushare
        component: gpushare-schd-extender
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      serviceAccountName: gpushare-schd-extender
      volumes:
      - name: gpushare-schd-extender-config
        configMap:
          name: gpushare-schd-extender-config
      containers:
        - name: gpushare-schd-extender
          image: registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-schd-extender:1.11-d170d8a
          env:
          - name: LOG_LEVEL
            value: debug
          - name: PORT
            value: "12345"
        - name: connector
          image: gcr.io/google-containers/kube-scheduler:v1.18.0
          args:
          - kube-scheduler
          - --config=/gpushare-schd-extender/config.yaml
          volumeMounts:
          - name: gpushare-schd-extender-config
            mountPath: /gpushare-schd-extender

# service.yaml            
---
apiVersion: v1
kind: Service
metadata:
  name: gpushare-schd-extender
  namespace: kube-system
  labels:
    app: gpushare
    component: gpushare-schd-extender
spec:
  type: NodePort
  ports:
  - port: 12345
    name: http
    targetPort: 12345
    nodePort: 32766
  selector:
    # select the gpushare-schd-extender pods
    app: gpushare
    component: gpushare-schd-extender

Please note that the scheduler will be created inside the kube-system namespace. You can verify that the scheduler pod is running using kubectl get pods --namespace=kube-system
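For example, filtering by the labels defined in the Deployment above (just a convenience, not required):

kubectl get pods --namespace=kube-system -l app=gpushare,component=gpushare-schd-extender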

Please follow this to understand how to use the newly deployed scheduler in your pods

2811299 commented 3 years ago

@fernandocamargoai Hi, does this method work for you on AKS?

fernandocamargoai commented 3 years ago

> @fernandocamargoai Hi, does this method work for you on AKS?

I'm not actively working on that project anymore, but I sent them the link to this issue for them to try it in the future. When they try it and let me know, I'll comment here.

2811299 commented 3 years ago

Is anyone able to verify that animesh's method works for AKS?

michaelmohamed commented 2 years ago

> @2811299 Please find below the complete manifest to add a new scheduler. [...full manifest as posted by @animesh-agarwal above...]

Confirmed this works in EKS.

mariusehr1 commented 2 years ago

hello @mm-e1

I tried using the YAML you mentioned on EKS, with the default plugin deployed, without any success.

I tried port 32766 without any luck, then switched over to 12345.

With the custom scheduler set in my pods, they would stay in the Pending state forever.

Could you give me a bit more detail on how you proceeded with the installation?

Thanks Marius

amybachir commented 2 years ago

@mariusehr1 The scheduler extender worked for me. Did you prep your nodes correctly by labeling them with gpushare=true?
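For reference, the labeling step looks roughly like this (the node name is a placeholder):

kubectl label node <your-node-name> gpushare=true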

suchisur commented 1 year ago

What is the name of the scheduler that gets created, @animesh-agarwal? Is there anywhere other than the pod manifest where I need to mention it? My pods stay in the Pending state and do not come up when I set schedulerName: gpushare-schd-extender.

suchisur commented 1 year ago

@animesh-agarwal Since Kubernetes v1.24, scheduling policies have been removed and are no longer supported; scheduler configurations should be used instead. Hence the configuration you provided is not working. Can you please help me set this up on Kubernetes v1.23+? I have tried using the new KubeSchedulerConfiguration by editing the ConfigMap. The image has changed as well, and the pods do not come up. Any help would be appreciated.
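For anyone hitting the same wall, here is a rough sketch of how the old Policy JSON might translate into the extenders section of the newer KubeSchedulerConfiguration. The apiVersion depends on the cluster version (v1beta2/v1beta3 on 1.23-1.24, v1 on 1.25+), and the in-cluster service URL is an assumption based on the Service defined above:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
profiles:
- schedulerName: gpushare-scheduler
extenders:
- urlPrefix: "http://gpushare-schd-extender.kube-system:12345/gpushare-scheduler"   # assumed in-cluster service address
  filterVerb: filter
  bindVerb: bind
  enableHTTPS: false
  nodeCacheCapable: true
  managedResources:
  - name: aliyun.com/gpu-mem
    ignoredByScheduler: false
  ignorable: false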

YuuinIH commented 1 year ago

Hi! I have successfully deployed gpushare-scheduler-extender on Kubernetes v1.23 or above in EKS. I have published the detailed steps here, hoping they will be helpful to you!

1. Deploy GPU share scheduler extender

kubectl create -f https://gist.githubusercontent.com/YuuinIH/71b025b7e63291e6a7d5f3cc43e76805/raw/a1e530e03cc985891a33e8fc2ed2f26307061b0b/gpushare-schd-extender.yaml

2. Deploy GPU share scheduler

kubectl create -f https://gist.githubusercontent.com/YuuinIH/71b025b7e63291e6a7d5f3cc43e76805/raw/2c5d874b6061e0497274779ab59ac2c240c4817a/gpushare-scheduler.yaml

3. Update the system:kube-scheduler cluster role to enable scheduler leader election

According to https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/#define-a-kubernetes-deployment-for-the-scheduler

kubectl edit clusterrole system:kube-scheduler

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:kube-scheduler
rules:
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs:
      - create
  - apiGroups:
      - coordination.k8s.io
    resourceNames:
      - kube-scheduler
      - gpushare-scheduler
    resources:
      - leases
    verbs:
      - get
      - update
  - apiGroups:
      - ""
    resourceNames:
      - kube-scheduler
      - gpushare-scheduler
    resources:
      - endpoints
    verbs:
      - delete
      - get
      - patch
      - update

4. Deploy device plugins

This step is the same as in the official guide.

kubectl delete ds -n kube-system nvidia-device-plugin-daemonset
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml

5. After that, run a pod that specifies the new scheduler.

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-share-sample
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: gpu-share-sample
    spec:
      schedulerName: gpushare-scheduler  #important!!!!!
      containers:
      - name: gpu-share-sample
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            aliyun.com/gpu-mem: 3
        workingDir: /root
      restartPolicy: Never

Then, run the inspector to show the GPU memory usage:

❯ kubectl inspect cgpu
NAME                                               IPADDRESS       GPU0(Allocated/Total)  GPU Memory(GiB)
ip-192-168-80-151.cn-northwest-1.compute.internal  192.168.80.151  0/15                   0/15
ip-192-168-87-86.cn-northwest-1.compute.internal   192.168.87.86   3/15                   3/15
-----------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
3/30 (10%)
❯ kubectl logs gpu-share-sample-vrpsj --tail 1
2023-03-23 09:51:02.301985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)