AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0
1.36k stars 303 forks

Integration with Rancher #163

Open 201508876PMH opened 2 years ago

201508876PMH commented 2 years ago

Hello guys, I'm having a hard time finding the kube-scheduler.yaml file, which I need to modify. People say it should be under the manifests folder, but I don't have one.

As the image shows, I have a hanging pod. Any suggestions would be much appreciated.

image

wsxiaozhang commented 2 years ago

@201508876PMH Please refer to https://github.com/rancher/rke/issues/1841 to configure your RKE scheduler. For the hanging gpushare-sched-extender pod, could you please describe that pod and share more of its logs?

201508876PMH commented 2 years ago

Thank you for answering me! :) I had a look at the closed issue and found it somewhat helpful, but I'm having trouble finding the specific .yaml file to configure.

image

And to answer your question, I believe the pod hangs because I haven't modified the scheduler configuration yet :)

201508876PMH commented 2 years ago

@wsxiaozhang Update! I managed to edit the cluster YAML file. Inspecting the container, I can see that the added arguments are correct:

image

Instead of mounting a new folder, I just put the file into the already mounted /etc/kubernetes/ssl folder. However, my pod still hangs! :(
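
For readers following along, the cluster YAML edit described above might look roughly like the sketch below. This is an assumption, not the poster's actual file: the screenshot is not available, the file name scheduler-policy-config.json is hypothetical, and the commented-out extra_binds path is only an illustration of the "mount a new folder" alternative. The policy-config-file argument applies only to Kubernetes versions that still accept that flag.

```yaml
# RKE cluster configuration (sketch, not the poster's real file)
services:
  scheduler:
    extra_args:
      # Hypothetical file name, placed in the directory that RKE
      # already mounts into the kube-scheduler container.
      policy-config-file: /etc/kubernetes/ssl/scheduler-policy-config.json
    # Alternative (assumed path): mount a separate host directory instead
    # of reusing /etc/kubernetes/ssl.
    # extra_binds:
    #   - "/etc/kubernetes/gpushare:/etc/kubernetes/gpushare"
```

Reusing the existing mount avoids an extra bind, at the cost of mixing a scheduler policy file into the certificate directory.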

631068264 commented 2 years ago

I get this error: Error: unknown flag: --policy-config-file

southquist commented 2 years ago

@631068264 What version of Kubernetes are you running? I believe --policy-config-file was removed in k8s 1.23. This might help: https://github.com/AliyunContainerService/gpushare-scheduler-extender/issues/166

1003111014 commented 2 years ago

How did you modify the parameters of this container? My Rancher 2.5 installation does not offer an entry point for editing the YAML.

631068264 commented 1 year ago

@southquist I used this:

services:
    scheduler:
      extra_args:
        config: /etc/kubernetes/scheduler-policy-config.yaml

but kube-scheduler can't start. I get this error:

{"log":"I0202 02:01:24.426117       1 requestheader_controller.go:244] Loaded a new request header values for RequestHeaderAuthRequestController\n","stream":"stderr","time":"2023-02-02T02:01:24.426433785Z"}
{"log":"E0202 02:01:24.427547       1 run.go:74] \"command failed\" err=\"stat /etc/kubernetes/scheduler.conf: no such file or directory\"\n","stream":"stderr","time":"2023-02-02T02:01:24.427672696Z"}

631068264 commented 1 year ago

Oh, changing /etc/kubernetes/scheduler.conf to /etc/kubernetes/ssl/kubecfg-kube-scheduler.yaml in /etc/kubernetes/scheduler-policy-config.yaml fixes it.
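
As a rough illustration of that fix, /etc/kubernetes/scheduler-policy-config.yaml on an RKE node might look like the sketch below. Only the two kubeconfig paths come from this thread. The apiVersion, the extender URL and port, and the resource name are assumptions modeled on the project's sample configuration, so check them against your Kubernetes version and your extender deployment.

```yaml
# Sketch of a KubeSchedulerConfiguration for an RKE node.
# apiVersion is an assumption; it varies by Kubernetes release.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  # RKE does not create /etc/kubernetes/scheduler.conf, so point at the
  # kubeconfig that RKE generates for the scheduler instead.
  kubeconfig: /etc/kubernetes/ssl/kubecfg-kube-scheduler.yaml
extenders:
  # Assumed extender endpoint; adjust to where gpushare-sched-extender listens.
  - urlPrefix: "http://127.0.0.1:32766/gpushare-scheduler"
    filterVerb: filter
    bindVerb: bind
    enableHTTPS: false
    nodeCacheCapable: true
    managedResources:
      - name: aliyun.com/gpu-mem
        ignoredByScheduler: false
    ignorable: false
```

The file is then passed to kube-scheduler through the config entry under extra_args, as in the snippet earlier in the thread.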