JolyonJian / DRS

A Deep Reinforcement Learning enhanced Kubernetes Scheduler for Microservice-based System

The status of 'drs-scheduler' is always pending. #1

Open XimuLyu opened 1 year ago

XimuLyu commented 1 year ago

After deploying the drs-scheduler pod, I found that its status stays Pending. Running kubectl get pods --all-namespaces gives:

NAMESPACE      NAME                            READY   STATUS    RESTARTS   AGE
...
kube-system    my-scheduler-fd7d7fc97-2krp5    0/1     Pending   0          14s

Then sudo kubectl describe pod my-scheduler-fd7d7fc97-2krp5 -n kube-system shows:

Name:           my-scheduler-fd7d7fc97-2krp5
Namespace:      kube-system
Priority:       0
Node:           <none>
Labels:         component=scheduler
                pod-template-hash=fd7d7fc97
                tier=control-plane
                version=second
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/my-scheduler-fd7d7fc97
Containers:
  kube-second-scheduler:
    Image:      jolyonjian/my-scheduler:1.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/local/bin/kube-scheduler
      --config=/etc/kubernetes/my-scheduler/my-scheduler-config.yaml
    Requests:
      cpu:        100m
    Liveness:     http-get https://:10259/healthz delay=15s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:10259/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/kubernetes/my-scheduler from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zxcgq (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      my-scheduler-config
    Optional:  false
  kube-api-access-zxcgq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              myscheduler=myscheduler
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  55s (x6 over 6m11s)  default-scheduler  0/5 nodes are available: 5 node(s) didn't match Pod's node affinity/selector.

Before this I had already made the master node schedulable, but the default-scheduler still seems unable to schedule this pod correctly.

JolyonJian commented 1 year ago

The command kubectl describe pod <pod-name> -n kube-system | grep -E '(Node-Selectors|Tolerations)' shows the pod's placement rules:

Node-Selectors:        myscheduler=myscheduler
Tolerations:           node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                       node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

The scheduling failure is most likely caused by the Node-Selectors setting. After removing the nodeSelector block from drs-scheduler.yaml and redeploying, the pod runs normally:

spec:
  serviceAccountName: my-scheduler
  nodeSelector:
    myscheduler: myscheduler

XimuLyu commented 1 year ago
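For context, nodeSelector is a hard requirement: the default scheduler only considers nodes whose labels contain every key/value pair in the pod's selector, which is why all 5 nodes were rejected in the events above. A minimal sketch of that matching rule (plain Python for illustration, not actual scheduler code; the node names are made up):

```python
def matches_node_selector(node_labels: dict, node_selector: dict) -> bool:
    """A pod's nodeSelector matches a node only if every key/value
    pair in the selector appears in the node's labels."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())

# The selector from drs-scheduler.yaml
selector = {"myscheduler": "myscheduler"}

# No node in the cluster carries this label, so no node matches:
nodes = {
    "master": {"kubernetes.io/hostname": "master"},
    "node1": {"kubernetes.io/hostname": "node1"},
}
print([n for n, labels in nodes.items()
       if matches_node_selector(labels, selector)])  # -> []

# Labeling a node (kubectl label node node1 myscheduler=myscheduler)
# would make it eligible again:
nodes["node1"]["myscheduler"] = "myscheduler"
print([n for n, labels in nodes.items()
       if matches_node_selector(labels, selector)])  # -> ['node1']
```

So instead of deleting the selector, labeling one node with myscheduler=myscheduler would also resolve the Pending state.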

Thank you for your reply! Following your hint, the drs-scheduler pod now runs successfully. After starting the drs-monitor and drs-scheduler parts (the Python code) on each node, I tried to deploy cpu.yaml with ./apply.sh cpu.yaml. Unfortunately, that pod's status is also stuck at Pending, and removing the nodeName setting from cpu.yaml did not help. Its placement rules look like this:

Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

I don't have much of a clue here.

Later I noticed that dqn.go hard-codes an HTTP URL:

schedulerurl := "http://192.168.1.113:1234/choose"

I did not modify this part or build my own image. Could this be the main reason the pod cannot be scheduled?
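One quick way to test that suspicion is to check whether the decision-maker URL is reachable at all: if the scheduler's request to /choose fails, it never gets a placement decision and the pod stays Pending. A self-contained sketch below, where a local stand-in server plays the role of dqn.py's /choose endpoint (the reply payload is invented; in a real cluster you would point the request at the actual address from dqn.go):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Stand-in for the decision maker's /choose endpoint; the JSON reply
# here is a made-up example, not the repo's actual protocol.
class ChooseHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"node": "node1"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), ChooseHandler)  # port 0 = any free port
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The reachability check itself: replace the URL with the address
# configured in dqn.go when probing a real deployment.
with urlopen(f"http://127.0.0.1:{port}/choose", timeout=2) as resp:
    status = resp.status
    data = json.loads(resp.read())

server.shutdown()
print(status, data)  # -> 200 {'node': 'node1'}
```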

JolyonJian commented 1 year ago

You can keep using kubectl describe pod <pod-name> to see the specific reason the scheduling failed, but I think your suspicion is correct.

The drs scheduler and the decision maker, as well as the decision maker and the monitor, communicate via IP addresses, so you need to change these to match your cluster configuration (this may involve dqn.go, dqn.py, and monitor.py). After modifying dqn.go you will need to build your own scheduler image; for the procedure, see https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/
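On the Python side, one way to avoid editing hard-coded addresses for every cluster is to read them from environment variables, which the Deployment manifest can then set per-cluster. This is a hypothetical refactor, not the repo's actual config scheme (the variable name and helper below are assumptions; dqn.go would need the equivalent change in Go):

```python
import os

# Hypothetical helper for dqn.py / monitor.py: resolve a peer address
# from the environment, falling back to a default for local testing.
def peer_url(env_var: str, default_host: str, path: str) -> str:
    host = os.environ.get(env_var, default_host)
    return f"http://{host}{path}"

# With DRS_SCHEDULER_ADDR unset, the fallback address is used:
print(peer_url("DRS_SCHEDULER_ADDR", "192.168.1.113:1234", "/choose"))
# -> http://192.168.1.113:1234/choose

# In a cluster, the Deployment would set the variable instead, e.g.
#   env:
#   - name: DRS_SCHEDULER_ADDR
#     value: "10.0.0.42:1234"
os.environ["DRS_SCHEDULER_ADDR"] = "10.0.0.42:1234"
print(peer_url("DRS_SCHEDULER_ADDR", "192.168.1.113:1234", "/choose"))
# -> http://10.0.0.42:1234/choose
```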

Also, you can add some logging to dqn.go; after deploying, kubectl logs <pod-name> will show the output, which makes it easier to see what the scheduler is doing and helps with debugging and testing.
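For the logging suggestion, writing to stdout is enough: kubectl logs simply collects whatever the container prints. A minimal Python sketch of the same idea for the monitor/decision-maker side (the record_decision hook and its fields are invented for illustration):

```python
import logging
import sys

# Log to stdout so `kubectl logs <pod-name>` picks everything up.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("drs")

def record_decision(pod: str, node: str, scores: dict) -> str:
    # Hypothetical call site: log each placement decision with its inputs.
    msg = f"scheduling {pod} -> {node} (scores={scores})"
    log.info(msg)
    return msg

record_decision("cpu-test", "node1", {"node1": 0.9, "node2": 0.4})
```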

chengleqi commented 11 months ago

@XimuLyu Hi, I'm also planning to reproduce this project; could we discuss it together? WeChat: chengleqi6g. @JolyonJian It would be great if the author could add me too.