Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

HAMi scheduler filtering issue #538

Closed. rajeeshckr closed this issue 1 month ago

rajeeshckr commented 1 month ago

Please provide an in-depth description of the question you have: I was experimenting with an NVIDIA g4dn.xlarge instance type, running the HAMi device plugin DaemonSet with these options set:

[screenshot: device plugin DaemonSet options]
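
(The screenshot isn't preserved here. For context, the device-plugin option that matters for this issue is the split count discussed later in the thread; below is a minimal sketch of how it typically appears in the DaemonSet container args, assuming a value of 3 as in the test further down. The container name and everything except the flag itself are placeholders, not the actual config.)

# Sketch only: --device-split-count is taken from the discussion below;
# the container name and structure here are placeholders.
containers:
  - name: hami-device-plugin          # placeholder name
    args:
      - --device-split-count=3        # allow up to 3 vGPU pods per physical GPU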

The scheduler extender config is:

[screenshot: scheduler extender config]
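
(That screenshot isn't preserved either. For reference, HAMi hooks into kube-scheduler as a scheduler extender, so the config generally has the following shape; every value below is an illustrative placeholder rather than the actual config from the screenshot.)

# Illustrative shape of a kube-scheduler extender entry pointing at the
# hami-scheduler service; URLs, timeouts and resource names are placeholders.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: https://hami-scheduler.kube-system:443   # placeholder service address
    filterVerb: filter          # node filtering is delegated to the extender
    bindVerb: bind
    enableHTTPS: true
    tlsConfig:
      insecure: true
    nodeCacheCapable: true
    weight: 1
    managedResources:
      - name: nvidia.com/gpu    # must match the resource name the device plugin advertises
        ignoredByScheduler: true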

I expected 3 GPU pods to get scheduled on the 1-GPU node. The 4th one failing is expected, but my issue is that the scheduler still lets the node whose vGPU devices are already exhausted pass its filter, and we then get an error from the kubelet.

[screenshot: kubelet error]

I was expecting hami-scheduler to filter this node out and leave the pod in the Pending state.

Here are the resources listed on the node.

[screenshot: node resource list]

Is there something we need to do to get that behavior? Thanks!

What do you think about this question?:

Environment:

rajeeshckr commented 1 month ago

@wawa0210 tagging you on this since you seem to have been replying to some of the questions recently. Could you please help me with this when you get a chance? 🙇🏼

wawa0210 commented 1 month ago

> @wawa0210 tagging you on this since you seem to have been replying to some of the questions recently. Could you please help me with this when you get a chance? 🙇🏼

Do you mean that in this scenario, if HAMi schedules the fourth pod, you expect the status to be Pending instead of ContainerStatusUnknown?

Because there is no node available for scheduling this pod at that point, the ip-172-30.xxxx-us-west-2-compute-internal node should be filtered out directly and should not survive the filter step.


Based on the test setup you described (single node, single GPU, --device-split-count=3), I created four pods, and in the end one pod was Pending. Is this within expectations?

[root@controller-node-1 ~]# kubectl get po -o wide| grep 'test-vgpu'
test-vgpu-5b87958dd-2t7xg                      1/1     Running       0                18m   10.233.74.118   controller-node-1   <none>           <none>
test-vgpu-5b87958dd-btd9q                      1/1     Running       0                18m   10.233.74.116   controller-node-1   <none>           <none>
test-vgpu-5b87958dd-cllkw                      0/1     Pending       0                96s   <none>          <none>              <none>           <none>
test-vgpu-5b87958dd-v6grx                      1/1     Running       0                18m   10.233.74.87    controller-node-1   <none>           <none>

Describe the pending pod:

[root@controller-node-1 ~]# kubectl describe po test-vgpu-5b87958dd-cllkw
Name:             test-vgpu-5b87958dd-cllkw
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app=test-vgpu
                  pod-template-hash=5b87958dd
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/test-vgpu-5b87958dd
Containers:
  container-1:
    Image:      docker.samzong.me/chrstnhntschl/gpu_burn
    Port:       <none>
    Host Port:  <none>
    Args:
      4000
    Limits:
      cpu:                  250m
      memory:               512Mi
      nvidia.com/gpucores:  10
      nvidia.com/gpumem:    2k
      nvidia.com/vgpu:      1
    Requests:
      cpu:                  250m
      memory:               512Mi
      nvidia.com/gpucores:  10
      nvidia.com/gpumem:    2k
      nvidia.com/vgpu:      1
    Environment:
      CUDA_TASK_PRIORITY:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8f2xq (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-8f2xq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From            Message
  ----     ------            ----   ----            -------
  Warning  FailedScheduling  4m28s  hami-scheduler  0/2 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling.
  Warning  FilteringFailed   4m28s  hami-scheduler  no available node, all node scores do not meet
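
(For reference, the pending pod above corresponds roughly to a Deployment of the following shape. This is a sketch reconstructed from the describe output and the test description, with replicas set to 4; it is not the exact manifest used, and anything not visible in the output above is an assumption.)

# Sketch reconstructed from the describe output above; resource values mirror
# that output.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-vgpu
spec:
  replicas: 4                         # four pods, as in the test
  selector:
    matchLabels:
      app: test-vgpu
  template:
    metadata:
      labels:
        app: test-vgpu
    spec:
      containers:
        - name: container-1
          image: docker.samzong.me/chrstnhntschl/gpu_burn
          args: ["4000"]
          env:
            - name: CUDA_TASK_PRIORITY
              value: "0"
          resources:
            limits:
              cpu: 250m
              memory: 512Mi
              nvidia.com/gpucores: "10"   # core share, as in the output above
              nvidia.com/gpumem: 2k       # GPU memory, as in the output above
              nvidia.com/vgpu: "1"        # one vGPU device
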
rajeeshckr commented 1 month ago

Thanks for looking into it. Looks like it's working as expected for you. I wanted my fourth pod to be in the Pending state. Do I need to tweak something in hami-scheduler?

rajeeshckr commented 1 month ago

In my case, I am using nvidia.com/gpu as per the examples here. I see nvidia.com/vgpu in your case, but that could just be a device plugin config difference.
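
(To illustrate: only the extended resource name differs between the two setups, and which name appears is determined by how the device plugin is configured.)

# Only the resource name differs; everything else about the request is equivalent.
resources:
  limits:
    nvidia.com/gpu: 1     # name used in my cluster, following the examples
    # nvidia.com/vgpu: 1  # name used in the cluster from the test above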

wawa0210 commented 1 month ago

Can you paste the pod YAML?

Another question: how did you install HAMi? Is it a fresh install or an upgrade from another version?

rajeeshckr commented 1 month ago

This is a fresh install for a POC. Here is the Helm values file I used: values.yaml.txt. Here is the deployed YAML: hami.yaml.txt. Here is the pod spec: test_workload_rajeesh_test.yml.txt

I noticed this bit in the scheduler config; is it because of that?

[screenshot: scheduler config excerpt]
wawa0210 commented 1 month ago

> This is a fresh install for a POC. Here is the Helm values file I used: values.yaml.txt. Here is the deployed YAML: hami.yaml.txt. Here is the pod spec: test_workload_rajeesh_test.yml.txt
>
> I noticed this bit in the scheduler config; is it because of that? [screenshot: scheduler config excerpt]

Everything seems to be normal, and there is nothing that stands out. You can try the following steps (a rough command sketch follows the list):

  1. Restart hami-scheduler and get ready to capture its logs.

  2. Scale your application down to 0 replicas, then back up to 4.

  3. If you find a pod in ContainerStatusUnknown, remember to upload the hami-scheduler extender container log.

  4. Delete the HAMi-related annotations on the node, restart the hami device-plugin, retry step 2, and observe whether things behave normally.
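
Command sketch for the steps above. The kube-system namespace, the hami-scheduler/hami-device-plugin workload names and the vgpu-scheduler-extender container name are assumptions based on a default Helm install; <your-workload>, <node-name> and <hami-annotation-key> are placeholders to fill in for your cluster.

# Step 1: restart the scheduler and tail its extender logs
kubectl -n kube-system rollout restart deploy/hami-scheduler                       # deployment name assumed
kubectl -n kube-system logs deploy/hami-scheduler -c vgpu-scheduler-extender -f    # container name assumed

# Step 2: scale the workload down and back up
kubectl scale deploy/<your-workload> --replicas=0
kubectl scale deploy/<your-workload> --replicas=4

# Step 4: inspect and remove the HAMi-related node annotations, then restart the device plugin
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}'
kubectl annotate node <node-name> <hami-annotation-key>-                            # trailing '-' removes the annotation
kubectl -n kube-system rollout restart daemonset/hami-device-plugin                 # daemonset name assumed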

rajeeshckr commented 1 month ago

Sorry for the confusion; trying this on a brand-new node fixed the issue. Looks like something was in a weird state on my old node. Thanks for having a look!