Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

How to solve failed calling webhook? #379

Closed GuHeeM closed 2 months ago

GuHeeM commented 2 months ago

After installing HAMi, I deployed a GPU pod and got the following errors:

Failed calling webhook, failing open vgpu.hami.io: failed calling webhook "vgpu.hami.io": failed to call webhook: Post "https://hami-scheduler.kube-system.svc:443/webhook?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

E0707 11:53:33.346585 1 dispatcher.go:180] failed calling webhook "vgpu.hami.io": failed to call webhook: Post "https://hami-scheduler.kube-system.svc:443/webhook?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Running kubectl get po -n kube-system shows that the hami-device-plugin pods are running.

Capacity:
  cpu: 24
  ephemeral-storage: 597903888Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 32545252Ki
  nvidia.com/gpu: 0
  pods: 110
Allocatable:
  cpu: 24
  ephemeral-storage: 551028222269
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 32442852Ki
  nvidia.com/gpu: 0
  pods: 110

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource        Requests    Limits
  cpu             850m (3%)   0 (0%)
  memory          240Mi (0%)  340Mi (1%)
  nvidia.com/gpu  0

However, the gpu-pod fails with the following error:

image

How to solve this problem?

archlitchi commented 2 months ago

I've seen your email. Please check whether the scheduler is up and running: run curl {scheduler node ip}:31993/metrics and see whether it returns an overview of the devices in the cluster.
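The same check can be scripted when several nodes need to be probed. The following is only a minimal sketch, assuming the metrics endpoint is served over plain HTTP on port 31993 as in the curl command above; the function name fetch_metrics and the timeout value are illustrative and not part of HAMi.

```python
import urllib.request


def fetch_metrics(host: str, port: int = 31993, timeout: float = 5.0):
    """Return the /metrics body as text, or None if the endpoint is unreachable.

    The port default mirrors the curl command in this thread; the helper
    itself is hypothetical and not provided by HAMi.
    """
    url = f"http://{host}:{port}/metrics"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, refused connections and timeouts are all OSError subclasses.
        return None


# Example usage (replace with the IP of the node running the scheduler):
#   body = fetch_metrics("<scheduler node ip>")
#   print("scheduler reachable" if body is not None else "no response")
```

A None result only says that nothing answered on that address and port within the timeout; it does not tell you whether the scheduler pod itself is healthy.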

GuHeeM commented 2 months ago


I ran the curl command against the metrics endpoint and got an overview of my devices. For example, for a node with 8 GPUs:

image

However, all of the outputs show deviceidx as "0".

Also, when I curl the master and the other worker node, curl fails to connect: connection timed out.

GuHeeM commented 2 months ago

Solved by adding schedulerName: hami-scheduler to the pod spec. Thanks to the support team.
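For readers landing here with the same symptom, a rough sketch of that change follows. schedulerName is the standard Pod spec field that selects which scheduler places the pod. Everything else in the manifest (container name, image, command, and the GPU limit of 1) is a hypothetical placeholder, not the manifest used in this issue.

```python
import copy
import json


def with_scheduler(pod: dict, scheduler: str = "hami-scheduler") -> dict:
    """Return a copy of a Pod manifest with spec.schedulerName set."""
    patched = copy.deepcopy(pod)
    patched.setdefault("spec", {})["schedulerName"] = scheduler
    return patched


# Hypothetical GPU pod manifest; only schedulerName comes from this thread.
gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-pod"},
    "spec": {
        "containers": [
            {
                "name": "main",
                "image": "ubuntu:22.04",
                "command": ["sleep", "infinity"],
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ]
    },
}

if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped to
    # `kubectl apply -f -`.
    print(json.dumps(with_scheduler(gpu_pod), indent=2))
```

The helper copies the manifest instead of mutating it, so the original dictionary can still be compared against the patched one.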