Closed GuHeeM closed 2 months ago
I've seen your email. Please check whether the scheduler is up and running: `curl {scheduler node ip}:31993/metrics`, and see if you get an overview of the devices in the cluster.
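For anyone following along, the check above might look roughly like this (the node IP is a placeholder, and the service name assumes the default HAMi Helm install):

```shell
# Query the HAMi scheduler's metrics endpoint via its NodePort
# (replace <scheduler-node-ip> with the IP of a node reachable from your machine)
curl http://<scheduler-node-ip>:31993/metrics

# If unsure of the port, inspect the scheduler service directly
kubectl get svc -n kube-system | grep hami-scheduler
```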
I curled the metrics endpoint and got an overview of my devices, e.g. an 8-GPU node:
However, all of the outputs have `deviceidx` set to `0`.
But when I curl the master and the other worker node, curl fails to connect: connection timed out.
Solved by adding `schedulerName: hami-scheduler` to the pod spec. Thanks to the support team.
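For reference, a minimal pod spec with that fix might look like the sketch below. The image and the `nvidia.com/gpu` resource name are assumptions based on common HAMi setups; adjust to your install.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  schedulerName: hami-scheduler   # the fix: route the pod through the HAMi scheduler
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04   # hypothetical image choice
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU; assumes the default resource name
```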
After installing HAMi, I deployed a GPU pod and got a webhook failure:

```
Failed calling webhook, failing open vgpu.hami.io
E0707 11:53:33.346585 1 dispatcher.go:180] failed calling webhook "vgpu.hami.io": failed to call webhook: Post "https://hami-scheduler.kube-system.svc:443/webhook?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```
According to `kubectl get po -n kube-system`, the hami-device-plugin pods are running.
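A few generic commands that can help narrow down a webhook timeout like this (names assume the default HAMi install in `kube-system`; treat them as a sketch):

```shell
# Is the scheduler (which serves the webhook) actually running?
kubectl get pods -n kube-system | grep hami-scheduler

# Does the service the webhook points at exist and have endpoints?
kubectl get svc -n kube-system | grep hami-scheduler
kubectl get endpoints -n kube-system | grep hami-scheduler

# Inspect the webhook configuration itself
kubectl get mutatingwebhookconfigurations | grep -i hami
```

If the service has no endpoints, the apiserver cannot reach the webhook, which matches the `Client.Timeout exceeded` error above.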
```
Capacity:
  cpu:                24
  ephemeral-storage:  597903888Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32545252Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                24
  ephemeral-storage:  551028222269
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32442852Ki
  nvidia.com/gpu:     0
  pods:               110
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource        Requests    Limits
  cpu             850m (3%)   0 (0%)
  memory          240Mi (0%)  340Mi (1%)
  nvidia.com/gpu  0
```
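Since the node reports `nvidia.com/gpu: 0` in both Capacity and Allocatable, it may be worth checking whether the device plugin actually registered the GPUs with the kubelet. A generic checklist (pod and node names will differ per install):

```shell
# Check the node's advertised extended resources
kubectl describe node <node-name> | grep -A1 nvidia.com

# Look at the device plugin logs for registration errors
kubectl logs -n kube-system <hami-device-plugin-pod-name>

# Verify the NVIDIA driver is visible on the node itself
nvidia-smi
```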
but the GPU pod has the following error:
How to solve this problem?