Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0
963 stars, 199 forks

gpu pod FailedScheduling #629

Closed 3260926861 closed 2 days ago

3260926861 commented 3 days ago

I hope you can give me some suggestions. Thank you.

What happened: the GPU pod fails with FailedScheduling.

What you expected to happen: the GPU pod reaches the Running status.

How to reproduce it (as minimally and precisely as possible): install HAMi according to the install steps, then run the following deployment.

Create gpu-pod01.yaml:

[image: 50f2231c33e39644163bd157c93da76]

kubectl apply -f gpu-pod01.yaml

[image: c696655ce63f4183b6565281aa5363c]

kubectl describe pod gpu-pod01

[image: a174fd521aa234c685a60eb023ba483]

If I don't request GPU memory, create gpu-pod02.yaml:

[image: d013f2f99cd3fc93f6903eb94749977]

kubectl describe pod gpu-pod01

[image: a365a2a295cb95c271a7e75916b552d]

Anything else we need to know?:

Environment:

Nimbus318 commented 2 days ago

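It currently looks like the API server's call to the webhook failed, so the Pod was scheduled by default-scheduler instead of hami-scheduler, which causes the scheduling failure. The webhook is served by the hami-scheduler Pod itself, and that Pod appears to be Running, so it should be fine. The line of investigation for now is therefore network-related: we need to find out why the API server could not reach the webhook.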
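One quick way to confirm this diagnosis is to look at which scheduler the Pod was actually assigned to. The sketch below is only an illustration, not part of HAMi: it assumes that a successful webhook call sets `spec.schedulerName` to `hami-scheduler`, and the helper names `scheduler_of` and `webhook_was_applied` are hypothetical. It takes the JSON produced by `kubectl get pod gpu-pod01 -o json`.

```python
import json

# Scheduler name taken from the comment above; treating it as the value the
# webhook writes into spec.schedulerName is an assumption of this sketch.
HAMI_SCHEDULER = "hami-scheduler"


def scheduler_of(pod: dict) -> str:
    """Return the scheduler named in the Pod spec.

    Kubernetes uses "default-scheduler" when the field is unset.
    """
    return pod.get("spec", {}).get("schedulerName") or "default-scheduler"


def webhook_was_applied(pod: dict) -> bool:
    """True if the Pod is routed to hami-scheduler (hypothetical helper)."""
    return scheduler_of(pod) == HAMI_SCHEDULER


# Inline sample standing in for the output of
# `kubectl get pod gpu-pod01 -o json`; the values are illustrative only.
sample = json.loads(
    '{"metadata": {"name": "gpu-pod01"},'
    ' "spec": {"schedulerName": "default-scheduler"}}'
)

if webhook_was_applied(sample):
    print("Pod is routed to hami-scheduler")
else:
    print(f"Pod is routed to {scheduler_of(sample)}: the webhook was probably not applied")
```

If the Pod still shows default-scheduler, that would be consistent with the API server never reaching the webhook, and the network path between the API server and the hami-scheduler Pod is the next thing to check.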