At the moment it looks like the API server is failing when it calls the webhook, so the Pod gets scheduled by default-scheduler instead of hami-scheduler, which makes scheduling fail. The webhook is served by the hami-scheduler Pod itself, and that Pod appears to be Running, so it should be fine. The current line of investigation is therefore a network-level problem: we need to figure out why the API server cannot reach the webhook.
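For example, whether the mutating webhook actually ran can be checked from the scheduler the Pod ended up with and from the webhook registration; the object names, and the assumption that the webhook rewrites spec.schedulerName, are based on a default HAMi install:

```sh
# If the webhook ran, the Pod's schedulerName should have been rewritten to hami-scheduler;
# an empty or "default-scheduler" value means the mutation never happened.
kubectl get pod gpu-pod01 -o jsonpath='{.spec.schedulerName}{"\n"}'

# Inspect the webhook registration: the target service, port, and failurePolicy show
# whether a failed webhook call is ignored (the Pod falls through to default-scheduler) or rejected.
kubectl get mutatingwebhookconfigurations -o yaml | grep -B2 -A10 hami
```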
I hope you can give me some suggestions. Thank you
What happened: the GPU pod fails to schedule (FailedScheduling).
What you expected to happen: the GPU pod reaches the Running state.
How to reproduce it (as minimally and precisely as possible): install HAMi following the install steps, then create gpu-pod01.yaml (a sketch of such a manifest follows the commands below) and run:
kubectl apply -f gpu-pod01.yaml
kubectl describe pod gpu-pod01
If I don't request GPU memory, create gpu-pod02.yaml instead:
kubectl describe pod gpu-pod02
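The manifests themselves were not captured in this report. As a rough sketch, a gpu-pod01.yaml requesting GPU memory through HAMi could look like the following; the image and the nvidia.com/gpumem resource name are assumptions based on HAMi's documented defaults, and gpu-pod02.yaml would simply omit the gpumem limit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod01
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1        # number of vGPUs requested
          nvidia.com/gpumem: 3000  # GPU memory in MiB; dropped in gpu-pod02.yaml
```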
Anything else we need to know?:
- The output of nvidia-smi -a on your host
- Your docker or containerd configuration file (e.g. /etc/docker/daemon.json): Master /etc/docker/daemon.json, node1 /etc/docker/daemon.json
- The hami-device-plugin container logs
- The hami-scheduler container logs
- The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
- Any relevant kernel output lines from dmesg
Environment:
- HAMi version: 2.4.1
- nvidia driver or other AI device driver version:
- Docker version from docker version
- Docker command, image and tag used
- Kernel version from uname -a
- Others: kubectl logs kube-apiserver-k8s-master -n kube-system; curl <scheduler-node-ip>:31993/metrics
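In addition to the metrics check on NodePort 31993, a reachability test against the webhook endpoint itself, run from the master node where kube-apiserver lives, would help confirm or rule out the network theory. The service name, namespace, port, and path below are placeholders; the real values can be read from the MutatingWebhookConfiguration:

```sh
# Resolve the ClusterIP of the scheduler service that backs the webhook (name/namespace assumed).
kubectl -n kube-system get svc hami-scheduler -o wide

# From the master node, probe the webhook endpoint directly.
# -k skips certificate verification; the goal is only to see whether the TLS connection is established.
curl -vk https://<hami-scheduler-cluster-ip>:<webhook-port><webhook-path>
```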