Closed mynicolas closed 12 months ago
{"level":"warn","ts":"2023-12-01T20:02:19.107958+0800","logger":"client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000348a80/10.111.0.101:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
https://10.111.0.101:2379 is unhealthy: failed to commit proposal: Unable to fetch the alarm list
Error: unhealthy cluster
https://10.111.0.102:2379 is healthy: successfully committed proposal: took = 582.942526ms
{"level":"warn","ts":"2023-12-01T20:02:25.090909+0800","logger":"client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003daa80/10.111.0.103:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
https://10.111.0.103:2379 is unhealthy: failed to commit proposal: Unable to fetch the alarm list
Error: unhealthy cluster
I tested a few more times. Right after the servers are rebooted, everything is normal:
[root@k8s-master-0 ~]# for ip in ${NODE_IPS}; do ETCDCTL_API=3 etcdctl --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem endpoint health; done
https://10.111.0.101:2379 is healthy: successfully committed proposal: took = 6.550412ms
https://10.111.0.102:2379 is healthy: successfully committed proposal: took = 7.126457ms
https://10.111.0.103:2379 is healthy: successfully committed proposal: took = 6.521575ms
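After roughly a few minutes, CPU usage on all three machines maxes out, etcd becomes unhealthy, and kube-apiserver starts reporting that it cannot connect to etcd.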
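When the health check is run repeatedly while waiting for the failure to appear, it can help to print only the endpoints that report as unhealthy. The following is a minimal sketch, not part of the original report; the helper name `unhealthy_endpoints` is hypothetical, and it assumes the output lines keep the `<endpoint> is unhealthy: ...` format shown in this issue.

```shell
#!/usr/bin/env bash
# unhealthy_endpoints (hypothetical helper): read `etcdctl endpoint health`
# output on stdin and print only the endpoint URLs reported as unhealthy.
unhealthy_endpoints() {
  awk '/ is unhealthy:/ { print $1 }'
}

# Example with two lines in the format shown in this issue.
printf '%s\n' \
  'https://10.111.0.101:2379 is unhealthy: failed to commit proposal: Unable to fetch the alarm list' \
  'https://10.111.0.102:2379 is healthy: successfully committed proposal: took = 582.942526ms' \
  | unhealthy_endpoints
```

Against a live cluster, the per-node loop from this thread could be piped into the helper. Adding `2>&1` before the pipe is a cautious choice, on the assumption that etcdctl may write some of these lines to stderr.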
This is an empty cluster; I have only just installed the base services:
[root@ops ~]# kubectl -n kube-system get po
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-86b55cf789-pkgvk   1/1     Running   0          13m
calico-node-8m9rl                          1/1     Running   0          11m
calico-node-ffg2d                          1/1     Running   0          12m
calico-node-lvkbw                          1/1     Running   0          14m
calico-node-n7z7b                          1/1     Running   0          14m
coredns-7bc88ddb8b-j6sln                   1/1     Running   0          11m
csi-nfs-controller-556f767bc7-lxbkw        4/4     Running   0          2m26s
csi-nfs-node-5vtkb                         3/3     Running   0          11m
csi-nfs-node-8kmb7                         3/3     Running   0          11m
csi-nfs-node-cq479                         3/3     Running   0          11m
csi-nfs-node-gwhz7                         3/3     Running   0          11m
metrics-server-dfb478476-jk7ms             1/1     Running   0          11m
node-local-dns-6rrjr                       1/1     Running   0          10m
node-local-dns-bpl8j                       1/1     Running   0          10m
node-local-dns-dnddg                       1/1     Running   0          11m
node-local-dns-r5m74                       1/1     Running   0          10m
Resolved. The problem was caused by an insufficient number of file handles.
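For anyone seeing the same symptom, file handle exhaustion can be checked roughly as follows. This is a minimal sketch under stated assumptions: it assumes a Linux node with `/proc` mounted, the helper name `fd_usage` is hypothetical, and the thread does not say which limit was raised or to what value.

```shell
#!/usr/bin/env bash
# fd_usage PID (hypothetical helper): print the number of open file
# descriptors and the soft "Max open files" limit for one process.
# Reads /proc, so it works on Linux only.
fd_usage() {
  local pid="$1"
  local open limit
  open=$(ls "/proc/${pid}/fd" 2>/dev/null | wc -l)
  limit=$(awk '/^Max open files/ { print $4 }' "/proc/${pid}/limits")
  echo "pid=${pid} open_fds=${open} soft_limit=${limit}"
}

# System-wide view: allocated handles, unused handles, system maximum.
cat /proc/sys/fs/file-nr

# Per-process view for etcd, if a process named etcd runs on this node.
for pid in $(pgrep -x etcd 2>/dev/null); do
  fd_usage "${pid}"
done
```

If `open_fds` sits close to `soft_limit` for etcd or the kube-* services, the limit is a likely suspect. Where that limit is configured (for example a systemd unit or `/etc/security/limits.conf`) depends on how the services were installed.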
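What happened?
Right after the servers boot, everything is normal. After running for a while, CPU usage reaches 100%, with the kube-* and etcd services consuming excessive CPU, and the etcd cluster becomes abnormal, reporting the following errors:
What did you expect to happen?
I suspect the high CPU usage is caused by the etcd failure. I would appreciate guidance on how to troubleshoot this.
How can we reproduce it (as minimally and precisely as possible)?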
[root@k8s-master-0 ~]# for ip in ${NODE_IPS}; do
Anything else we need to know?
No response
Kubernetes version
Kubeasz version
OS version
Related plugins (CNI, CSI, ...) and versions (if applicable)