easzlab / kubeasz

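Install Kubernetes clusters with Ansible scripts, with explanations of how the components interact. Simple and direct, and unaffected by network conditions in mainland China.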
https://github.com/easzlab/kubeasz
10.54k stars 3.53k forks

etcd becomes unhealthy after installation #1334

Closed mynicolas closed 12 months ago

mynicolas commented 12 months ago

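What happened?

Right after the servers boot, everything is normal. After running for a while, CPU usage climbs to 100%, with the kube-* and etcd services consuming most of it. The etcd cluster becomes unhealthy and reports errors (see the output under the reproduction steps).

What did you expect to happen?

I suspect the high CPU usage is caused by the etcd failure, and would appreciate guidance on how to troubleshoot it.

How can we reproduce it (as minimally and precisely as possible)?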

[root@k8s-master-0 ~]# for ip in ${NODE_IPS}; do
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://${ip}:2379 \
    --cacert=/etc/kubernetes/ssl/ca.pem \
    --cert=/etc/kubernetes/ssl/etcd.pem \
    --key=/etc/kubernetes/ssl/etcd-key.pem \
    endpoint health; done
{"level":"warn","ts":"2023-12-01T19:47:11.922177+0800","logger":"client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000368c0/192.168.1.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.1.1:2379: connect: connection refused\""}
https://192.168.1.1:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
{"level":"warn","ts":"2023-12-01T19:47:16.994929+0800","logger":"client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000374a80/192.168.1.2:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.1.2:2379: connect: connection refused\""}
https://192.168.1.2:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
{"level":"warn","ts":"2023-12-01T19:47:22.053019+0800","logger":"client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000374e00/192.168.1.3:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.1.3:2379: connect: no route to host\""}
https://192.168.1.3:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
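As a complement to the per-node loop above, here is a minimal sketch that queries every member in one `etcdctl` call and prints leader, DB size and raft details in a table. The helper name `etcd_endpoints` is hypothetical; the certificate paths and the `NODE_IPS` variable are taken from the command in this report, and port 2379 is assumed for every member.

```shell
# Hypothetical helper: turn a list of IPs into a comma-separated
# list of https endpoints on port 2379 (assumed client port).
etcd_endpoints() {
  out=""
  for ip in "$@"; do
    out="${out:+${out},}https://${ip}:2379"
  done
  printf '%s\n' "$out"
}

# Only call etcdctl when it is installed and NODE_IPS is set.
if command -v etcdctl >/dev/null 2>&1 && [ -n "${NODE_IPS:-}" ]; then
  # NODE_IPS is left unquoted on purpose so it splits into one argument per IP.
  ETCDCTL_API=3 etcdctl \
    --endpoints="$(etcd_endpoints ${NODE_IPS})" \
    --cacert=/etc/kubernetes/ssl/ca.pem \
    --cert=/etc/kubernetes/ssl/etcd.pem \
    --key=/etc/kubernetes/ssl/etcd-key.pem \
    endpoint status --write-out=table
fi
```

`endpoint status` shows which member is the leader and whether the members agree on the raft term, which can help tell a single slow node apart from a cluster that has lost quorum.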

Anything else we need to know?

No response

Kubernetes version

v1.28.1

Kubeasz version

commit 98823999982328e392c6a3cf164f476061329ed8 (HEAD -> master, origin/v3.6, origin/master, origin/HEAD)
Author: Louis Wang
Date:   Sun Oct 8 09:45:17 2023 +0800

    docs: fix the quickStart.md url in network-plugin

    the quickStart.md url is wrong ,when click in preview page,will get 404 .

OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.9 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.9 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.9"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"
$ uname -a
Linux ops 4.18.0-513.5.1.el8_9.x86_64 #1 SMP Fri Nov 17 03:31:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```

Related plugins (CNI, CSI, ...) and versions (if applicable)

mynicolas commented 12 months ago

{"level":"warn","ts":"2023-12-01T20:02:19.107958+0800","logger":"client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000348a80/10.111.0.101:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
https://10.111.0.101:2379 is unhealthy: failed to commit proposal: Unable to fetch the alarm list
Error: unhealthy cluster
https://10.111.0.102:2379 is healthy: successfully committed proposal: took = 582.942526ms
{"level":"warn","ts":"2023-12-01T20:02:25.090909+0800","logger":"client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003daa80/10.111.0.103:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
https://10.111.0.103:2379 is unhealthy: failed to commit proposal: Unable to fetch the alarm list
Error: unhealthy cluster

mynicolas commented 12 months ago

I tested a few more times. After the servers are rebooted, everything is normal:

[root@k8s-master-0 ~]# for ip in ${NODE_IPS}; do ETCDCTL_API=3 etcdctl --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem endpoint health; done
https://10.111.0.101:2379 is healthy: successfully committed proposal: took = 6.550412ms
https://10.111.0.102:2379 is healthy: successfully committed proposal: took = 7.126457ms
https://10.111.0.103:2379 is healthy: successfully committed proposal: took = 6.521575ms

After a few minutes, CPU usage on all three machines maxes out, etcd becomes unhealthy, and kube-apiserver starts reporting that it cannot connect to etcd.

mynicolas commented 12 months ago

This is an empty cluster; I have only just installed the base services:

[root@ops ~]# kubectl -n kube-system get po
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-86b55cf789-pkgvk   1/1     Running   0          13m
calico-node-8m9rl                          1/1     Running   0          11m
calico-node-ffg2d                          1/1     Running   0          12m
calico-node-lvkbw                          1/1     Running   0          14m
calico-node-n7z7b                          1/1     Running   0          14m
coredns-7bc88ddb8b-j6sln                   1/1     Running   0          11m
csi-nfs-controller-556f767bc7-lxbkw        4/4     Running   0          2m26s
csi-nfs-node-5vtkb                         3/3     Running   0          11m
csi-nfs-node-8kmb7                         3/3     Running   0          11m
csi-nfs-node-cq479                         3/3     Running   0          11m
csi-nfs-node-gwhz7                         3/3     Running   0          11m
metrics-server-dfb478476-jk7ms             1/1     Running   0          11m
node-local-dns-6rrjr                       1/1     Running   0          10m
node-local-dns-bpl8j                       1/1     Running   0          10m
node-local-dns-dnddg                       1/1     Running   0          11m
node-local-dns-r5m74                       1/1     Running   0          10m

mynicolas commented 12 months ago

Resolved. The cause was an insufficient number of file handles.
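For anyone hitting similar symptoms, here is a minimal sketch of one way to compare a process's open file descriptors with its soft limit on Linux. The thread does not say which limit was exhausted or how it was raised, so treat this as an illustration only. The function names are hypothetical, and the process name `etcd` is an assumption about how the service runs on the node.

```shell
# Hypothetical helper: print the soft "Max open files" limit from a
# /proc/<pid>/limits style file (fourth whitespace-separated field).
soft_nofile_limit() {
  awk '/^Max open files/ {print $4}' "$1"
}

# Hypothetical helper: count entries in a /proc/<pid>/fd style directory
# (one entry per open descriptor).
count_open_fds() {
  find "$1" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}

# Print open descriptors and the soft limit for one PID.
fd_report() {
  echo "pid=$1 open=$(count_open_fds "/proc/$1/fd") soft_limit=$(soft_nofile_limit "/proc/$1/limits")"
}

# Assumption: the etcd process is named "etcd". Reading /proc/<pid>/fd
# for another user's process usually requires root.
if pid=$(pgrep -o -x etcd 2>/dev/null); then
  fd_report "$pid"
fi
```

If the open count is close to the soft limit, raising the limit is the usual remedy; for a systemd-managed service that is commonly done with `LimitNOFILE` in the unit file. System-wide usage can be read from `/proc/sys/fs/file-nr`.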