logs of the etcd instances would be helpful here.
Around 2019-08-09 10:50 the cluster became unstable and the leader changed frequently, as shown in the log details below. We restarted the cluster (including the gateway) around 11:50 and later cut off all the clients; after that the logs went back to normal, but CPU and virtual memory usage stayed unbelievably high.
We enabled debug logging on etcd1:
----------------------------------lots of these exception logs-------------------------------------
2019-08-09 10:57:50.500232 [pkg_logger.go:114] W | etcdserver: request "header:
2019-08-09 11:54:06.674422 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:21.674427 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:36.674458 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:51.674488 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:55:06.674434 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:55:21.637016 [pkg_logger.go:147] D | auth: jwt token: eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJyZXZpc2lvbiI6MjQsInVzZXJuYW1lIjoicm9vdCJ9.qpsWIpNOLIaJmrc4l3Y2__uI9S9hAwGwpKo7sEj2iRcx2l6rMZKlP6LeKmlGKx91t-R5rCE56EqhbHD0W9h2KYwxexnLAE3BIDYsOVcisT7TkkGfmC1H-47PY7wq7a3Xq3Aitp4pTubEFLrL1Vph2ABwJQrn3No7pg9-Lid5_WezXeXpR4noFyX_fmcm7ZrhK6PoO_M2yMSWr4LjhgDs8tM7odDskR5R-EJI-P7QbtI0Hp8a549069Xzv70x0njuF_YoB-bgY9dwXvMUX609VSGHX-oyMzRfb6c559LM_iheaR2ShLqA8t6D4KpW8_OWgxBs8JTyKdOveu2NAhuqOQC8-uibGIE3VHWMagyEwXMbbfFkf24V9ksFkLus8MIWSFyu3xvq1mFcgwqxOZCAl1_uUGK8IZFVm6sY9k1Knvzd2xotIRsA4F1brA7tQjtU9v4C6CbVHbRXWXJGtOQaOIC5n7eH9Sobe9R54isClTlLVuZgLnxWV0GPHEc_cEmP5nElC_7GpKRBhWFO9Jp2W7bzcYG2atQfALenxriwXcy2EzLh5NEgPC5WWs9e_ImWvyfh-ehFMj-QZAtngBPbMg9OKNbcr0R46ReR6oIL0n-1-Afh_7AqqGpgeslGjPbl56RRJM4fA-sCEK6aw29pYGCMgZWdmWA4O7wv-R25ppI
--------error log on etcd 2, lots of go routine error---------
goroutine 50862 [select, 1 minutes]:
runtime.gopark(0x5633e10783f0, 0x0, 0x5633e09fda66, 0x6, 0x18, 0x1)
        /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/proc.go:291 +0x120 fp=0xc46b51f678 sp=0xc46b51f658 pc=0x5633e00cefb0
runtime.selectgo(0xc46b51f8e0, 0xc46b0e5080)
        /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/select.go:392 +0xe56 fp=0xc46b51f850 sp=0xc46b51f678 pc=0x5633e00ded06
github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch(0xc4205f3d00, 0x5633e108f560, 0xc46aa79070, 0x5633e107caa0, 0xc4205f3d00)
        /builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:151 +0x39b fp=0xc46b51fa08 sp=0xc46b51f850 pc=0x5633e08f543b
github.com/coreos/etcd/etcdserver/etcdserverpb._Watch_Watch_Handler(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e180, 0xc46b388540, 0x5633e00b2c4a, 0x20)
        /builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3481 +0xb4 fp=0xc46b51fa50 sp=0xc46b51fa08 pc=0x5633e04e0054
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e180, 0xc46b388540, 0x19, 0xc46b40afa0)
        /builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:71 +0xf7 fp=0xc46b51faa8 sp=0xc46b51fa50 pc=0x5633e08c8c47
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).StreamServerInterceptor.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0xc46b40af50, 0x5633e15dd638, 0xc46aeb42c0)
        /builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:125 +0x12c fp=0xc46b51fb40 sp=0xc46b51faa8 pc=0x5633e08cc8ac
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0x5633e15de820, 0xc46aeb6300)
        /builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:74 +0x9e fp=0xc46b51fb98 sp=0xc46b51fb40 pc=0x5633e08c8bee
github.com/coreos/etcd/etcdserver/api/v3rpc.newStreamInterceptor.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0xc46b40af50, 0x0, 0x0)
        /builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/api/v3rpc/interceptor.go:212 +0xf4 fp=0xc46b51fc28 sp=0xc46b51fb98 pc=0x5633e08f7c04
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0x5633e1075158, 0x5633e04e0164, 0xc460c10bc0)
        /builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:79 +0x168 fp=0xc46b51fc98 sp=0xc46b51fc28 pc=0x5633e08c8de8
google.golang.org/grpc.(*Server).processStreamingRPC(0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00, 0xc4205474a0, 0x5633e15681c0, 0x0, 0x0, 0x0)
        /builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:957 +0x2cf fp=0xc46b51fdc8 sp=0xc46b51fc98 pc=0x5633e04b315f
google.golang.org/grpc.(*Server).handleStream(0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00, 0x0)
        /builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:1044 +0x12b3 fp=0xc46b51ff80 sp=0xc46b51fdc8 pc=0x5633e04b4cb3
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc421783690, 0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00)
        /builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:589 +0xa1 fp=0xc46b51ffb8 sp=0xc46b51ff80 pc=0x5633e04bad71
runtime.goexit()
        /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc46b51ffc0 sp=0xc46b51ffb8 pc=0x5633e00fcad1
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:587 +
--------------------etcd3 just have warning log-------------------------
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport
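When the leader is flapping like this, one quick external check is etcdctl's endpoint status; this is a minimal sketch, assuming the v3 API and placeholder endpoint URLs (add --user if auth is enabled):

$ ETCDCTL_API=3 etcdctl endpoint status \
      --endpoints="http://etcd1:2379,http://etcd2:2379,http://etcd3:2379" \
      -w table
# the IS LEADER and RAFT TERM columns show who is leader right now; a term
# that keeps climbing between runs matches the frequent elections seen above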
The logs may be the result rather than the cause. You can run perf top -p <pid> to track down the CPU hotspots.
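Spelled out, that suggestion might look like this (a sketch only; it assumes the process is named etcd and that perf is installed on the node):

$ perf top -p "$(pidof etcd)"                       # live hotspots for the etcd process only
$ perf record -g -p "$(pidof etcd)" -- sleep 30     # or record 30 seconds with call graphs
$ perf report --sort comm,dso,symbol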
Can somebody help with this issue? The perf top output is here:

$ perf top --sort comm,dso
  99.34%  etcd      etcd
   0.24%  perf      perf
   0.10%  perf      [kernel]

$ perf top
   9.33%  etcd      [.] 0x00000000006315cd
   8.77%  etcd      [.] _start
   2.54%  etcd      [.] 0x0000000000640a81
   2.49%  etcd      [.] 0x000000000064081f
   2.40%  etcd      [.] 0x0000000000640b48
   2.19%  etcd      [.] 0x0000000000640ad1
   2.19%  etcd      [.] 0x0000000000640aba
   2.16%  etcd      [.] 0x0000000000640a97
   2.11%  etcd      [.] 0x0000000000640824
   2.09%  etcd      [.] 0x0000000000640b5e
   2.03%  etcd      [.] 0x0000000000640b7e
   1.94%  etcd      [.] 0x0000000000640aa3
   1.94%  etcd      [.] 0x0000000000640ac5
   1.89%  etcd      [.] 0x0000000000640811
   1.88%  etcd      [.] 0x0000000000640807
   1.83%  etcd      [.] 0x0000000000640a8c
   1.80%  etcd      [.] 0x0000000000640aae
   1.45%  etcd      [.] 0x0000000000640b6b
   1.44%  etcd      [.] 0x0000000000640815
   1.41%  etcd      [.] 0x000000000064081a
   1.33%  etcd      [.] 0x0000000000640a7a
   1.24%  etcd      [.] 0x0000000000640b53
   1.15%  etcd      [.] 0x0000000000640b50
   1.02%  etcd      [.] 0x0000000000640a84
   1.01%  etcd      [.] 0x00000000006407fe
   1.01%  etcd      [.] 0x0000000000640b4c
   1.00%  etcd      [.] 0x00000000006407eb
   0.91%  etcd      [.] 0x0000000000640b45
   0.88%  etcd      [.] 0x0000000000640a9e
   0.88%  etcd      [.] 0x000000000064080c
   0.84%  etcd      [.] 0x0000000000640b57
   0.81%  [kernel]  [k] __softirqentry_text_start
   0.74%  etcd      [.] 0x0000000000640b76
   0.72%  etcd      [.] 0x00000000006409c0
   0.70%  etcd      [.] 0x0000000000640b6f
   0.65%  etcd      [.] 0x00000000006409c4
   0.64%  etcd      [.] 0x00000000006407f9
   0.63%  etcd      [.] 0x0000000000640802
   0.60%  etcd      [.] 0x0000000000640b79
   0.60%  etcd      [.] 0x0000000000640a88
   0.59%  etcd      [.] 0x0000000000640a9a
   0.58%  [kernel]  [k] run_timer_softirq
   0.56%  etcd      [.] 0x0000000000640b66
   0.54%  etcd      [.] 0x0000000000640ab5
   0.54%  etcd      [.] 0x0000000000640acc
   0.53%  etcd      [.] 0x0000000000640b72
   0.49%  etcd      [.] 0x000000000063147c
   0.41%  etcd      [.] 0x0000000000640a7e
   0.41%  etcd      [.] 0x00000000006407e8
   0.38%  etcd      [.] 0x0000000000640ab1
   0.35%  etcd      [.] 0x0000000000640a94
   0.34%  etcd      [.] 0x0000000000640a22
   0.32%  etcd      [.] 0x000000000062fdfd
   0.31%  etcd      [.] 0x00000000006315eb
   0.30%  etcd      [.] 0x0000000000640ac8
   0.30%  etcd      [.] 0x00000000006313cf
   0.30%  etcd      [.] 0x0000000000640b63
   0.29%  etcd      [.] 0x0000000000640a26
   0.28%  [kernel]  [k] __lock_text_start
   0.27%  etcd      [.] 0x00000000006313d4
   0.27%  etcd      [.] 0x0000000000631528
   0.27%  etcd      [.] 0x000000000062fee6
   0.26%  etcd      [.] 0x000000000057b1c7
   0.25%  etcd      [.] 0x0000000000631457
   0.24%  etcd      [.] 0x00000000006407ef
   0.19%  etcd      [.] 0x000000000063140a
   0.18%  etcd      [.] 0x000000000063155b
   0.17%  etcd      [.] 0x0000000000640b41
$ perf top -g
   Children   Self   Shared Object   Symbol
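Because the samples above resolve almost entirely to raw addresses inside the etcd binary, a Go-level CPU profile is usually easier to interpret. A hedged sketch, assuming the member can be restarted with etcd's --enable-pprof flag and that the Go toolchain is available where the analysis is run (the endpoint URL is a placeholder):

$ etcd --enable-pprof ...other flags...                            # exposes /debug/pprof/ on the client URL
$ go tool pprof "http://etcd1:2379/debug/pprof/profile?seconds=30"
(pprof) top 20                                                     # CPU hotspots by Go function name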
As a sanity check, make sure all 3 etcd servers are up and running, and they are connected to each other.
Every time we restart the cluster, all nodes come up and are connected to each other. But several days later the CPU overload causes the nodes to miss heartbeats from each other.
The only thing we can do is restart the cluster, but that does not bring the high CPU and virtual memory usage back down.
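While the root cause is being tracked down, one mitigation (a sketch only; the values below are illustrative, not tuned for this cluster) is to give raft more slack so a CPU-starved member is not voted out immediately, via etcd's --heartbeat-interval and --election-timeout flags:

$ etcd --heartbeat-interval=300 --election-timeout=3000 ...other flags...
# defaults are 100 ms and 1000 ms; keep the election timeout roughly 10x the heartbeat interval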
This bug was fixed in [#10218].
Thanks, this problem has not reappeared since we updated our etcd cluster.
@weopm May I know if etcd 3.3.12 still has the issue? We have a similar high CPU usage issue on etcd 3.3.12. Thanks!
This bug still exists on etcd 3.3.18. Maybe you can apply a patch based on (https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L560) or update to etcd 4.4.3.
@weopm thanks for the quick reply. To double confirm, should it be 3.4.3 that has the fix?
This bug is fixed on etcd 4.4.3, not 3.4.3. You can review and confirm the code yourself at (https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L560).
@weopm - We're seeing this issue in production workloads on 3.3.12. I noticed that there was a cherry-pick commit for 3.4, which was merged, and a cherry-pick for 3.3, which was closed and not merged.
Is it possible to get this fix into 3.3 and create a new patch release?
I do see it's in the v3.4.3 tag as well: https://github.com/etcd-io/etcd/blob/v3.4.3/etcdserver/v3_server.go#L607
PR to cherry-pick into 3.3: https://github.com/etcd-io/etcd/pull/11378
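To confirm which release each server and client is actually running before and after an upgrade (the endpoint is a placeholder):

$ etcd --version                     # server binary version
$ ETCDCTL_API=3 etcdctl version      # client and API version
$ ETCDCTL_API=3 etcdctl endpoint status --endpoints="http://etcd1:2379" -w table   # VERSION column per member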
System: AWS EC2, Linux version 4.14.128-112.105 x86_64
etcd version: 3.3.11
startup configuration:
top info:
This happened the day after I changed the root password through the gateway: echo "newpass"|etcdctl user passwd root --user="root:root" --interactive=false --endpoints="http://gateway:23790" (though this may not be the direct cause).