etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

3-node etcd cluster with two nodes in high CPU usage #11012

Closed phosae closed 4 years ago

phosae commented 5 years ago

System: AWS EC2, Linux 4.14.128-112.105 x86_64
etcd version: 3.3.11

start up configuration

etcd --name m1 --initial-advertise-peer-urls http://etcd1:2380 \
  --listen-peer-urls http://etcd1:2380 \
  --listen-client-urls http://etcd1:2379 \
  --advertise-client-urls http://etcd1:2379 --debug=true --log-output=stdout \
  --auth-token=jwt,pub-key=./jwt_RS256.pub,priv-key=./jwt_RS256.rsa,sign-method=RS256 \
  --data-dir /data/etcd --initial-cluster-token my-cluster \
  --initial-cluster m1=http://etcd1:2380,m2=http://etcd2:2380,m3=http://etcd3:2380 \
  --initial-cluster-state new

m2 and m3 startup shells are similar to m1's

# we use gateway
/usr/local/bin/etcd gateway start --endpoints=etcd1:2379,etcd2:2379,etcd3:2379

top info

PID   USER  PR   NI  VIRT   RES   SHR    S  %CPU   %MEM  TIME+      COMMAND
10297 root  20   0   11.2g  94708 17812  R  394.0  0.6   963:57.14  etcd

PID   USER  PR   NI  VIRT   RES    SHR   S  %CPU   %MEM  TIME+    COMMAND
7787  root  20   0   11.2g  91764  17760 S  393.4  0.6   1019:19  etcd

PID   USER   PR  NI   VIRT   RES    SHR   S   %CPU  %MEM  TIME+   COMMAND
12748 root   20   0   11.1g  55472  17724 S   0.3   0.3   1:31.41 etcd
(pprof) top10
Showing nodes accounting for 65960ms, 87.63% of 75270ms total
Dropped 349 nodes (cum <= 376.35ms)
Showing top 10 nodes out of 72
      flat  flat%   sum%        cum   cum%
   18990ms 25.23% 25.23%    18990ms 25.23%  math/big.mulAddVWW
   14630ms 19.44% 44.67%    14630ms 19.44%  math/big.addMulVVW
   12920ms 17.16% 61.83%    12920ms 17.16%  math/big.subVV
   12220ms 16.23% 78.07%    47160ms 62.65%  math/big.nat.divLarge
    2470ms  3.28% 81.35%    17210ms 22.86%  math/big.basicSqr
    1150ms  1.53% 82.87%     1980ms  2.63%  runtime.scanobject
    1100ms  1.46% 84.34%     1100ms  1.46%  math/big.shlVU
     890ms  1.18% 85.52%      890ms  1.18%  runtime.memclrNoHeapPointers
     820ms  1.09% 86.61%      820ms  1.09%  math/big.greaterThan (inline)
     770ms  1.02% 87.63%      770ms  1.02%  math/big.shrVU

This happened the next day after I changed the root password through the gateway: echo "newpass"|etcdctl user passwd root --user="root:root" --interactive=false --endpoints="http://gateway:23790" (though this may not be the direct cause).

hexfusion commented 5 years ago

logs of the etcd instances would be helpful here.

phosae commented 5 years ago

logs of the etcd instances would be helpful here.

Around 2019-08-09 10:50 the cluster became unstable and the leader changed frequently, as the log details below show. We restarted the cluster (including the gateway) around 11:50 and later cut off all the clients; the log went back to normal, but CPU and virtual memory usage stayed unbelievably high.

We enabled debug logging on etcd1. Lots of these exception logs:

2019-08-09 10:57:50.500232 [pkg_logger.go:114] W | etcdserver: request "header: authenticate:" with result "size:796" took too long (331.139139ms) to execute
2019-08-09 10:57:51.131115 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 71.742296ms)
2019-08-09 10:57:51.131153 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.131165 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 71.800355ms)
2019-08-09 10:57:51.131173 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.644504 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 123.516079ms)
2019-08-09 10:57:51.644542 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.644554 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 123.578241ms)
2019-08-09 10:57:51.644562 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.758637 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:19754
2019-08-09 10:57:52.253125 [pkg_logger.go:147] D | auth: jwt token: eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJyZXZpc2lvbiI6MjQsInVzZXJuYW1lIjoicm9vdCJ9.qpsWIpNOLIaJmrc4l3Y2uI9S9hAwGwpKo7sEj2iRcx2l6rMZKlP6LeKmlGKx91t-R5rCE56EqhbHD0W9h2KYwxexnLAE3BIDYsOVcisT7TkkGfmC1H-47PY7wq7a3Xq3Aitp4pTubEFLrL1Vph2ABwJQrn3No7pg9-Lid5_WezXeXpR4noFyX_fmcm7ZrhK6PoO_M2yMSWr4LjhgDs8tM7odDskR5R-EJI-P7QbtI0Hp8a549069Xzv70x0njuF_YoB-bgY9dwXvMUX609VSGHX-oyMzRfb6c559LM_iheaR2ShLqA8t6D4KpW8_OWgxBs8JTyKdOveu2NAhuqOQC8-uibGIE3VHWMagyEwXMbbfFkf24V9ksFkLus8MIWSFyu3xvq1mFcgwqxOZCAl1_uUGK8IZFVm6sY9k1Knvzd2xotIRsA4F1brA7tQjtU9v4C6CbVHbRXWXJGtOQaOIC5n7eH9Sobe9R54isClTlLVuZgLnxWV0GPHEc_cEmP5nElC_7GpKRBhWFO9Jp2W7bzcYG2atQfALenxriwXcy2EzLh5NEgPC5WWs9e_ImWvyfh-ehFMj-QZAtngBPbMg9OKNbcr0R46ReR6oIL0n-1-Afh_7AqqGpgeslGjPbl56RRJM4fA-sCEK6aw29pYGCMgZWdmWA4O7wv-R25ppI
2019-08-09 10:57:52.253169 [pkg_logger.go:147] D | auth: authorized root, token is eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJyZXZpc2lvbiI6MjQsInVzZXJuYW1lIjoicm9vdCJ9.qpsWIpNOLIaJmrc4l3Y2uI9S9hAwGwpKo7sEj2iRcx2l6rMZKlP6LeKmlGKx91t-R5rCE56EqhbHD0W9h2KYwxexnLAE3BIDYsOVcisT7TkkGfmC1H-47PY7wq7a3Xq3Aitp4pTubEFLrL1Vph2ABwJQrn3No7pg9-Lid5_WezXeXpR4noFyX_fmcm7ZrhK6PoO_M2yMSWr4LjhgDs8tM7odDskR5R-EJI-P7QbtI0Hp8a549069Xzv70x0njuF_YoB-bgY9dwXvMUX609VSGHX-oyMzRfb6c559LM_iheaR2ShLqA8t6D4KpW8_OWgxBs8JTyKdOveu2NAhuqOQC8-uibGIE3VHWMagyEwXMbbfFkf24V9ksFkLus8MIWSFyu3xvq1mFcgwqxOZCAl1_uUGK8IZFVm6sY9k1Knvzd2xotIRsA4F1brA7tQjtU9v4C6CbVHbRXWXJGtOQaOIC5n7eH9Sobe9R54isClTlLVuZgLnxWV0GPHEc_cEmP5nElC_7GpKRBhWFO9Jp2W7bzcYG2atQfALenxriwXcy2EzLh5NEgPC5WWs9e_ImWvyfh-ehFMj-QZAtngBPbMg9OKNbcr0R46ReR6oIL0n-1-Afh_7AqqGpgeslGjPbl56RRJM4fA-sCEK6aw29pYGCMgZWdmWA4O7wv-R25ppI
2019-08-09 10:57:52.253239 [pkg_logger.go:114] W | etcdserver: request "header: authenticate:" with result "size:796" took too long (369.255968ms) to execute
...
2019-08-09 10:59:04.482563 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 45.105602ms)
2019-08-09 10:59:04.482592 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:59:04.482602 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 45.154347ms)
...
2019-08-09 11:47:33.391078 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.391130 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]
2019-08-09 11:47:33.511899 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.511955 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]
2019-08-09 11:47:33.592380 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.592432 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]
2019-08-09 11:47:33.713162 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.713206 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]

----------restart etcd cluster around 11:50, log looks quite normal, but cpu and virtual mem usage still high-------------

2019-08-09 11:50:50.185269 [pkg_logger.go:147] D | rafthttp: failed to dial e5a61e8de658ec02 on stream MsgApp v2 (dial tcp etcd2:2380: connect: connection refused)
2019-08-09 11:50:50.185407 [pkg_logger.go:147] D | rafthttp: failed to dial e5a61e8de658ec02 on stream Message (dial tcp etcd3:2380: connect: connection refused)
2019-08-09 11:50:50.268598 [pkg_logger.go:134] I | raft: d9bfe1930e475fc is starting a new election at term 4366
2019-08-09 11:50:50.268620 [pkg_logger.go:134] I | raft: d9bfe1930e475fc became candidate at term 4367

2019-08-09 11:54:06.674422 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:21.674427 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:36.674458 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:51.674488 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:55:06.674434 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:55:21.637016 [pkg_logger.go:147] D | auth: jwt token: eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJyZXZpc2lvbiI6MjQsInVzZXJuYW1lIjoicm9vdCJ9.qpsWIpNOLIaJmrc4l3Y2__uI9S9hAwGwpKo7sEj2iRcx2l6rMZKlP6LeKmlGKx91t-R5rCE56EqhbHD0W9h2KYwxexnLAE3BIDYsOVcisT7TkkGfmC1H-47PY7wq7a3Xq3Aitp4pTubEFLrL1Vph2ABwJQrn3No7pg9-Lid5_WezXeXpR4noFyX_fmcm7ZrhK6PoO_M2yMSWr4LjhgDs8tM7odDskR5R-EJI-P7QbtI0Hp8a549069Xzv70x0njuF_YoB-bgY9dwXvMUX609VSGHX-oyMzRfb6c559LM_iheaR2ShLqA8t6D4KpW8_OWgxBs8JTyKdOveu2NAhuqOQC8-uibGIE3VHWMagyEwXMbbfFkf24V9ksFkLus8MIWSFyu3xvq1mFcgwqxOZCAl1_uUGK8IZFVm6sY9k1Knvzd2xotIRsA4F1brA7tQjtU9v4C6CbVHbRXWXJGtOQaOIC5n7eH9Sobe9R54isClTlLVuZgLnxWV0GPHEc_cEmP5nElC_7GpKRBhWFO9Jp2W7bzcYG2atQfALenxriwXcy2EzLh5NEgPC5WWs9e_ImWvyfh-ehFMj-QZAtngBPbMg9OKNbcr0R46ReR6oIL0n-1-Afh_7AqqGpgeslGjPbl56RRJM4fA-sCEK6aw29pYGCMgZWdmWA4O7wv-R25ppI

--------error log on etcd2, lots of goroutine dumps---------

goroutine 50862 [select, 1 minutes]:
runtime.gopark(0x5633e10783f0, 0x0, 0x5633e09fda66, 0x6, 0x18, 0x1)
	/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/proc.go:291 +0x120 fp=0xc46b51f678 sp=0xc46b51f658 pc=0x5633e00cefb0
runtime.selectgo(0xc46b51f8e0, 0xc46b0e5080)
	/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/select.go:392 +0xe56 fp=0xc46b51f850 sp=0xc46b51f678 pc=0x5633e00ded06
github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch(0xc4205f3d00, 0x5633e108f560, 0xc46aa79070, 0x5633e107caa0, 0xc4205f3d00)
	/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:151 +0x39b fp=0xc46b51fa08 sp=0xc46b51f850 pc=0x5633e08f543b
github.com/coreos/etcd/etcdserver/etcdserverpb._Watch_Watch_Handler(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e180, 0xc46b388540, 0x5633e00b2c4a, 0x20)
	/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3481 +0xb4 fp=0xc46b51fa50 sp=0xc46b51fa08 pc=0x5633e04e0054
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e180, 0xc46b388540, 0x19, 0xc46b40afa0)
	/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:71 +0xf7 fp=0xc46b51faa8 sp=0xc46b51fa50 pc=0x5633e08c8c47
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).StreamServerInterceptor.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0xc46b40af50, 0x5633e15dd638, 0xc46aeb42c0)
	/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:125 +0x12c fp=0xc46b51fb40 sp=0xc46b51faa8 pc=0x5633e08cc8ac
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0x5633e15de820, 0xc46aeb6300)
	/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:74 +0x9e fp=0xc46b51fb98 sp=0xc46b51fb40 pc=0x5633e08c8bee
github.com/coreos/etcd/etcdserver/api/v3rpc.newStreamInterceptor.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0xc46b40af50, 0x0, 0x0)
	/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/api/v3rpc/interceptor.go:212 +0xf4 fp=0xc46b51fc28 sp=0xc46b51fb98 pc=0x5633e08f7c04
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0x5633e1075158, 0x5633e04e0164, 0xc460c10bc0)
	/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:79 +0x168 fp=0xc46b51fc98 sp=0xc46b51fc28 pc=0x5633e08c8de8
google.golang.org/grpc.(*Server).processStreamingRPC(0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00, 0xc4205474a0, 0x5633e15681c0, 0x0, 0x0, 0x0)
	/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:957 +0x2cf fp=0xc46b51fdc8 sp=0xc46b51fc98 pc=0x5633e04b315f
google.golang.org/grpc.(*Server).handleStream(0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00, 0x0)
	/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:1044 +0x12b3 fp=0xc46b51ff80 sp=0xc46b51fdc8 pc=0x5633e04b4cb3
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc421783690, 0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00)
	/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:589 +0xa1 fp=0xc46b51ffb8 sp=0xc46b51ff80 pc=0x5633e04bad71
runtime.goexit()
	/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc46b51ffc0 sp=0xc46b51ffb8 pc=0x5633e00fcad1
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:587 +

--------------------etcd3 just have warning log-------------------------

WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport

FengXingYuXin commented 5 years ago

The log may be the result, not the cause. You can run perf top -p <pid> to troubleshoot the CPU hotspot.

phosae commented 5 years ago

Can somebody help with this issue? perf top output is here:

$ perf top --sort comm,dso
99.34%  etcd  etcd
 0.24%  perf  perf
 0.10%  perf  [kernel]

$ perf top
 9.33%  etcd      [.] 0x00000000006315cd
 8.77%  etcd      [.] _start
 2.54%  etcd      [.] 0x0000000000640a81
 2.49%  etcd      [.] 0x000000000064081f
 2.40%  etcd      [.] 0x0000000000640b48
 2.19%  etcd      [.] 0x0000000000640ad1
 2.19%  etcd      [.] 0x0000000000640aba
 2.16%  etcd      [.] 0x0000000000640a97
 2.11%  etcd      [.] 0x0000000000640824
 2.09%  etcd      [.] 0x0000000000640b5e
 2.03%  etcd      [.] 0x0000000000640b7e
 1.94%  etcd      [.] 0x0000000000640aa3
 1.94%  etcd      [.] 0x0000000000640ac5
 1.89%  etcd      [.] 0x0000000000640811
 1.88%  etcd      [.] 0x0000000000640807
 1.83%  etcd      [.] 0x0000000000640a8c
 1.80%  etcd      [.] 0x0000000000640aae
 1.45%  etcd      [.] 0x0000000000640b6b
 1.44%  etcd      [.] 0x0000000000640815
 1.41%  etcd      [.] 0x000000000064081a
 1.33%  etcd      [.] 0x0000000000640a7a
 1.24%  etcd      [.] 0x0000000000640b53
 1.15%  etcd      [.] 0x0000000000640b50
 1.02%  etcd      [.] 0x0000000000640a84
 1.01%  etcd      [.] 0x00000000006407fe
 1.01%  etcd      [.] 0x0000000000640b4c
 1.00%  etcd      [.] 0x00000000006407eb
 0.91%  etcd      [.] 0x0000000000640b45
 0.88%  etcd      [.] 0x0000000000640a9e
 0.88%  etcd      [.] 0x000000000064080c
 0.84%  etcd      [.] 0x0000000000640b57
 0.81%  [kernel]  [k] softirqentry_text_start
 0.74%  etcd      [.] 0x0000000000640b76
 0.72%  etcd      [.] 0x00000000006409c0
 0.70%  etcd      [.] 0x0000000000640b6f
 0.65%  etcd      [.] 0x00000000006409c4
 0.64%  etcd      [.] 0x00000000006407f9
 0.63%  etcd      [.] 0x0000000000640802
 0.60%  etcd      [.] 0x0000000000640b79
 0.60%  etcd      [.] 0x0000000000640a88
 0.59%  etcd      [.] 0x0000000000640a9a
 0.58%  [kernel]  [k] run_timer_softirq
 0.56%  etcd      [.] 0x0000000000640b66
 0.54%  etcd      [.] 0x0000000000640ab5
 0.54%  etcd      [.] 0x0000000000640acc
 0.53%  etcd      [.] 0x0000000000640b72
 0.49%  etcd      [.] 0x000000000063147c
 0.41%  etcd      [.] 0x0000000000640a7e
 0.41%  etcd      [.] 0x00000000006407e8
 0.38%  etcd      [.] 0x0000000000640ab1
 0.35%  etcd      [.] 0x0000000000640a94
 0.34%  etcd      [.] 0x0000000000640a22
 0.32%  etcd      [.] 0x000000000062fdfd
 0.31%  etcd      [.] 0x00000000006315eb
 0.30%  etcd      [.] 0x0000000000640ac8
 0.30%  etcd      [.] 0x00000000006313cf
 0.30%  etcd      [.] 0x0000000000640b63
 0.29%  etcd      [.] 0x0000000000640a26
 0.28%  [kernel]  [k] lock_text_start
 0.27%  etcd      [.] 0x00000000006313d4
 0.27%  etcd      [.] 0x0000000000631528
 0.27%  etcd      [.] 0x000000000062fee6
 0.26%  etcd      [.] 0x000000000057b1c7
 0.25%  etcd      [.] 0x0000000000631457
 0.24%  etcd      [.] 0x00000000006407ef
 0.19%  etcd      [.] 0x000000000063140a
 0.18%  etcd      [.] 0x000000000063155b
 0.17%  etcd      [.] 0x0000000000640b41

$ perf top -g
Children  Self  Shared Object  Symbol

jingyih commented 5 years ago

As a sanity check, make sure all 3 etcd servers are up and running, and they are connected to each other.

phosae commented 5 years ago

As a sanity check, make sure all 3 etcd servers are up and running, and they are connected to each other.

Every time we restart the cluster, each node is up and running and they are connected to each other. But several days later, the CPU overload problem makes the nodes fail to receive heartbeats from each other.

The only thing we can do is restart the cluster, but that doesn't bring the high CPU and virtual memory usage back down.

weopm commented 4 years ago

This bug is fixed in #10218.

phosae commented 4 years ago

This bug is fixed in #10218.

Thanks, this problem hasn't reappeared since we updated our etcd cluster.

liujjcdl commented 4 years ago

@weopm May I know if etcd 3.3.12 still has the issue? We have a similar high CPU usage issue on etcd 3.3.12. Thanks!

weopm commented 4 years ago

This bug still exists on etcd 3.3.18. Maybe you can apply a patch based on (https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L560) or update to etcd 4.4.3.

liujjcdl commented 4 years ago

@weopm thanks for the quick reply, to double confirm, should it be 3.4.3 has the fix?

weopm commented 4 years ago

This bug is fixed on etcd 4.4.3, not 3.4.3. You can review and confirm the code yourself at (https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L560).

cjquinon commented 4 years ago

@weopm - We're seeing this issue in production workloads on 3.3.12. I noticed that there was a cherry-pick commit for 3.4, which was merged, and a cherry-pick for 3.3, which was closed and not merged.

Is it possible to get this fix into 3.3 and create a new patch release?

I do see it's in the v3.4.3 tag as well: https://github.com/etcd-io/etcd/blob/v3.4.3/etcdserver/v3_server.go#L607

cjquinon commented 4 years ago

PR to cherry-pick into 3.3: https://github.com/etcd-io/etcd/pull/11378