etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

functional-tester: tester is stressing etcd too hard #5499

Closed. gyuho closed this issue 8 years ago.

gyuho commented 8 years ago

I think the current functional-tester is hitting etcd too hard. I have been running the functional tests with a cluster of three 8-CPU, 16GB-memory nodes and a 16-CPU, 30GB-memory machine as the client (tester). The write rate was about 13,000 keys per second, and this happened:


tester (the recovery timeout limit was increased so I could debug https://github.com/coreos/etcd/issues/5317):

2016-05-29 20:06:40.061788 I | etcd-tester: [round#44 case#0] starting the stressers...
2016-05-29 20:06:40.061801 I | etcd-tester: [round#44 case#0] started stressers
2016-05-29 20:06:40.061829 I | etcd-tester: [round#44 case#0] succeed!
2016-05-29 20:06:40.097352 I | etcd-tester: [round#44 case#1] injecting failure "kill majority of the cluster"
2016-05-29 20:06:40.139359 I | etcd-tester: [round#44 case#1] injected failure
2016-05-29 20:06:40.139412 I | etcd-tester: [round#44 case#1] recovering failure "kill majority of the cluster"
2016-05-29 20:06:41.142031 W | etcd-tester: #0 setHealthKey error (rpc error: code = 4 desc = context deadline exceeded (10.240.0.17:2379))
2016-05-29 20:06:43.142713 W | etcd-tester: #1 setHealthKey error (rpc error: code = 4 desc = context deadline exceeded (10.240.0.17:2379))

...

2016-05-29 20:08:39.489727 W | etcd-tester: #58 setHealthKey error (rpc error: code = 4 desc = context deadline exceeded (10.240.0.19:2379))
2016-05-29 20:08:41.512028 W | etcd-tester: #59 setHealthKey error (rpc error: code = 4 desc = context deadline exceeded (10.240.0.19:2379))
2016-05-29 20:08:43.533512 W | etcd-tester: #60 setHealthKey error (rpc error: code = 4 desc = context deadline exceeded (10.240.0.19:2379))
2016-05-29 20:08:45.557819 W | etcd-tester: #61 setHealthKey error (rpc error: code = 4 desc = context deadline exceeded (10.240.0.19:2379))

...

2016-05-29 20:10:02.444335 W | etcd-tester: #99 setHealthKey error (rpc error: code = 4 desc = context deadline exceeded (10.240.0.19:2379))
2016-05-29 20:10:04.460945 W | etcd-tester: #100 setHealthKey error (rpc error: code = 4 desc = context deadline exceeded (10.240.0.19:2379))

...

2016-05-29 20:16:51.986884 W | etcd-tester: #301 setHealthKey error (rpc error: code = 4 desc = context deadline exceeded (10.240.0.19:2379))
2016-05-29 20:16:54.023672 W | etcd-tester: #302 setHealthKey error (rpc error: code = 4 desc = context deadline exceeded (10.240.0.19:2379))
2016-05-29 20:16:55.026683 W | etcd-tester: #303 setHealthKey error (rpc error: code = 8 desc = etcdserver: mvcc: database space exceeded (10.240.0.17:2379))
2016-05-29 20:16:56.029911 W | etcd-tester: #304 setHealthKey error (rpc error: code = 8 desc = etcdserver: mvcc: database space exceeded (10.240.0.17:2379))

...

2016-05-31 00:06:02.716756 W | etcd-tester: #99997 setHealthKey error (rpc error: code = 8 desc = etcdserver: mvcc: database space exceeded (10.240.0.17:2379))
2016-05-31 00:06:03.721698 W | etcd-tester: #99998 setHealthKey error (rpc error: code = 8 desc = etcdserver: mvcc: database space exceeded (10.240.0.17:2379))
2016-05-31 00:06:04.727323 W | etcd-tester: #99999 setHealthKey error (rpc error: code = 8 desc = etcdserver: mvcc: database space exceeded (10.240.0.17:2379))
2016-05-31 00:06:05.727660 I | etcd-tester: [round#44 case#1] recovery error: rpc error: code = 8 desc = etcdserver: mvcc: database space exceeded (10.240.0.17:2379)
2016-05-31 00:06:05.727731 I | etcd-tester: [round#44 case#1] cleaning up...
2016-05-31 00:06:08.705008 W | etcd-tester: #0 setHealthKey error (rpc error: code = 4 desc = context deadline exceeded (10.240.0.17:2379))

The mvcc: database space exceeded error is not a problem. The problem is context deadline exceeded (10.240.0.19:2379): node3 (10.240.0.19:2379) was never able to catch up or recover from the failure injected at 2016-05-29 20:06:40.097352 ([round#44 case#1] injecting failure "kill majority of the cluster"). Each of those errors is one failed setHealthKey attempt; a rough sketch of what that check does is below.
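For reference, setHealthKey is basically a short-deadline write against each member, retried until every member answers; the "#N setHealthKey error" counters above are those retries. A minimal sketch of that shape, using the clientv3 API with illustrative endpoints and timeouts (not the actual functional-tester code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// setHealthKey writes a small key to each member with a short per-request
// deadline; any error means the cluster has not recovered yet and the caller
// retries, producing the "#N setHealthKey error" lines in the tester log.
// Sketch only: endpoints, key/value, and timeouts are illustrative.
func setHealthKey(endpoints []string) error {
	for _, ep := range endpoints {
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   []string{ep},
			DialTimeout: time.Second,
		})
		if err != nil {
			return fmt.Errorf("%v (%s)", err, ep)
		}
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		_, err = cli.Put(ctx, "health", "good")
		cancel()
		cli.Close()
		if err != nil {
			// e.g. "context deadline exceeded (10.240.0.19:2379)" while the
			// member is still catching up after the kill.
			return fmt.Errorf("%v (%s)", err, ep)
		}
	}
	return nil
}
```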


Here are more logs (node3 is the node that took too long to recover):

# tester
2016-05-29 20:06:40.061801 I | started stressers
2016-05-29 20:06:40.097352 I | injecting failure "kill majority of the cluster"
2016-05-29 20:06:40.139359 I | injected failure
# this kills node2, node3

# node2
2016-05-29 20:06:40.099573 N | osutil: received terminated signal, shutting down...

# node3 (was leader)
2016-05-29 20:06:40.028660 N | osutil: received terminated signal, shutting down...

# node1 lost leader node3
2016-05-29 20:06:41.712025 I | raft: raft.node: node1 lost leader node3 at term 277

And node1 becomes the new leader:

# node1
2016-05-29 20:06:44.713875 I | raft: node1 became leader at term 279
2016-05-29 20:06:44.728354 I | etcdserver: start to snapshot (applied: 61628860, lastsnap: 61618841)
2016-05-29 20:06:44.735949 I | etcdserver: saved snapshot at index 61628860

# node2
2016-05-29 20:06:44.727350 I | raft: raft.node: node2 elected leader node1 at term 279
2016-05-29 20:06:44.736897 I | etcdserver: published {Name:etcd-1 ClientURLs:[http://10.240.0.18:2379]} to cluster b63fc30507760bc9
2016-05-29 20:06:44.736930 I | etcdmain: ready to serve client requests

# node3
2016-05-29 20:06:44.973825 I | raft: raft.node: node3 elected leader node1 at term 279
2016-05-29 20:06:44.998673 I | raft: node3 [term: 279] ignored a MsgVote message with lower term from node1 [term: 278]

# Now node3 keeps getting these errors for several minutes
2016-05-29 20:06:51.897773 E | etcdserver: publish error: etcdserver: request timed out
2016-05-29 20:06:58.898074 E | etcdserver: publish error: etcdserver: request timed out
...
2016-05-29 20:18:33.241905 E | etcdserver: publish error: etcdserver: request timed out
2016-05-29 20:18:39.726997 W | rafthttp: dropped internal raft message from node1 since receiving buffer is full (overloaded network)
...
2016-05-29 20:26:09.024560 E | etcdserver: publish error: etcdserver: request timed out
2016-05-29 20:26:09.070632 W | rafthttp: dropped internal raft message from node1 since receiving buffer is full (overloaded network) [merged 12 repeated lines in 1.98s]
2016-05-29 20:26:10.070643 W | rafthttp: dropped internal raft message from node1 since receiving buffer is full (overloaded network) [merged 5 repeated lines in 1s]
...
2016-05-29 20:26:13.394135 W | rafthttp: dropped internal raft message to node1 since streamMsg's sending buffer is full (bad/overloaded network)

etcdserver: request timed out can happen when the server is overloaded. In this case, the raft request that kept timing out was a member-attribute publish: ID:17308552749874660866 Method:"PUT" Path:"/0/members/c1d7759a17c2f034/attributes" Val:"{\"name\":\"etcd-0\",\"clientURLs\":[\"http://10.240.0.17:2379\"]}" Dir:false PrevValue:"" PrevIndex:0 Expiration:0 Wait:false Since:0 Recursive:false Sorted:false Quorum:false Time:0 Stream:false

We should improve etcd logging to make it easier to tell when the server is overloaded, or rate-limit the tester's stresser.
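One way to do the latter is to put a token-bucket limiter in front of every stresser write, so the aggregate QPS stays well below what the cluster can absorb. A minimal sketch using golang.org/x/time/rate and the clientv3 API; the 5,000 keys/sec cap, burst size, key names, and endpoints are illustrative assumptions, not the tester's actual implementation:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"golang.org/x/time/rate"
)

// stress issues PUTs only as fast as the limiter allows, instead of as fast
// as the client machine can generate them (~13,000 keys/sec in this report).
func stress(cli *clientv3.Client, limiter *rate.Limiter) error {
	for i := 0; ; i++ {
		// Block until the token bucket grants one write.
		if err := limiter.Wait(context.Background()); err != nil {
			return err
		}
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		_, err := cli.Put(ctx, fmt.Sprintf("stress-%d", i%100000), "v")
		cancel()
		if err != nil {
			return err
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"10.240.0.17:2379", "10.240.0.18:2379", "10.240.0.19:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Cap the aggregate write rate at ~5,000 keys/sec with a small burst.
	if err := stress(cli, rate.NewLimiter(rate.Limit(5000), 100)); err != nil {
		fmt.Println("stresser stopped:", err)
	}
}
```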

gyuho commented 8 years ago

Closing since it's directly related to https://github.com/coreos/etcd/issues/5317.