I think the current functional-tester is hitting etcd too hard. I have been running the functional tests with a cluster of three 8-CPU, 16GB-memory nodes and a 16-CPU, 30GB-memory machine for the client (tester). The write rate was about 13,000 keys per second. And this happened:
The `mvcc: database space exceeded` error is not a problem. The problem is `context deadline exceeded (10.240.0.19:2379)`: in this case node3 (10.240.0.19:2379) was never able to catch up or recover from the injected failure (`2016-05-29 20:06:40.097352 I | etcd-tester: [round#44 case#1] injecting failure "kill majority of the cluster"`). Here are more logs (node3 is the node that took too long to recover):
# tester (the recovery timeout limit was increased to debug https://github.com/coreos/etcd/issues/5317)
2016-05-29 20:06:40.061801 I | started stressers
2016-05-29 20:06:40.097352 I | injecting failure "kill majority of the cluster"
2016-05-29 20:06:40.139359 I | injected failure
# this kills node2, node3
# node2
2016-05-29 20:06:40.099573 N | osutil: received terminated signal, shutting down...
# node3 (was leader)
2016-05-29 20:06:40.028660 N | osutil: received terminated signal, shutting down...
# node1 lost leader node3
2016-05-29 20:06:41.712025 I | raft: raft.node: node1 lost leader node3 at term 277
And node1 becomes the new leader:
# node1
2016-05-29 20:06:44.713875 I | raft: node1 became leader at term 279
2016-05-29 20:06:44.728354 I | etcdserver: start to snapshot (applied: 61628860, lastsnap: 61618841)
2016-05-29 20:06:44.735949 I | etcdserver: saved snapshot at index 61628860
# node2
2016-05-29 20:06:44.727350 I | raft: raft.node: node2 elected leader node1 at term 279
2016-05-29 20:06:44.736897 I | etcdserver: published {Name:etcd-1 ClientURLs:[http://10.240.0.18:2379]} to cluster b63fc30507760bc9
2016-05-29 20:06:44.736930 I | etcdmain: ready to serve client requests
# node3
2016-05-29 20:06:44.973825 I | raft: raft.node: node3 elected leader node1 at term 279
2016-05-29 20:06:44.998673 I | raft: node3 [term: 279] ignored a MsgVote message with lower term from node1 [term: 278]
# Now node3 keeps getting these errors for about 20 minutes
2016-05-29 20:06:51.897773 E | etcdserver: publish error: etcdserver: request timed out
2016-05-29 20:06:58.898074 E | etcdserver: publish error: etcdserver: request timed out
...
2016-05-29 20:18:33.241905 E | etcdserver: publish error: etcdserver: request timed out
2016-05-29 20:18:39.726997 W | rafthttp: dropped internal raft message from node1 since receiving buffer is full (overloaded network)
...
2016-05-29 20:26:09.024560 E | etcdserver: publish error: etcdserver: request timed out
2016-05-29 20:26:09.070632 W | rafthttp: dropped internal raft message from node1 since receiving buffer is full (overloaded network) [merged 12 repeated lines in 1.98s]
2016-05-29 20:26:10.070643 W | rafthttp: dropped internal raft message from node1 since receiving buffer is full (overloaded network) [merged 5 repeated lines in 1s]
...
2016-05-29 20:26:13.394135 W | rafthttp: dropped internal raft message to node1 since streamMsg's sending buffer is full (bad/overloaded network)
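The dropped internal raft message warnings indicate a bounded receive buffer that drops messages rather than block when the consumer falls behind. A minimal Go sketch of that general pattern (hypothetical names and buffer size, not the actual rafthttp code):

```go
package main

import (
	"fmt"
	"time"
)

type message struct{ from uint64 }

// deliver attempts a non-blocking send into a bounded buffer. When the
// consumer (the raft loop) cannot drain it fast enough, the message is
// dropped instead of blocking the network handler.
func deliver(recvc chan message, m message) {
	select {
	case recvc <- m:
		// delivered to the raft loop
	default:
		fmt.Printf("dropped internal raft message from %x since receiving buffer is full\n", m.from)
	}
}

func main() {
	recvc := make(chan message, 2) // tiny buffer for demonstration
	for i := 0; i < 5; i++ {
		deliver(recvc, message{from: 1})
	}
	// a slow consumer drains whatever made it through
	for len(recvc) > 0 {
		fmt.Println("processed", <-recvc)
		time.Sleep(10 * time.Millisecond)
	}
}
```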
The `etcdserver: request timed out` error can happen when the server is overloaded (here the raft request being written was `ID:17308552749874660866 Method:"PUT" Path:"/0/members/c1d7759a17c2f034/attributes" Val:"{\"name\":\"etcd-0\",\"clientURLs\":[\"http://10.240.0.17:2379\"]}" Dir:false PrevValue:"" PrevIndex:0 Expiration:0 Wait:false Since:0 Recursive:false Sorted:false Quorum:false Time:0 Stream:false`).
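The repeating publish error lines are consistent with a retry loop that re-proposes the member attributes with a per-attempt timeout until they commit. A simplified, hypothetical sketch of such a loop (not the actual etcdserver code; `propose` is a stand-in for committing a request through raft):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("etcdserver: request timed out")

// propose stands in for committing a request through raft; here it
// simulates a cluster that never commits, so every attempt times out.
func propose(ctx context.Context) error {
	<-ctx.Done()
	return errTimeout
}

// publishLoop keeps retrying until the attributes commit, logging each
// failure; this mirrors the repeating "publish error" lines above.
func publishLoop(attempts int, timeout time.Duration) {
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := propose(ctx)
		cancel()
		if err == nil {
			fmt.Println("published member attributes")
			return
		}
		fmt.Printf("publish error: %v\n", err)
	}
}

func main() {
	publishLoop(3, 100*time.Millisecond)
}
```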
We should either improve etcd logging to make it easier to tell when it is overloaded, or rate-limit the tester's stresser.
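For the rate-limiting option, a minimal sketch of capping the stresser's write rate with golang.org/x/time/rate (the 5,000 keys/sec limit, burst size, and PUT stub are placeholder assumptions, not tuned values):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Cap the stresser at a sustainable write rate instead of writing
	// as fast as the client machine allows.
	limiter := rate.NewLimiter(rate.Limit(5000), 100)

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	puts := 0
	for {
		if err := limiter.Wait(ctx); err != nil {
			break // context done; stop stressing
		}
		// here the real stresser would issue a PUT against etcd
		puts++
	}
	fmt.Println("rate-limited PUTs issued:", puts)
}
```

A token-bucket limiter like this allows short bursts while bounding the sustained rate, so the cluster under test is still stressed but not driven past the point where a killed member can catch back up.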