First, all tests will be run with 1000 clients, an 8-byte key, a 256-byte value, and 2 million keys, until we have a clear understanding of the tested databases. For this issue, we just need to find the right snapCount value for ZooKeeper. I will try:
snapCount: 100,000 (default)
snapCount: 500,000
snapCount: 1,000,000
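For reference, snapCount can be set in zoo.cfg. A minimal sketch of such a config is below; only snapCount, dataDir, and clientPort are grounded in this issue (dataDir and clientPort appear in the logs), the rest are placeholder values, not the exact benchmark setup:

```
# zoo.cfg sketch -- illustrative only; tick/limit values and the quorum
# member list are placeholders, not the actual test configuration
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/gyuho/zookeeper/data.zk
clientPort=2181
# number of transactions between snapshots; ZooKeeper's default is 100000
snapCount=100000
# placeholder quorum members
server.1=zk-host-1:2888:3888
server.2=zk-host-2:2888:3888
server.3=zk-host-3:2888:3888
```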
The throughput graphs show spikes. If you compare the logs against the spikes, you can tell leader election was happening...
snapCount 100,000 drops its throughput from 23,600 to 1,556 at 1460761225 (Fri, 15 Apr 2016 23:00:25 GMT) (https://github.com/coreos/dbtester/blob/master/bench-results/2016041502/01-zk-snapshot1-timeseries.csv):
2016-04-15 23:00:26,154 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 5116ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2016-04-15 23:00:30,467 [myid:1] - WARN [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@89] - Exception when following the leader
2016-04-15 23:01:14,808 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection@818] - New election. My id = 1, proposed zxid=0x1001736bc
(leader election happens)

snapCount 500,000 drops its throughput from 37,912 to 8,415 at 1460761225 (Fri, 15 Apr 2016 23:00:25 GMT) (https://github.com/coreos/dbtester/blob/master/bench-results/2016041502/01-zk-snapshot2-timeseries.csv):
2016-04-15 23:00:26,115 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 2480ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
snapCount 1,000,000 drops its throughput from 46,801 to 11,525 at 1460761223 (Fri, 15 Apr 2016 23:00:23 GMT) (https://github.com/coreos/dbtester/blob/master/bench-results/2016041502/01-zk-snapshot3-timeseries.csv):
2016-04-15 23:00:26,115 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 2480ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
So when the spikes happen, there is either long-running snapshotting or a leader election. But with a bigger snapCount, you see fewer of these.
I see the same pattern in other trials. Full logs can be found in cloud storage.
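The drop timestamps above were read off the timeseries CSVs. As an illustration, a throughput-drop scan over such a CSV could look roughly like the sketch below; the file name and column layout (unix timestamp in column 0, throughput in column 2) are assumptions, not the actual dbtester output format.

```go
// Illustrative sketch only: scan a timeseries CSV and flag sudden
// throughput drops. The file name and column layout are assumptions.
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strconv"
)

func main() {
	f, err := os.Open("01-zk-snapshot1-timeseries.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	rows, err := csv.NewReader(f).ReadAll()
	if err != nil {
		panic(err)
	}
	if len(rows) < 2 {
		return
	}

	prev := -1.0
	for _, row := range rows[1:] { // skip header row
		if len(row) < 3 {
			continue
		}
		ts := row[0]
		tp, perr := strconv.ParseFloat(row[2], 64)
		if perr != nil {
			continue
		}
		// flag any second whose throughput is less than half of the previous one
		if prev > 0 && tp < prev*0.5 {
			fmt.Printf("throughput drop at %s: %.0f -> %.0f\n", ts, prev, tp)
		}
		prev = tp
	}
}
```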
/cc @xiang90
@gyuho For 2 and 3, how do you know they are caused by snapshotting? Any related logging?
@xiang90
For case 2, when the throughput dropped, this was happening:
2016-04-15 23:00:18,495 [myid:1] - INFO [Snapshot Thread:FileTxnSnapLog@240] - Snapshotting: 0x10017ecf4 to /home/gyuho/zookeeper/data.zk/version-2/snapshot.10017ecf4
2016-04-15 23:00:26,115 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 2480ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
...
2016-04-15 23:00:27,304 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /10.240.0.23:44493
So from 23:00:18 to 23:00:27 (about 9 seconds), zk was doing snapshotting.
For case 3, it drops its throughput from 46,801 to 11,525 at 1460761223 (Fri, 15 Apr 2016 23:00:23 GMT):
2016-04-15 22:59:59,889 [myid:1] - INFO [Snapshot Thread:FileTxnSnapLog@240] - Snapshotting: 0x1000cbe5a to /home/gyuho/zookeeper/data.zk/version-2/snapshot.1000cbe5a
2016-04-15 23:00:07,958 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 1688ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2016-04-15 23:00:24,001 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 2245ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2016-04-15 23:00:24,424 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
2016-04-15 23:00:25,425 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /10.240.0.19:45766
@gyuho Awesome! Thanks!
Np! I will run more tests.
Closing via https://github.com/coreos/dbtester/blob/master/bench-results/2016041502/README.md. We will come back to this if we think zk is not performing as well as it should; currently we use the default configuration, which might not be optimal in our testing environment.