etcd-io / dbtester

Distributed database benchmark tester
https://github.com/etcd-io/dbtester/tree/master/test-results
Apache License 2.0

tune zookeeper #28

Closed: gyuho closed this issue 8 years ago

gyuho commented 8 years ago

Currently we use the default configuration, which might not be optimal in our testing environment.

gyuho commented 8 years ago

First, all tests will be run with 1,000 clients, 8-byte keys, 256-byte values, and 2 million keys until we have a clear understanding of the tested databases. For this issue, we just need to find the right snapCount for ZooKeeper. I will try snapCount values of 100,000, 500,000, and 1,000,000 (the three cases below).
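For context, snapCount controls how many transactions ZooKeeper writes to the transaction log before it rolls the log and snapshots the in-memory tree; the default is 100,000. A minimal zoo.cfg sketch of the knob being varied is below. Aside from snapCount and the data path visible in the logs, the values and peer addresses are placeholders, not the actual test configuration:

```
# zoo.cfg sketch: placeholder values, only snapCount is the knob under test
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/gyuho/zookeeper/data.zk
clientPort=2181
# transactions between snapshots; varied per run: 100000 / 500000 / 1000000
snapCount=100000
# 3-node ensemble; peer addresses are placeholders
server.1=10.240.0.x:2888:3888
server.2=10.240.0.y:2888:3888
server.3=10.240.0.z:2888:3888
```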

gyuho commented 8 years ago

[Plot: bench-01 average latency (ms)]

The plot shows spikes. If you compare the logs against the spikes, you can tell that leader election was happening...

  1. snapCount 100,000: throughput drops from 23600 to 1556 at 1460761225 (Fri, 15 Apr 2016 23:00:25 GMT) (https://github.com/coreos/dbtester/blob/master/bench-results/2016041502/01-zk-snapshot1-timeseries.csv). The log says:
    • 2016-04-15 23:00:26,154 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 5116ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
    • 2016-04-15 23:00:30,467 [myid:1] - WARN [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@89] - Exception when following the leader
    • 2016-04-15 23:01:14,808 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection@818] - New election. My id = 1, proposed zxid=0x1001736bc (leader election happens)
  2. snapCount 500,000: throughput drops from 37912 to 8415 at 1460761225 (Fri, 15 Apr 2016 23:00:25 GMT) (https://github.com/coreos/dbtester/blob/master/bench-results/2016041502/01-zk-snapshot2-timeseries.csv). The log says:
    • 2016-04-15 23:00:26,115 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 2480ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
  3. snapCount 1,000,000: throughput drops from 46801 to 11525 at 1460761223 (Fri, 15 Apr 2016 23:00:23 GMT) (https://github.com/coreos/dbtester/blob/master/bench-results/2016041502/01-zk-snapshot3-timeseries.csv). The log says:
    • 2016-04-15 23:00:26,115 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 2480ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

So when spikes happen, there is either long-running snapshotting or a leader election. With a bigger snapCount, you see fewer of these.

I see the same pattern in other trials. Full logs can be found in cloud storage.
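To line the timeseries drops up with the log timestamps, a small throwaway helper like the sketch below can scan one of the timeseries CSVs for the largest drop in throughput between consecutive samples and print the unix timestamp of that second. This is a hypothetical helper, not part of dbtester, and the column names "unix-ts" and "throughput" are assumptions about the CSV header, not the actual schema:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
	"strconv"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <timeseries.csv>", os.Args[0])
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	rows, err := csv.NewReader(f).ReadAll()
	if err != nil || len(rows) < 3 {
		log.Fatalf("read %s: %v", os.Args[1], err)
	}

	// Locate the assumed columns in the header row.
	tsIdx, thIdx := -1, -1
	for i, name := range rows[0] {
		switch name {
		case "unix-ts":
			tsIdx = i
		case "throughput":
			thIdx = i
		}
	}
	if tsIdx < 0 || thIdx < 0 {
		log.Fatal(`expected "unix-ts" and "throughput" columns (assumed names)`)
	}

	// Track the steepest sample-to-sample throughput drop.
	var worstDrop float64
	var worstTS int64
	var prev float64
	for i, row := range rows[1:] {
		ts, _ := strconv.ParseInt(row[tsIdx], 10, 64)
		th, _ := strconv.ParseFloat(row[thIdx], 64)
		if i > 0 && prev-th > worstDrop {
			worstDrop, worstTS = prev-th, ts
		}
		prev = th
	}
	fmt.Printf("largest drop: %.0f requests/sec at unix timestamp %d\n", worstDrop, worstTS)
}
```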

/cc @xiang90

xiang90 commented 8 years ago

@gyuho For 2 and 3, how do you know they are caused by snapshotting? Any related logging?

gyuho commented 8 years ago

@xiang90

For case 2, when the throughput dropped, this was happening:

2016-04-15 23:00:18,495 [myid:1] - INFO  [Snapshot Thread:FileTxnSnapLog@240] - Snapshotting: 0x10017ecf4 to /home/gyuho/zookeeper/data.zk/version-2/snapshot.10017ecf4
2016-04-15 23:00:26,115 [myid:1] - WARN  [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 2480ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

...

2016-04-15 23:00:27,304 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /10.240.0.23:44493

So from 23:00:18 to 23:00:27, zk was doing snapshotting.

For case 3, it drops its throughput from 46801 to 11525 at 1460761223 (Fri, 15 Apr 2016 23:00:23 GMT):

2016-04-15 22:59:59,889 [myid:1] - INFO  [Snapshot Thread:FileTxnSnapLog@240] - Snapshotting: 0x1000cbe5a to /home/gyuho/zookeeper/data.zk/version-2/snapshot.1000cbe5a
2016-04-15 23:00:07,958 [myid:1] - WARN  [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 1688ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2016-04-15 23:00:24,001 [myid:1] - WARN  [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 2245ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2016-04-15 23:00:24,424 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception

2016-04-15 23:00:25,425 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /10.240.0.19:45766
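As an aside, the unix timestamps from the timeseries CSVs can be mapped onto the wall-clock format the ZooKeeper logs use, so the window around each drop can be grepped directly. A minimal sketch using the two timestamps from the cases above:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps where throughput dropped in cases 2 and 3.
	for _, ts := range []int64{1460761225, 1460761223} {
		// ZooKeeper log lines are prefixed with "2006-01-02 15:04:05,..." in UTC here.
		fmt.Printf("%d -> %s\n", ts, time.Unix(ts, 0).UTC().Format("2006-01-02 15:04:05"))
	}
}
```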

xiang90 commented 8 years ago

@gyuho Awesome! Thanks!

gyuho commented 8 years ago

Np! I will run more tests.

gyuho commented 8 years ago

Closing via https://github.com/coreos/dbtester/blob/master/bench-results/2016041502/README.md. Will come back to this if we think zk is not performing as well as it should.