baidu / braft

An industrial-grade C++ implementation of RAFT consensus algorithm based on brpc, widely used inside Baidu to build highly-available distributed systems.
Apache License 2.0
3.97k stars 884 forks source link

The `on_leader_start` event does not always fire #201

Closed kishorenc closed 4 years ago

kishorenc commented 4 years ago

I noticed this when running braft as a single-node cluster on my local machine. It fires sometimes but not always when running as a single node.

Is this expected?

Some of the logs around the time the braft service became ready, but no leader event fired, in case it is useful:

I0401 17:10:19 12803 /braft/src/braft/snapshot_executor.cpp:254] node default_group:172.21.252.40:8107:8108 snapshot_load_done, last_included_index: 3 last_included_term: 2 peers: "172.21.252.40:8107:8108"
I0401 17:10:19  6915 /braft/src/braft/node.cpp:571] node default_group:172.21.252.40:8107:8108 init, term: 5 last_log_id: (index=10,term=5) conf: 172.21.252.40:8107:8108 old_conf: 
I0401 17:10:19  6915 /braft/src/braft/node.cpp:1543] node default_group:172.21.252.40:8107:8108 term 5 start vote and grant vote self
I0401 17:10:19  6915 /braft/src/braft/raft_meta.cpp:498] Saved single stable meta, path /tmp/data/state/meta term 6 votedfor 172.21.252.40:8107:8108 time: 958
I0401 17:10:19  6915 /braft/src/braft/node.cpp:1776] node default_group:172.21.252.40:8107:8108 term 6 become leader of group 172.21.252.40:8107:8108 
W0401 17:10:19  6915 /braft/src/braft/node.cpp:878] node default_group:172.21.252.40:8107:8108 reset_peer need wait current conf change
Edward-xk commented 4 years ago

I noticed this when running braft as a single-node cluster on my local machine. It fires sometimes but not always when running as a single node.

Is this expected?

Some of the logs around the time the braft service became ready, but no leader event fired, in case it is useful:

I0401 17:10:19 12803 /braft/src/braft/snapshot_executor.cpp:254] node default_group:172.21.252.40:8107:8108 snapshot_load_done, last_included_index: 3 last_included_term: 2 peers: "172.21.252.40:8107:8108"
I0401 17:10:19  6915 /braft/src/braft/node.cpp:571] node default_group:172.21.252.40:8107:8108 init, term: 5 last_log_id: (index=10,term=5) conf: 172.21.252.40:8107:8108 old_conf: 
I0401 17:10:19  6915 /braft/src/braft/node.cpp:1543] node default_group:172.21.252.40:8107:8108 term 5 start vote and grant vote self
I0401 17:10:19  6915 /braft/src/braft/raft_meta.cpp:498] Saved single stable meta, path /tmp/data/state/meta term 6 votedfor 172.21.252.40:8107:8108 time: 958
I0401 17:10:19  6915 /braft/src/braft/node.cpp:1776] node default_group:172.21.252.40:8107:8108 term 6 become leader of group 172.21.252.40:8107:8108 
W0401 17:10:19  6915 /braft/src/braft/node.cpp:878] node default_group:172.21.252.40:8107:8108 reset_peer need wait current conf change

No, it's not expected. on_leader_start() should be called once a new leader is elected and it replays all previous logs successfully. In your case, I'm not sure if it is called or not because: 1.the logs you provide is too short to give a conclusion 2.do you print some msg in your fsm's on_leader_start()? if you do and not see that msg long after the Node is restarted, you may need to check if the Node is stuck at applying some log(check the value of `state_machine' on web page http://{ip}:{port}/raft_stat )

by the way, do you call reset_peer for some test purpose? it will fail when Node is changing configuration(including the process of becoming leader)

kishorenc commented 4 years ago

you may need to check if the Node is stuck at applying some log(check the value of `state_machine' on web page http://{ip}:{port}/raft_stat )

Thank you for that hint -- that was it! I had an unapplied log.

do you call reset_peer for some test purpose?

Yes, I'm doing that when the server is started in a single-node mode (mostly for convenience when local IP changes due to DHCP allocation) and braft is not able to start because of that.