databendlabs / openraft

rust raft with improvements
Apache License 2.0

Metrics: add last acked heartbeat timestamp for each follower/learner #1176

Closed drmingdrmer closed 4 months ago

drmingdrmer commented 4 months ago

I guess this won't work in the case where no changes will be made to the state machine after a follower node becomes offline as the applied log indexes will remain the same, thus the difference will not change.

You're correct. To better assess the connectivity status of a follower node, it would be beneficial to add a metric that tracks the timestamp of the last acknowledged heartbeat from each follower. This additional information would provide a more accurate and timely indication of the follower's connectivity status. Here's how we could refine this idea:

  1. Add a new field to the RaftMetrics struct specifically for follower heartbeat information:
pub struct RaftMetrics<C: RaftTypeConfig> {
    // ... existing fields ...
    pub follower_heartbeats: HashMap<C::NodeId, Instant>,
}
  2. Update this field whenever a follower acknowledges a heartbeat or successfully replicates an entry:
impl<C: RaftTypeConfig> RaftMetrics<C> {
    pub fn update_follower_heartbeat(&mut self, follower_id: C::NodeId) {
        self.follower_heartbeats.insert(follower_id, Instant::now());
    }
}
  3. In the leader's routine that checks follower health, compare the time elapsed since each follower's last acknowledgement against a timeout:
const HEARTBEAT_TIMEOUT: Duration = Duration::from_secs(/* define your timeout */);

for (follower_id, last_heartbeat) in &raft_metrics.follower_heartbeats {
    if last_heartbeat.elapsed() > HEARTBEAT_TIMEOUT {
        // Consider this follower as potentially disconnected
    }
}
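
To make the three steps concrete, here is a minimal, self-contained sketch of the idea. It is not openraft's actual API: `HeartbeatTracker`, `record_ack`, and `stale_followers` are hypothetical names, node IDs are assumed to be plain `u64` values instead of `C::NodeId`, and the current time is passed in explicitly so the check can be exercised without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the proposed metrics field; not an openraft type.
/// Maps each follower/learner ID to the time of its last acknowledged heartbeat.
#[derive(Debug, Default)]
pub struct HeartbeatTracker {
    last_acked: HashMap<u64, Instant>,
}

impl HeartbeatTracker {
    /// Record that `follower_id` acknowledged a heartbeat (or replicated an
    /// entry) at time `at`. An older timestamp never overwrites a newer one.
    pub fn record_ack(&mut self, follower_id: u64, at: Instant) {
        let entry = self.last_acked.entry(follower_id).or_insert(at);
        if at > *entry {
            *entry = at;
        }
    }

    /// Return the IDs (sorted ascending) of followers whose last
    /// acknowledgement is more than `timeout` before `now`.
    pub fn stale_followers(&self, now: Instant, timeout: Duration) -> Vec<u64> {
        let mut stale: Vec<u64> = self
            .last_acked
            .iter()
            // saturating_duration_since yields zero if `acked` is after `now`.
            .filter(|(_, acked)| now.saturating_duration_since(**acked) > timeout)
            .map(|(id, _)| *id)
            .collect();
        stale.sort_unstable();
        stale
    }
}
```

Unlike the applied-log-index difference, this check still flags a follower that goes offline while the cluster is idle: if follower 1 last acknowledged at `t0` and follower 2 at `t0 + 1s`, then calling `stale_followers(t0 + 1s, 500ms)` reports only follower 1, even though no log entries were applied in between.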

Originally posted by @drmingdrmer in https://github.com/datafuselabs/openraft/discussions/1174#discussioncomment-10079123

SteveLauC commented 4 months ago

This issue can be closed as it was completed in #1177.