databendlabs / openraft

Feature request: optimization for read-only requests #262

Open MrCroxx opened 2 years ago

MrCroxx commented 2 years ago

Section 6.4 of the Raft thesis proposes a way to optimize read-only requests with a read index, which etcd/raft and tikv/raft-rs already support (they further support a read-only optimization called lease read).

I haven't seen a plan for it in the roadmap. Will the optimization be supported?

The basic procedure to handle a read index with openraft could be as follows (a rough sketch in Rust follows the list):

  1. A client sends a read-only request to the leader or a follower.
  2. (optional) If a follower receives a read-only request, it sends a ReadIndex request to the current leader, if there is one. (This supports follower reads.)
  3. If a leader receives a read-only request, it sends a ReadIndex request to itself.
  4. (optional) If a newly elected leader receives a ReadIndex request before it has committed an empty entry, it saves the request to pending_read_indices. (This makes sure the leader has the largest committed index of the group.)
  5. When a leader receives a ReadIndex request, or a newly elected leader successfully commits an empty entry, the leader uses the current committed index as the ReadIndex.
  6. (optional) If the ReadIndex request came from a follower, the leader responds to the follower with the ReadIndex.
  7. When applying entries to the state machine, once the applied index reaches a ReadIndex, the related read-only request can be served.
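
For illustration only, here is a minimal sketch of the leader-side bookkeeping for steps 5 and 7, assuming a hypothetical `ReadIndexQueue` type; none of these names are existing openraft APIs:

```rust
use std::collections::BTreeMap;

/// Hypothetical leader-side bookkeeping for ReadIndex requests:
/// maps a ReadIndex to the read-only requests waiting for it.
struct ReadIndexQueue<Req> {
    pending_reads: BTreeMap<u64, Vec<Req>>,
}

impl<Req> ReadIndexQueue<Req> {
    /// Step 5: record the current committed index as the ReadIndex
    /// for an incoming read-only request.
    fn on_read_request(&mut self, committed: u64, req: Req) {
        self.pending_reads.entry(committed).or_default().push(req);
    }

    /// Step 7: after applying entries up to `applied`, every request
    /// whose ReadIndex is <= applied can be served safely.
    fn on_applied(&mut self, applied: u64) -> Vec<Req> {
        // Keys > applied stay pending; keys <= applied become ready.
        let still_waiting = self.pending_reads.split_off(&(applied + 1));
        let ready = std::mem::replace(&mut self.pending_reads, still_waiting);
        ready.into_values().flatten().collect()
    }
}
```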

MrCroxx commented 2 years ago

I'd like to contribute to the optimization, but I'm busy with my master's degree project, my thesis, and my internship job.

If time is not a big issue, I'd like to pick the challenge.

drmingdrmer commented 2 years ago

Currently, the extensible and decoupled overall structure of the framework has the highest priority, to make some performance-related refactoring possible, as suggested by @schreter.

Everything that is well defined in the Raft thesis has a relatively low priority because there won't be surprises in implementing it.

Unless somebody needs it. :)

schreter commented 2 years ago

Well, I was planning to suggest implementing lease read myself :-). This indeed can speed up the reads considerably. OTOH, if I understand it correctly, there is an additional cost in the takeover path, where the lease time must be awaited. Picking too short a lease time leads to frequent reassertions; picking too long a lease time delays elections.

For our project, we'll likely use an independent external sequencer keeping "transaction time", which, as a bonus, will allow us to safely read from replicas as well. But that's nowhere near as general as read leases :-).

In other words, for me, time is not a big issue, so feel free to experiment with it at your own pace.

drmingdrmer commented 2 years ago

> Well, I was planning to suggest implementing lease read myself :-). This indeed can speed up the reads considerably. OTOH, if I understand it correctly, there is an additional cost in the takeover path, where the lease time must be awaited. Picking too short a lease time leads to frequent reassertions; picking too long a lease time delays elections.
>
> For our project, we'll likely use an independent external sequencer keeping "transaction time", which, as a bonus, will allow us to safely read from replicas as well. But that's nowhere near as general as read leases :-).
>
> In other words, for me, time is not a big issue, so feel free to experiment with it at your own pace.

Agree.

Raft defines its own time with term and log index. Although people try hard to synchronize this raft-defined time with the wall-clock time, the best way is to forget the wall-clock time.

schreter commented 2 years ago

> Raft defines its own time with term and log index.

Hehe, I didn't intend to make a pun on "time" :-). The "transaction time" we use is a kind of MVCC, so as long as the follower replica knows that a certain time point from the sequencer was replicated, any request with an older "transaction time" than this time point can be answered directly. But again, that's relevant for our specific use case; it's not a general solution.

> Although people try hard to synchronize this raft-defined time with the wall-clock time, the best way is to forget the wall-clock time.

+1. We also don't use wall-clock time.

Of course, the wall-clock time is still used in Raft for timeouts... To get rid of that, we need something like this: https://web.stanford.edu/class/ee380/Abstracts/161116-slides.pdf.

drmingdrmer commented 2 years ago

> Of course, the wall-clock time is still used in Raft for timeouts... To get rid of that, we need something like this: https://web.stanford.edu/class/ee380/Abstracts/161116-slides.pdf.

Maybe the election timeout can be replaced with some external event source that triggers re-election. E.g., a user installs a metrics watcher to monitor the leader messages received by a raft node, whose duty is to tell the raft node to start an election if it has not seen a leader message for 10 seconds.

This way the raft core becomes purely event-driven.
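
For example, a rough sketch of such a watcher, assuming hypothetical `last_leader_msg` and `trigger_elect` hooks (openraft does not expose these; the 10-second threshold is the one from the example above):

```rust
use std::time::{Duration, Instant};

/// Hypothetical user-installed watcher: it observes when this node last
/// heard from a leader and tells the raft core to start an election if
/// that was too long ago. The raft core itself stays purely event-driven.
async fn election_watcher(
    last_leader_msg: impl Fn() -> Instant, // fed by a metrics watcher
    trigger_elect: impl Fn(),              // tells the raft node to elect
) {
    loop {
        tokio::time::sleep(Duration::from_secs(1)).await;
        if last_leader_msg().elapsed() > Duration::from_secs(10) {
            trigger_elect();
        }
    }
}
```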

schreter commented 2 years ago

> Maybe the election timeout can be replaced with some external event source that triggers re-election.

That's indeed a good idea, and it would help in our project too (we have a "liveness" indication between a pair of nodes, which can be used for any communication between this pair of nodes in multiple consensus domains that happen to have replicas on this pair). It would also help remove the non-determinism from tests.

But, it will require fairly large refactoring of various tests, I suppose.

lichuang commented 2 years ago

> +1. We also don't use wall-clock time.

In the etcd raft implementation, time events are triggered by the user via tick() calls on the interface, and the heartbeat time and election time are defined in tick units, which are not wall-clock time. See:

```go
type Config struct {
    // ElectionTick is the number of Node.Tick invocations that must pass between
    // elections. That is, if a follower does not receive any message from the
    // leader of current term before ElectionTick has elapsed, it will become
    // candidate and start an election. ElectionTick must be greater than
    // HeartbeatTick. We suggest ElectionTick = 10 * HeartbeatTick to avoid
    // unnecessary leader switching.
    ElectionTick int

    // HeartbeatTick is the number of Node.Tick invocations that must pass between
    // heartbeats. That is, a leader sends heartbeat messages to maintain its
    // leadership every HeartbeatTick ticks.
    HeartbeatTick int
}
```

etcd raft time events such as election/heartbeat are fully triggered by the user. Maybe we can optimize openraft in this way; let me think about it.
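
Translated into a Rust-flavored sketch of the same tick-driven idea (hypothetical types, not openraft's actual API):

```rust
/// Hypothetical logical clock driven entirely by application ticks;
/// the core never reads wall-clock time itself.
struct TickClock {
    elapsed_ticks: u64,
    election_ticks: u64, // e.g. 10 * heartbeat ticks, as etcd suggests
}

impl TickClock {
    /// The application calls this at whatever cadence it likes.
    /// Returns true when an election should be started.
    fn tick(&mut self) -> bool {
        self.elapsed_ticks += 1;
        if self.elapsed_ticks >= self.election_ticks {
            self.elapsed_ticks = 0;
            return true;
        }
        false
    }

    /// Called on every message received from the current leader.
    fn reset(&mut self) {
        self.elapsed_ticks = 0;
    }
}
```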

lichuang commented 2 years ago

Emm, after I read the source, I think refactoring the time-trigger policy the etcd way may be difficult in openraft.

In etcd, the raft core and the application communicate via message arrays (sent messages, committed messages, etc.), and the application level can send the messages in its own way. This decouples the network from the raft core, so etcd can be driven by ticks from the application.

But openraft has the RaftNetwork interface, and the raft core sends messages directly through it.

drmingdrmer commented 2 years ago

> In etcd, the raft core and the application communicate via message arrays (sent messages, committed messages, etc.), and the application level can send the messages in its own way. This decouples the network from the raft core, so etcd can be driven by ticks from the application.

I'm trying to extract the algorithm part out of the raft core and leave everything else, such as RaftStorage, RaftNetwork, the timer, and the message channels, to a runtime, with the timer as a customizable component.
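
As a sketch of that separation (all names hypothetical): the extracted algorithm core would only consume events, and the timer would be just one more event producer owned by the runtime, so swapping it never touches the core:

```rust
/// Hypothetical event type consumed by the extracted algorithm core.
/// The runtime owns RaftStorage, RaftNetwork, the channels, and the
/// timer, and feeds everything in as plain events.
enum CoreEvent<Msg> {
    FromNetwork(Msg),  // a raft message delivered by the runtime
    Tick,              // emitted by whichever timer the user installed
    ElectionTimeout,   // emitted by a timer or an external watcher
}
```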

avantgardnerio commented 1 year ago

> I was planning to suggest implementing lease read myself :-). This indeed can speed up the reads considerably

@drmingdrmer @schreter if I understand this correctly, that would mean there would be a leader that knows it's the leader for a given time interval? Could this be used to assign unique monotonic timestamps from the leader without appending them to the raft log?

I ask because we don't (yet) have this:

> For our project, we'll likely use an independent external sequencer keeping "transaction time"

and the options as I see them are:

  1. create a proprietary timestamp oracle separate from raft (finicky, essentially requiring re-implementing something like raft for failover, split brain, etc.)
  2. do what we do now and get unique timestamps from raft, but through the write path, including log persistence to disk (slow)
  3. switch to tikv's raft (lose the tokio/async goodness, and also have to implement the timestamp feature? I don't see that they have it yet, but with leader leases it should be doable)
  4. switch to logical clocks or HLCs, but then we have to deal with timestamps being "pushed" into the future, read refreshes, etc.

> Unless somebody needs it. :)

I'm willing to be corrected on any of these points, but I think my company "needs" it.

avantgardnerio commented 1 year ago

To be overly clear, here's pseudocode for what I'm proposing:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct Raft {
    timestamp_gen: AtomicU64,
    // ...
}

impl Raft {
    pub fn get_unique_ts(&self) -> Result<(Term, u64), Error> {
        if !self.is_leader() {
            return Err(Error::FwdToLeader);
        }
        if !self.has_lease() {
            return Err(Error::WaitForLease);
        }
        Ok((self.term, self.timestamp_gen.fetch_add(1, Ordering::SeqCst) + 1))
    }
}
```
drmingdrmer commented 1 year ago

> > I was planning to suggest implementing lease read myself :-). This indeed can speed up the reads considerably
>
> @drmingdrmer @schreter if I understand this correctly, that would mean there would be a leader that knows it's the leader for a given time interval?

You are correct! Openraft has a mechanism called leader lease. This mechanism ensures that a follower does not elect itself as a leader before a certain period of time has passed, determined as timer_config.leader_lease + timer_config.election_timeout. The follower refreshes the lease by updating the last-updated time of the vote. However, the current issue is that the leader does not update the lease yet.
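
In other words, a simplified sketch of the follower-side check described above (`vote_last_updated` is hypothetical bookkeeping; the config fields are the ones named above):

```rust
use std::time::{Duration, Instant};

/// A follower may elect itself as leader only after the lease expires.
fn lease_expired(
    vote_last_updated: Instant, // refreshed when the vote is updated
    leader_lease: Duration,     // timer_config.leader_lease
    election_timeout: Duration, // timer_config.election_timeout
) -> bool {
    Instant::now() > vote_last_updated + leader_lease + election_timeout
}
```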

To support reading consistent state, openraft needs to refresh the leader lease on the leader side (see the two code locations below) and add a public API such as Raft::read(sm: &mut impl RaftStateMachine) that atomically checks the leader lease and reads some data from the state machine:

https://github.com/datafuselabs/openraft/blob/54154202beec3e2de433044baae505cc80db375b/openraft/src/core/raft_core.rs#L1345-L1357

https://github.com/datafuselabs/openraft/blob/54154202beec3e2de433044baae505cc80db375b/openraft/src/engine/handler/vote_handler/mod.rs#L124

> Could this be used to assign unique monotonic timestamps from the leader without appending them to the raft log?

Yes, internally a timestamp is updated for every tick: https://github.com/datafuselabs/openraft/blob/54154202beec3e2de433044baae505cc80db375b/openraft/src/core/raft_core.rs#L1182-L1185

> I ask because we don't (yet) have this:
>
> > For our project, we'll likely use an independent external sequencer keeping "transaction time"
>
> and the options as I see them are:
>
> 1. create a proprietary timestamp oracle separate from raft (finicky, essentially requiring re-implementing something like raft for failover, split brain, etc.)
> 2. do what we do now and get unique timestamps from raft, but through the write path, including log persistence to disk (slow)
> 3. switch to tikv's raft (lose the tokio/async goodness, and also have to implement the timestamp feature? I don't see that they have it yet, but with leader leases it should be doable)
> 4. switch to logical clocks or HLCs, but then we have to deal with timestamps being "pushed" into the future, read refreshes, etc.

One potential solution is to use the raft-log-id as a pseudo time: since the committed log id (term, index) in raft is monotonic, it can be used as a measure of time. However, this solution requires the application to track the committed log id across the cluster.
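
As an illustrative sketch of that ordering: Rust's derived ordering compares the fields in declaration order, term first and then index, which matches raft's ordering of log ids:

```rust
/// Illustrative only: the committed log id as a pseudo time.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct PseudoTime {
    term: u64,
    index: u64,
}

fn happened_before(a: PseudoTime, b: PseudoTime) -> bool {
    a < b // lexicographic: term, then index
}
```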

> > Unless somebody needs it. :)
>
> I'm willing to be corrected on any of these points, but I think my company "needs" it.

Roger that.

avantgardnerio commented 1 year ago

@drmingdrmer , thank you for your quick response.

> One potential solution is to use the raft-log-id as a pseudo time

I think this is what we are doing now, but by using writes to get a unique timestamp. I've seen in the documentation that OpenRaft has been tested at 30k messages per second, but for our application we would need this volume (and possibly more) just for assigning timestamps, much less doing any "real" raft operations.

If everyone agrees who the leader is, I was hoping it could rapidly assign timestamps simply by bumping an atomic int and serving them as rapidly as possible.

> and add a public API such as Raft::read(sm: &mut impl RaftStateMachine) that atomically checks the leader lease and reads some data from the state machine.

So similar to the above, but instead of reading data it would increment-and-get.

Even typing this out, I realize it is hitting the limits of what is possible (Hyper says it can serve about 80k requests per second), so perhaps logical clocks are our only option, but I wanted to see if you thought the solution I proposed above was possible.

drmingdrmer commented 1 year ago

> I think this is what we are doing now, but by using writes to get a unique timestamp. I've seen in the documentation that OpenRaft has been tested at 30k messages per second, but for our application we would need this volume (and possibly more) just for assigning timestamps, much less doing any "real" raft operations.

Without network and storage overhead, it is about 1 million rps with 256 clients. But for a real-world application the TPS would be much less.

> Even typing this out, I realize it is hitting the limits of what is possible (Hyper says it can serve about 80k requests per second), so perhaps logical clocks are our only option, but I wanted to see if you thought the solution I proposed above was possible.

If you were building a timestamp-assigning service, it could be simpler:

The second can be done by binding timestamps to logs:

Because a new leader will propose and commit a blank log, the new leader will always generate greater timestamps.

And such a timestamp can easily be mapped to clock time by embedding a clock time in the raft log.
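
A sketch of that binding (all names hypothetical): a timestamp is the last committed log id plus a local counter, so the blank log that a new leader commits bumps the prefix and keeps timestamps monotonic across leader changes:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical timestamp generator bound to the raft log.
struct TsGen {
    term: u64,          // term of the last committed log id
    index: u64,         // index of the last committed log id
    counter: AtomicU64, // reset whenever (term, index) advances
}

impl TsGen {
    /// Timestamps order as (term, index, counter); a new leader commits
    /// a blank log first, so its (term, index) prefix is always greater.
    fn next(&self) -> (u64, u64, u64) {
        (
            self.term,
            self.index,
            self.counter.fetch_add(1, Ordering::Relaxed) + 1,
        )
    }
}
```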

avantgardnerio commented 1 year ago

@drmingdrmer thank you! Fantastic response. Nice to see it's already possible without changes.