etcd-io / raft

Raft library for maintaining a replicated state machine

Externalize log entry flow control #64

Open pav-kv opened 1 year ago

pav-kv commented 1 year ago

This issue states a problem and introduces a high-level idea for solving this and similar problems. It is not a ready design yet; it is more of a conversation starter.

Currently, the raft package takes an active role in managing the flow of log entries in and out of Storage, and in pushing them from the leader to the followers. There is an argument for shifting the flow control responsibility to the application layer and reducing raft's responsibility to mostly managing correctness (i.e. making it more of a passive "library" than an active "framework").


For example, currently, once an entry is committed, raft fetches it from Storage and pushes it to the application through the Ready struct for applying, subject only to static size limits.

Since the raft library is not fully aware of the application layer's resource allocation strategy or the semantics of the commands in the log, it may sometimes push too much work through the Ready struct. The static "max bytes" policy is somewhat helpful in this regard; however, in more complex setups it still does not suffice. For example, in CockroachDB one node may host and schedule tens or hundreds of thousands of raft instances, and if many instances push many entries simultaneously, an out-of-memory situation may occur.
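
For reference, these are the static caps in today's Config (a minimal sketch; the values are illustrative, not recommendations). They limit how much data raft pushes per message and per Ready, but they are per-instance and cannot adapt to a node-wide memory budget shared by many raft groups:

```go
package main

import "go.etcd.io/raft/v3"

func main() {
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:            1,
		ElectionTick:  10,
		HeartbeatTick: 1,
		Storage:       storage,

		// Static flow-control knobs; values are illustrative only.
		MaxSizePerMsg:             1 << 20,  // entry bytes per outgoing MsgApp
		MaxCommittedSizePerReady:  16 << 20, // committed entry bytes per Ready
		MaxUncommittedEntriesSize: 1 << 30,  // size of the uncommitted log tail
		MaxInflightMsgs:           256,      // in-flight MsgApp messages per follower
	}

	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})
	defer n.Stop()
}
```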

This necessitates introducing more ad-hoc back pressure / flow control mechanisms into raft to constrain this flow dynamically. The mechanisms should be flexible enough to be used by many applications (such as etcd and CockroachDB). Generally, these mechanisms are two-fold: a) introspection into raft state before it pushes more work to the application, so that the application can preallocate and/or signal raft to slow down before it's too late; b) a feedback mechanism that signals raft to slow down (e.g. see #60).
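
For context, a rough sketch of where such hooks would live, assuming the classic Node.Ready/Advance loop (the `send` and `apply` helpers are placeholders). Today, roughly speaking, the main coarse-grained feedback the application can give is withholding the Advance call until the previous batch has been handled:

```go
package example

import (
	"log"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// runReadyLoop drives a single raft instance (snapshot handling omitted).
// The introspection and slow-down mechanisms discussed above would have to
// plug in around this loop.
func runReadyLoop(
	n raft.Node,
	storage *raft.MemoryStorage,
	send func([]raftpb.Message), // application-owned transport
	apply func([]raftpb.Entry), // application-owned state machine
	done <-chan struct{},
) {
	for {
		select {
		case rd := <-n.Ready():
			// Persist hard state and new log entries first.
			if !raft.IsEmptyHardState(rd.HardState) {
				if err := storage.SetHardState(rd.HardState); err != nil {
					log.Fatal(err)
				}
			}
			if err := storage.Append(rd.Entries); err != nil {
				log.Fatal(err)
			}
			send(rd.Messages)
			apply(rd.CommittedEntries)
			n.Advance() // lets raft produce the next Ready
		case <-done:
			return
		}
	}
}
```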

As another example, the MsgApp flow from the leader to a follower is driven by raft too. There are (mostly) two states a flow can be in: StateProbe and StateReplicate. In overflow situations, the application layer has no option other than dropping the messages, which eventually causes raft to retry sending the same entry appends. It is currently impossible to ask raft to gracefully slow down instead.
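
As an illustration of that situation (the bounded `sendCh` queue and the `enqueueOrDrop` helper are hypothetical), dropping is the only overflow handling available to the transport layer today:

```go
package example

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// enqueueOrDrop hands a message to a hypothetical bounded per-peer send
// queue. If the queue is full, the MsgApp is dropped; raft will eventually
// notice the gap, fall back to probing the follower, and resend the same
// entries. There is no way to ask raft to pause instead.
func enqueueOrDrop(n raft.Node, sendCh chan raftpb.Message, m raftpb.Message) {
	select {
	case sendCh <- m:
		// Queued for delivery.
	default:
		if m.Type == raftpb.MsgApp {
			// Dropping is the only option; optionally hint to raft that the
			// follower is slow so it switches to StateProbe sooner.
			n.ReportUnreachable(m.To)
		}
	}
}
```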


The above are examples of the somewhat ad-hoc flow control mechanisms that we currently have, or could introduce, to work around resource overuse issues. For holistic control, every link in the pipeline requires such a mechanism integrated into the raft library. This enlarges the API surface and implementation complexity, is error-prone, and does not necessarily solve the problems optimally for all raft users.

For best control, the responsibility could be shifted to the users. For example, instead of fetching entries and pushing them to the application (+providing introspection and back pressure knobs), raft could simply report which log index ranges are ready to be applied or replicated, and let the application fetch the entries from its own storage and pace the flow itself.
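
Purely to illustrate the direction, and with the caveat that nothing below exists in the library (all types and fields are hypothetical), the externalized shape could look roughly like this: raft reports which index ranges may flow and to where, and the application fetches, batches, and paces the actual entries.

```go
package example

// LogMark identifies a contiguous slice of the raft log (hypothetical).
type LogMark struct {
	Term uint64
	Lo   uint64 // first index, inclusive
	Hi   uint64 // last index, exclusive
}

// FlowEvents is what a flow-control-externalized raft could emit instead of
// fully materialized entries (hypothetical).
type FlowEvents struct {
	// Committed says entries in this range may be applied; the application
	// fetches them from its own storage at its own pace.
	Committed LogMark
	// SendTo says entries in these ranges are needed by followers; the
	// application reads, batches, and paces the MsgApp traffic itself.
	SendTo map[uint64]LogMark // peer ID -> range to replicate
}

// Ack is the matching feedback path (hypothetical): the application tells
// raft which prefixes it has durably appended, applied, and handed to the
// transport, so raft never has to guess or retry blindly.
type Ack struct {
	Appended  uint64
	Applied   uint64
	Delivered map[uint64]uint64 // peer ID -> highest index given to transport
}
```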


Backwards compatibility should be taken into account: there are applications that already rely on the flow control mechanisms currently built into raft.