This issue states a problem and introduces a high-level idea for solving this and similar problems. It is not a finished design yet; think of it as a conversation starter.
Currently, the `raft` package takes an active role in managing the flow of log entries in and out of `Storage`, and in pushing them from the leader to followers. There is an argument for shifting the flow control responsibility to the application layer and reducing `raft`'s responsibility to mostly managing correctness (i.e. making it more of a passive "library" than an active "framework").
For example, currently, once an entry is committed, `raft` (a condensed version of this loop is sketched below):
- fetches it from `Storage` (+ `unstable`)
- pushes it through the `Ready` struct to the application
- expects the application layer to do the job by the next `Ready` iteration (or a few iterations, in case of async storage writes)
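For illustration, here is a condensed sketch of the application run loop that every `raft` user implements today (snapshot handling, errors, and async storage writes omitted; `send` and `apply` are placeholders for the application's transport and state machine, not part of the raft API):

```go
package raftloop

import (
	"time"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// runReadyLoop is a simplified sketch of the canonical Ready loop.
// send and apply stand in for application-side components.
func runReadyLoop(node raft.Node, storage *raft.MemoryStorage,
	send func([]raftpb.Message), apply func(raftpb.Entry)) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			node.Tick()
		case rd := <-node.Ready():
			// Persist hard state and unstable entries before messaging peers
			// (errors ignored for brevity).
			if !raft.IsEmptyHardState(rd.HardState) {
				storage.SetHardState(rd.HardState)
			}
			storage.Append(rd.Entries)
			// Leader -> follower MsgApp traffic is generated by raft itself.
			send(rd.Messages)
			// raft has already fetched the committed entries and sized this
			// batch; the application is expected to apply them promptly.
			for _, ent := range rd.CommittedEntries {
				apply(ent)
			}
			node.Advance()
		}
	}
}
```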
Since the `raft` library is not fully aware of the application layer's resource allocation strategy or the semantics of the commands in the log, it may sometimes push too much work through the `Ready` struct. The static "max bytes" policy helps somewhat, but in more complex setups it does not suffice. For example, in CockroachDB a single node may host and schedule tens or hundreds of thousands of `raft` instances, and if many instances push many entries simultaneously, the node can run out of memory.
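The static limits live in `raft.Config`; roughly (the concrete values below are arbitrary, the point is that they are fixed per instance and cannot adapt to aggregate memory pressure across many raft groups on one node):

```go
package raftknobs

import "go.etcd.io/raft/v3"

// newConfig shows where today's static "max bytes" knobs live.
func newConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:            id,
		ElectionTick:  10,
		HeartbeatTick: 1,
		Storage:       storage,

		// Static flow-control knobs, fixed at startup:
		MaxSizePerMsg:             1 << 20,  // bytes per outgoing MsgApp
		MaxCommittedSizePerReady:  16 << 20, // bytes of CommittedEntries per Ready
		MaxUncommittedEntriesSize: 1 << 30,  // bytes of uncommitted leader log tail
		MaxInflightMsgs:           256,      // in-flight MsgApps per follower
	}
}
```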
This necessitates introducing more ad-hoc back pressure / flow control mechanisms into `raft` to constrain this flow dynamically. The mechanisms should be flexible enough to be usable by many applications (such as etcd and CockroachDB). Generally, they are two-fold: a) introspection into `raft` state before it pushes more work to the application, so that the application can preallocate and/or signal `raft` to slow down before it is too late; b) a feedback mechanism that signals `raft` to slow down (e.g. see #60).
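To make (a) and (b) a bit more concrete, a hypothetical pair of hooks could look like the following. None of these names exist in the library today; this is purely illustrative of the kind of API surface such mechanisms would add:

```go
package flowcontrol

// PendingWork is a hypothetical introspection snapshot raft could expose
// before building the next Ready, so the application can preallocate or
// decide to throttle.
type PendingWork struct {
	CommittedBytes   uint64 // bytes of committed-but-unapplied entries
	CommittedEntries int    // number of such entries
}

// Pacer is a hypothetical feedback hook (in the spirit of #60): the
// application inspects the pending work and returns a budget; raft would
// push at most that many bytes through the next Ready.
type Pacer interface {
	Admit(groupID uint64, pending PendingWork) (budgetBytes uint64)
}
```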
As another example, the `MsgApp` flow from leader to follower is also driven by `raft`. A flow is (mostly) in one of two states: `StateProbe` or `StateReplicate`. In overflow situations, the application layer has no option other than dropping the messages, which eventually causes `raft` to retry sending the same entry appends. It is currently impossible to ask `raft` to gracefully slow down instead.
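Today the only recourse lives on the application's transport side, roughly as in this simplified sketch (`outbox` is a hypothetical bounded send queue):

```go
package transport

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// sendMessages sketches the only back-pressure option available for the
// MsgApp flow today: when the outbound buffer is full, the message is
// dropped and the follower is reported unreachable, which moves the flow
// from StateReplicate back to StateProbe and makes raft retry later.
func sendMessages(node raft.Node, outbox chan raftpb.Message, msgs []raftpb.Message) {
	for _, m := range msgs {
		select {
		case outbox <- m: // there is room; hand the message to the transport
		default:
			// Overflow: raft cannot be asked to slow down gracefully, so
			// the message is dropped and the peer is reported unreachable.
			if m.Type == raftpb.MsgApp {
				node.ReportUnreachable(m.To)
			}
		}
	}
}
```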
The above are examples of somewhat ad-hoc flow control mechanisms that we currently have, or could introduce, to work around resource overuse issues. For holistic control, every link in the pipeline requires such a mechanism integrated into the `raft` library. This enlarges the API surface and implementation complexity, is error-prone, and does not necessarily solve the problems optimally for all `raft` users.
For best control, the responsibility could be shifted to the users. For example, instead of fetching entries and pushing them to the application (plus providing introspection and back pressure knobs), `raft` could simply (a hypothetical API shape is sketched after this list):
- indicate to the application a log index/term range of committed entries
- expect the application to fetch and apply the entries at its own pace
- react to a message from the application confirming that some entries were applied (ensuring the correctness of the overall raft algorithm)
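A hypothetical shape of such a pull-based API, with all names invented here for illustration rather than proposed as concrete signatures:

```go
package pullmodel

// CommittedSpan is the index/term range raft would announce as committed,
// instead of pushing the entries themselves.
type CommittedSpan struct {
	Term  uint64 // term of the last committed entry
	Index uint64 // highest committed index
}

// PullNode sketches the raft-side surface of the pull model: raft only
// reports progress; the application fetches entries from its own Storage
// and applies them at its own pace.
type PullNode interface {
	// Committed reports the currently committed span; raft fetches and
	// pushes nothing itself.
	Committed() CommittedSpan
	// AppliedTo tells raft that everything up to and including index has
	// been applied, so its correctness bookkeeping can advance.
	AppliedTo(index uint64)
}
```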
Backwards compatibility should also be taken into account: there are applications that already rely on the flow control mechanisms currently built into `raft`.