etcd-io / raft

Raft library for maintaining a replicated state machine

Externalize log entry flow control #64

Open pav-kv opened 1 year ago

pav-kv commented 1 year ago

This issue states a problem and introduces a high-level idea for solving this and similar problems. It is not a ready design yet; it is more of a conversation starter.

Currently, the raft package takes an active role in managing the flow of log entries in and out of Storage, and in pushing them from the leader to the followers. There is an argument for shifting the flow control responsibility to the application layer and reducing raft's responsibility to mostly managing correctness (i.e. making it more of a passive "library" than an active "framework").


For example, currently, once an entry is committed, raft fetches it from Storage and pushes it to the application through the Ready struct for applying, subject only to static size limits.

Since the raft library is not fully aware of the application layer's resource allocation strategy or the semantics of the commands in the log, it may sometimes push too much work through the Ready struct. The static "max bytes" policy is somewhat helpful in this regard; however, in more complex setups it still does not suffice. For example, in CockroachDB one node may host and schedule tens or hundreds of thousands of raft instances, and if many instances push many entries simultaneously, an out-of-memory situation may occur.
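
For reference, these are the static caps in today's Config (a minimal sketch; the values are illustrative, not recommendations). They limit how much data raft pushes per message and per Ready, but they are per-instance and cannot adapt to a node-wide memory budget shared by many raft groups:

```go
package main

import "go.etcd.io/raft/v3"

func main() {
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:            1,
		ElectionTick:  10,
		HeartbeatTick: 1,
		Storage:       storage,

		// Static flow-control knobs; values are illustrative only.
		MaxSizePerMsg:             1 << 20,  // entry bytes per outgoing MsgApp
		MaxCommittedSizePerReady:  16 << 20, // committed entry bytes per Ready
		MaxUncommittedEntriesSize: 1 << 30,  // size of the uncommitted log tail
		MaxInflightMsgs:           256,      // in-flight MsgApp messages per follower
	}

	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})
	defer n.Stop()
}
```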

This necessitates introducing more ad-hoc back pressure / flow control mechanisms into raft to constrain this flow dynamically. The mechanisms should be flexible enough to be used by many applications (such as etcd and CockroachDB). Generally, these mechanisms are two-fold: a) introspection into raft state before it pushes more work to the application, so that the application can preallocate and/or signal raft to slow down before it's too late; b) a feedback mechanism that signals raft to slow down (e.g. see #60).
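
For context, a rough sketch of where such hooks would live, assuming the classic Node.Ready/Advance loop (the `send` and `apply` helpers are placeholders). Today, roughly speaking, the main coarse-grained feedback the application can give is withholding the Advance call until the previous batch has been handled:

```go
package example

import (
	"log"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// runReadyLoop drives a single raft instance (snapshot handling omitted).
// The introspection and slow-down mechanisms discussed above would have to
// plug in around this loop.
func runReadyLoop(
	n raft.Node,
	storage *raft.MemoryStorage,
	send func([]raftpb.Message), // application-owned transport
	apply func([]raftpb.Entry), // application-owned state machine
	done <-chan struct{},
) {
	for {
		select {
		case rd := <-n.Ready():
			// Persist hard state and new log entries first.
			if !raft.IsEmptyHardState(rd.HardState) {
				if err := storage.SetHardState(rd.HardState); err != nil {
					log.Fatal(err)
				}
			}
			if err := storage.Append(rd.Entries); err != nil {
				log.Fatal(err)
			}
			send(rd.Messages)
			apply(rd.CommittedEntries)
			n.Advance() // lets raft produce the next Ready
		case <-done:
			return
		}
	}
}
```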

As another example, the MsgApp flow from the leader to a follower is driven by raft too. There are (mostly) two states a flow can be in: StateProbe and StateReplicate. In overflow situations, the application layer has no option other than dropping the messages, which eventually causes raft to retry sending the same entry appends. It is currently impossible to ask raft to gracefully slow down instead.
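
As an illustration of that situation (the bounded `sendCh` queue and the `enqueueOrDrop` helper are hypothetical), dropping is the only overflow handling available to the transport layer today:

```go
package example

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// enqueueOrDrop hands a message to a hypothetical bounded per-peer send
// queue. If the queue is full, the MsgApp is dropped; raft will eventually
// notice the gap, fall back to probing the follower, and resend the same
// entries. There is no way to ask raft to pause instead.
func enqueueOrDrop(n raft.Node, sendCh chan raftpb.Message, m raftpb.Message) {
	select {
	case sendCh <- m:
		// Queued for delivery.
	default:
		if m.Type == raftpb.MsgApp {
			// Dropping is the only option; optionally hint to raft that the
			// follower is slow so it switches to StateProbe sooner.
			n.ReportUnreachable(m.To)
		}
	}
}
```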


The above are examples of the somewhat ad-hoc flow control mechanisms that we currently have, or could introduce, to work around resource overuse issues. For holistic control, every link in the pipeline requires such a mechanism integrated into the raft library. This enlarges the API surface and implementation complexity, is error-prone, and does not necessarily solve the problems optimally for all raft users.

For best control, the responsibility could be shifted to the users. For example, instead of fetching entries and pushing them to the application (+providing introspection and back pressure knobs), raft could simply report which log index ranges are ready to be applied or replicated, and let the application fetch the entries from its own storage and pace the flow itself.
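
Purely to illustrate the direction, and with the caveat that nothing below exists in the library (all types and fields are hypothetical), the externalized shape could look roughly like this: raft reports which index ranges may flow and to where, and the application fetches, batches, and paces the actual entries.

```go
package example

// LogMark identifies a contiguous slice of the raft log (hypothetical).
type LogMark struct {
	Term uint64
	Lo   uint64 // first index, inclusive
	Hi   uint64 // last index, exclusive
}

// FlowEvents is what a flow-control-externalized raft could emit instead of
// fully materialized entries (hypothetical).
type FlowEvents struct {
	// Committed says entries in this range may be applied; the application
	// fetches them from its own storage at its own pace.
	Committed LogMark
	// SendTo says entries in these ranges are needed by followers; the
	// application reads, batches, and paces the MsgApp traffic itself.
	SendTo map[uint64]LogMark // peer ID -> range to replicate
}

// Ack is the matching feedback path (hypothetical): the application tells
// raft which prefixes it has durably appended, applied, and handed to the
// transport, so raft never has to guess or retry blindly.
type Ack struct {
	Appended  uint64
	Applied   uint64
	Delivered map[uint64]uint64 // peer ID -> highest index given to transport
}
```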


Backwards compatibility should be taken into account: there are applications that already rely on the flow control mechanisms currently built into raft.