### Motivation

In #1593, we reviewed the `poll_ready` implementations in Zebra, but kept the backpressure implementations mostly the same.
Incorrect backpressure implementations can impact resource usage and cause hangs, so we should make sure that Zebra transmits backpressure correctly.
### Constraints
The constraints imposed by the `tower::Buffer` and `tower::Batch` implementations are:

* `poll_ready` must be called at least once for each `call`
* Once we've reserved a buffer slot, we always get `Poll::Ready` from a buffer, regardless of the current readiness of the buffer or its underlying service
* The `Buffer`/`Batch` capacity limits the number of concurrently waiting tasks. Once this limit is reached, further tasks will block, awaiting a free reservation.
* Some tasks can depend on other tasks before they resolve. (For example: block validation.) If there are task dependencies, the `Buffer`/`Batch` capacity must be larger than the maximum number of concurrently waiting tasks, or Zebra could deadlock (hang).
* To avoid deadlocks, we should acquire slots in a consistent order, using the `ready!` macro, rather than polling them all simultaneously.
For example, see #1735:
```rust
use futures::ready;

// We acquire checkpoint readiness before block readiness, to avoid an unlikely
// hang during the checkpoint to block verifier transition. If the checkpoint and
// block verifiers are contending for the same buffer/batch, we want the checkpoint
// verifier to win, so that checkpoint verification completes, and block verification
// can start. (Buffers and batches have multiple slots, so this contention is unlikely.)
//
// The chain verifier holds one slot in each verifier, for each concurrent task.
// Therefore, any shared buffers or batches polled by these verifiers should double
// their bounds. (For example, the state service buffer.)
ready!(self
.checkpoint
.poll_ready(cx)
.map_err(VerifyChainError::Checkpoint))?;
ready!(self.block.poll_ready(cx).map_err(VerifyChainError::Block))?;
Poll::Ready(Ok(()))
```
### Design Tasks
Develop consistent design patterns for Zebra backpressure.
These patterns could include:

* transmit backpressure
* load shed directly in the service (or the code that calls the service), as sketched below
* ignore backpressure

The design should include advice on when to use each design pattern.
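As an illustration of the second pattern, here is a minimal load-shedding sketch (not Zebra code), using tower's `load_shed` layer around a hypothetical inner service: when the wrapped service is not ready, calls fail immediately with an `Overloaded` error instead of waiting.

```rust
use tower::{BoxError, Service, ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() -> Result<(), BoxError> {
    // Hypothetical inner service for illustration: doubles a number.
    let inner = tower::service_fn(|req: u32| async move { Ok::<u32, BoxError>(req * 2) });

    // Load shedding: if the inner service (here limited to one in-flight
    // request) is not ready, `load_shed` fails the call immediately with an
    // `Overloaded` error, instead of transmitting backpressure to the caller.
    let mut service = ServiceBuilder::new()
        .load_shed()
        .concurrency_limit(1)
        .service(inner);

    let response = service.ready().await?.call(21).await?;
    println!("response: {response}");
    Ok(())
}
```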
### Complex Design Constraints: Inbound Service
The `Inbound` service currently doesn't transmit backpressure.

Ideally, we'd like to transmit backpressure when:

* the state is busy, but only for requests that use the state
* the address book is locked, but only for peer requests
We want to drop old requests when the download queue is full, because the latest requests are most likely to be new blocks we haven't seen yet. So we don't want to buffer any download requests. Since we want recent gossiped blocks, we might want to:
* avoid all backpressure in the Inbound service, or
* have a separate queue for gossiped blocks
#### Rejected Designs
We don't want to propagate backpressure in `Inbound::poll_ready`, because we'd be holding a state buffer slot for every concurrent request, even if the request didn't use the state.
We don't want to generate a `Poll::Pending` if the downloader is full, because most requests don't use the downloader. (And we'd have to wake the pending task when the downloader emptied.)
Here's a backpressure design for the address book:
* lock and copy the address book data in `Inbound::call`, so the `Inbound` buffer fills up if the address book is locked or heavily contended
But I'm not sure if there's an API that does the same thing for the state service:
* acquire the state buffer slot in `Inbound::call`, so the `Inbound` buffer fills up if the state is busy (see the sketch below)
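One candidate API is `tower::ServiceExt::oneshot`, which waits for readiness and then calls the service, all inside the returned future. A minimal sketch, assuming the state handle is a cloneable `Buffer` and using hypothetical request/response types:

```rust
use tower::{BoxError, Service, ServiceExt};

// Hypothetical stand-ins for the state service's request and response types.
struct StateRequest;
struct StateResponse;

// Sketch: acquire the state buffer slot inside `Inbound::call`, not in
// `Inbound::poll_ready`. `oneshot` waits for the cloned state handle to become
// ready, then calls it, so a slot is only held by requests that actually use
// the state, and the `Inbound` buffer fills up while the state is busy.
async fn query_state<S>(state: S, request: StateRequest) -> Result<StateResponse, BoxError>
where
    S: Service<StateRequest, Response = StateResponse, Error = BoxError>,
{
    state.oneshot(request).await
}
```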
#### Potential Design
We might want to turn `Inbound` requests into a `Stream`, so they hold slots until the future resolves to a result. (Rather than `Buffer`s, which hold slots until the future is created and returned.) That way, we can drop new requests when the `Inbound` stream is full, transmitting backpressure.
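Here is a rough sketch of that idea (not a concrete proposal for Zebra), using a hypothetical request type and handler: `futures::StreamExt::buffer_unordered` only polls a bounded number of request futures at once, so each slot stays occupied until its response future resolves, and no new requests are pulled from the stream while all slots are full.

```rust
use futures::{Stream, StreamExt};

// Rough sketch with a hypothetical request type and handler: up to 8 request
// futures run at once. A slot is held until the response future resolves, and
// no new requests are pulled from the stream while all 8 slots are occupied,
// which transmits backpressure to the request source.
async fn handle_requests(requests: impl Stream<Item = u32>) {
    requests
        .map(|request| async move {
            // Hypothetical handler: produce a response for the request.
            request * 2
        })
        .buffer_unordered(8)
        .for_each(|_response| async {})
        .await;
}
```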
### Implementation Tasks
Review:

* every Zebra `Service` and `Stream`
* all the code that calls Zebra `Service`s and `Stream`s

and make sure the code uses one of the Zebra backpressure design patterns.

In particular, we should search for the following strings:

* `poll_ready` for `Service`s
* `ready_and` for `Service`s
* `poll_next` for `Stream`s
* any other `Service` or `ServiceExt` functions that call `poll_ready`
* any other `Stream` or `StreamExt` functions that call `poll_next`