ZcashFoundation / zebra

Zcash - Financial Privacy in Rust 🦓
https://zfnd.org/zebra/
Apache License 2.0
408 stars 98 forks source link

Choose between backpressure and load shedding for each service #1618

Closed teor2345 closed 2 years ago

teor2345 commented 3 years ago

Motivation

In #1593, we reviewed the poll_ready implementations in Zebra, but kept the backpressure implementations mostly the same.

Incorrect backpressure implementations can impact resource usage and cause hangs, so we should make sure that Zebra transmits backpressure correctly.

Constraints

The constraints imposed by the tower::Buffer and tower::Batch implementations are:

  1. poll_ready must be called at least once for each call
  2. Once we've reserved a buffer slot, we always get Poll::Ready from a buffer, regardless of the current readiness of the buffer or its underlying service
  3. The Buffer/Batch capacity limits the number of concurrently waiting tasks. Once this limit is reached, further tasks will block, awaiting a free reservation.
  4. Some tasks can depend on other tasks before they resolve. (For example: block validation.) If there are task dependencies, the Buffer/Batch capacity must be larger than the maximum number of concurrently waiting tasks, or Zebra could deadlock (hang).
  5. To avoid deadlocks, we should acquire slots in a consistent order, using the ready! macro, rather than polling them all simultaneously

For example, see #1735:

```rust // We acquire checkpoint readiness before block readiness, to avoid an unlikely // hang during the checkpoint to block verifier transition. If the checkpoint and // block verifiers are contending for the same buffer/batch, we want the checkpoint // verifier to win, so that checkpoint verification completes, and block verification // can start. (Buffers and batches have multiple slots, so this contention is unlikely.) use futures::ready; // The chain verifier holds one slot in each verifier, for each concurrent task. // Therefore, any shared buffers or batches polled by these verifiers should double // their bounds. (For example, the state service buffer.) ready!(self .checkpoint .poll_ready(cx) .map_err(VerifyChainError::Checkpoint))?; ready!(self.block.poll_ready(cx).map_err(VerifyChainError::Block))?; Poll::Ready(Ok(())) ```

Design Tasks

Develop consistent design patterns for Zebra backpressure.

These patterns could include:

The design should include advice on when to use each design pattern.

Complex Design Constraints: Inbound Service

The Inbound service currently doesn't transmit backpressure.

Ideally, we'd like to transmit backpressure when:

We want to drop old requests when the download queue is full, because the latest requests are most likely to be new blocks we haven't seen yet, so we don't want to buffer any download requests. Since we want recent gossiped blocks, we might want to: * avoid all backpressure in the Inbound service, or * have a separate queue for gossiped blocks #### Rejected Designs We don't want to propagate backpressure in `Inbound::poll_ready`, because we'd be holding a state buffer slot for every concurrent request, even if the request didn't use the state. We don't want to generate a `Poll::Pending` if the downloader is full, because most requests don't use the downloader. (And we'd have to wake the pending task when the downloader emptied.) Here's a backpressure design for the address book: * lock and copy the address book data in `Inbound::call`, so the `Inbound` buffer fills up if the address book is locked or heavily contended But I'm not sure if there's an API that does the same thing for the state service: * acquire the state buffer slot in `Inbound::call`, so the `Inbound` buffer fills up if the state is busy #### Potential Design We might want to turn `Inbound` requests into a `Stream`, so they hold slots until the future resolves to a result. (Rather than `Buffer`s, which hold slots until the future is created and returned.) That way, we can drop new requests when the `Inbound` stream is full, transmitting backpressure.

Implementation Tasks

Review:

and make sure the code uses one of the Zebra backpressure design patterns.

In particular, we should search for the following strings:

teor2345 commented 2 years ago

We should fix issues with individual services as they come up.