AleoNet / snarkOS

A Decentralized Operating System for ZK Applications
http://snarkos.org
Apache License 2.0

fix(sync): Fix client sync RwLock deadlock and client block request stall #3321

Closed · Meshiest closed this 3 months ago

Meshiest commented 3 months ago

Motivation

We have observed two different kinds of stalls while conducting client-sync acceptance tests on canarynet. Both are most noticeable and reproducible when block response processing takes a sufficiently long time. They can be classified as follows:

process.write Deadlock

Background

There are two places during block sync where process is write-locked (atomic_speculate and atomic_finalize) and two places where it is read-locked (check_execution_internal and check_fee_internal). Both write locks are taken in series inside try_advancing_with_block_responses, via check_next_block and add_next_block.
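
To orient the reader, here is a minimal sketch of that locking layout. The types and function bodies are placeholders, not the actual snarkVM/snarkOS code; only the function names and the write/read split come from the description above.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for the VM's `process` state; the real type lives in snarkVM.
struct ProcessState {
    latest_height: u32,
}

struct Vm {
    process: Arc<RwLock<ProcessState>>,
}

impl Vm {
    // Write path: atomic_speculate and atomic_finalize each take the write lock,
    // and run back-to-back via check_next_block and add_next_block.
    fn atomic_speculate(&self) {
        let _process = self.process.write().unwrap();
        // ... potentially long-running speculation over the block's transactions ...
    }

    fn atomic_finalize(&self) {
        let mut process = self.process.write().unwrap();
        process.latest_height += 1;
    }

    // Read path: check_execution_internal and check_fee_internal only need a read lock.
    fn check_execution_internal(&self) -> u32 {
        self.process.read().unwrap().latest_height
    }
}

fn main() {
    let vm = Vm { process: Arc::new(RwLock::new(ProcessState { latest_height: 0 })) };
    vm.atomic_speculate();
    vm.atomic_finalize();
    println!("latest height = {}", vm.check_execution_internal());
}
```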

This code primarily runs while handling a BlockResponse message, which invokes advance_with_sync_blocks. That path has a guard before calling try_advancing_with_block_responses to ensure no other block response is trying to write in parallel (which would create a deadlock).

We have observed that this particular area in atomic_post_ratify can take quite a while (more than 5 seconds) while holding the write lock in atomic_speculate.

Client nodes also run a loop that kicks off block requests every 5 seconds. It coincidentally calls try_advancing_with_block_responses as well, but without any guard to check whether another try_advance is already running from the aforementioned block responses.
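
That loop looks roughly like the following sketch, assuming a tokio runtime. The struct and helper names are simplified assumptions; only the 5-second interval and the unguarded try_advancing_with_block_responses call come from the description above.

```rust
use std::sync::Arc;
use std::time::Duration;

// Hypothetical stand-in for the node's block-sync state; the real logic lives in snarkOS.
struct BlockSync;

impl BlockSync {
    async fn send_block_requests(&self) {
        // ... choose peers and fire BlockRequest messages ...
    }

    async fn try_advancing_with_block_responses(&self) {
        // ... check_next_block + add_next_block over any buffered responses ...
    }
}

// A rough sketch of the periodic client sync loop described above. Note that,
// before this PR, it called try_advancing_with_block_responses without checking
// whether a BlockResponse handler was already advancing.
async fn client_sync_loop(sync: Arc<BlockSync>) {
    let mut interval = tokio::time::interval(Duration::from_secs(5));
    loop {
        interval.tick().await;
        sync.send_block_requests().await;
        sync.try_advancing_with_block_responses().await;
    }
}

#[tokio::main]
async fn main() {
    let sync = Arc::new(BlockSync);
    // In the real node this runs in the background for the node's lifetime;
    // here we simply let it tick once before exiting.
    let handle = tokio::spawn(client_sync_loop(sync));
    let _ = tokio::time::timeout(Duration::from_secs(1), handle).await;
}
```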

If an atomic_speculate took long enough to overlap the 5-second client sync loop while no block requests were active, the two calls would collide and the node would stall permanently, with no block advancement until it was restarted.

Fix

We added a guard to try_block_sync and have not run into this issue since.
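
As an illustration, this kind of re-entrancy guard can be expressed with an atomic flag. The following is only a sketch of the idea, not necessarily how the PR implements it.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// A minimal sketch of a re-entrancy guard around block advancement; the flag,
// names, and structure are illustrative assumptions.
struct BlockSync {
    // True while some task is already advancing with block responses.
    is_advancing: AtomicBool,
}

impl BlockSync {
    fn try_block_sync(self: &Arc<Self>) {
        // Only one caller may advance at a time; everyone else backs off and
        // retries on the next tick instead of piling up behind the write lock.
        if self
            .is_advancing
            .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
            .is_err()
        {
            return;
        }
        self.try_advancing_with_block_responses();
        self.is_advancing.store(false, Ordering::SeqCst);
    }

    fn try_advancing_with_block_responses(&self) {
        // ... check_next_block + add_next_block over buffered responses ...
    }
}

fn main() {
    let sync = Arc::new(BlockSync { is_advancing: AtomicBool::new(false) });
    sync.try_block_sync();
}
```

A real implementation would also want to clear the flag on every exit path (for example with a scope guard) so that a panic or early return cannot leave the guard permanently set.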

Block Request Starvation

Background

For block requests to be created for a given height, the following criteria must be met (outlined in check_block_request):

Existing block requests, responses, and timestamps are removed under the following (issue-relevant) criteria:

If the last peer for a block request disconnects and the block request and its timestamp are removed, any block response already received is left in a cycle where no new request will be created and the request will never be considered complete, so block advancement cannot continue. This results in a permanent stall with no block advancement until the node is restarted.

Even if that alone is fixed, the node will still reach a state where request timestamps are retained until they time out, because an "incomplete request" is one whose peer set exists and is non-empty, or one whose peer set has been removed entirely (technically empty, and so not truly incomplete). This results in repeated temporary stalls, since new block requests are not created until the request_timestamp times out.
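
To make the failure mode concrete, here is a simplified model of the bookkeeping involved. The types, field layout, and remove_peer logic are hypothetical stand-ins for the real snarkOS sync state.

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;
use std::time::Instant;

// A simplified model of the block-sync bookkeeping described above.
struct BlockSyncState {
    // Outstanding requests: height -> peers the request was sent to.
    requests: HashMap<u32, HashSet<SocketAddr>>,
    // Buffered responses: height -> (opaque) block bytes.
    responses: HashMap<u32, Vec<u8>>,
    // When each request was issued; used for timeouts.
    request_timestamps: HashMap<u32, Instant>,
}

impl BlockSyncState {
    // Mirrors the pre-fix behavior: when the last peer serving a request
    // disconnects, the request and its timestamp are dropped, but a response
    // that already arrived is left behind. No new request will be issued for
    // that height, yet advancement still treats the height as incomplete.
    fn remove_peer(&mut self, peer: SocketAddr) {
        for peers in self.requests.values_mut() {
            peers.remove(&peer);
        }
        let emptied: Vec<u32> =
            self.requests.iter().filter(|(_, peers)| peers.is_empty()).map(|(h, _)| *h).collect();
        for height in emptied {
            self.requests.remove(&height);
            self.request_timestamps.remove(&height);
            // `self.responses` is intentionally untouched here: the orphaned
            // response is exactly the stalling state described above.
        }
    }
}

fn main() {
    let peer: SocketAddr = "127.0.0.1:4130".parse().unwrap();
    let mut state = BlockSyncState {
        requests: HashMap::from([(100, HashSet::from([peer]))]),
        responses: HashMap::from([(100, vec![0u8; 4])]),
        request_timestamps: HashMap::from([(100, Instant::now())]),
    };
    state.remove_peer(peer);
    // The request and timestamp for height 100 are gone, but its response lingers.
    assert!(state.requests.is_empty() && !state.responses.is_empty());
}
```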

Fix

By flipping the unwrap_or(false) to unwrap_or(true) in requests.get(height).map(|(_, _, peer_ips)| peer_ips.is_empty()).unwrap_or(false), a request whose peer_ips set is empty or has been removed is treated as "complete", which allows block advancement to continue and genuinely incomplete requests to be removed.
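
In context, the change amounts to something like the following. Only the quoted requests.get(...) expression comes from the PR; the surrounding names and types are illustrative assumptions.

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;

// Illustrative context for the one-word fix described above.
type Height = u32;
type Requests = HashMap<Height, ((), (), HashSet<SocketAddr>)>;

// Pre-fix: a height with no entry in `requests` maps to `false`, so a request
// whose entry was removed (last peer disconnected) is never treated as complete.
fn is_request_complete_old(requests: &Requests, height: Height) -> bool {
    requests.get(&height).map(|(_, _, peer_ips)| peer_ips.is_empty()).unwrap_or(false)
}

// Post-fix: `unwrap_or(true)` treats a missing (or empty) peer set as complete,
// so block advancement can continue and stale bookkeeping can be cleaned up.
fn is_request_complete_new(requests: &Requests, height: Height) -> bool {
    requests.get(&height).map(|(_, _, peer_ips)| peer_ips.is_empty()).unwrap_or(true)
}

fn main() {
    let requests = Requests::new();
    assert!(!is_request_complete_old(&requests, 100));
    assert!(is_request_complete_new(&requests, 100));
}
```

Note that the map closure already returns true for an empty peer set in both versions; the flip changes how a missing (removed) request entry is classified, which is the case produced by the peer-disconnect cleanup described earlier.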

Test Plan

We have successfully deployed this and synced 7 clients (1 alone on a 3995WX, 2 on a shared 2x EPYC 9354, and 4 on standalone GCP e2-standard-16 instances with 16 vCPUs and 64 GB of RAM) through canary blocks 28,000 to 35,000, which were particularly stressful on clients.

Last week we ran a 100-client sync test that stalled constantly, starting as low as block 8,000 (screenshot).

Right now we have two pairs of 100 clients syncing through the troublesome blocks (screenshot).

The stall visible in the next screenshot is from when we were discovering and fixing the temporary stall issue (screenshot).

Note that the servers running these clients are still properly syncing without stalls despite both being under spec and running 10 clients each...

Related PRs

HarukaMa commented 3 months ago

Can confirm as well that this fixes the deadlock and stalls, great work!

apruden2008 commented 3 months ago

Thanks @Meshiest @damons for the great work here.

@zosorock this is good to merge pending CI passing