EspressoSystems / HotShot

http://hotshot.docs.espressosys.com/
101 stars 25 forks source link

[DEPENDENCY_TASK] Fix CDN Crash Tests #3390

Closed bfish713 closed 1 week ago

bfish713 commented 1 week ago

Closes #3368

This PR:

Solves the final tests failures for the dependency task refactor. The failure was caused because for some code paths we can calculate the VID Shares, store them in the vid share map but not emit a VidShareRecv event. This only happens when running with the combined network. There are 2 places we could calculate the VID shares from a DAProposal. The first is in the DA task if the primary is down. The second is if we get a request from another peer for a VID share we'll try to calculate it if we don't have it but do have a DA Proposal. This led to some nodes not voting when they should be able to because the dependency task was waiting for a VIDShareRecv event that wouldn't come.

Typically what I was seeing in CI was that 6/10 nodes would vote and 4 would not. I think the reasoning for this is that the VID calculations were triggered by the requests for VID. What happens was the first nodes to request VID get back a share but the later nodes to request it are already calculating the VID due to the earlier requesters and don't try to request until the calculation finishes and when it does the request task cancels itself.

The sequence that could cause the bug, from the perspective of one were:

  1. Primary Network (CDN) is down
  2. DA Proposal received
  3. DA vote sent
  4. QuorumProposalRecv and validated
  5. Vid Delayed Request spawned (request not yet sent because we are waiting out the delay)
  6. DAC received
  7. VID Reqesest Received from another node, VID calculation started
  8. VID Caclulation finishes
  9. VID delayed request task awakes after waiting the delay and is cancelled because we already have our VID share
  10. View Times out because we (and others) didn't vote and the next leader could not form QC.

In this example the VID calculation could also be triggered if primary network was known down by the node.

The fix for this is to just broadcast that we have the share if we end up cancelling the VID Task because we have the share. This means there is no way for the vote dependency to not get the event if we got the share or calculated it.

This PR does not:

Key places to review: