AleoNet / snarkOS

A Decentralized Operating System for ZK Applications
http://snarkos.org
Apache License 2.0
4.1k stars 2.57k forks source link

[Optimize] Clear solutions queue and worker solutions on new epoch #3266

Closed raychu86 closed 1 month ago

raychu86 commented 1 month ago

Motivation

This PR clears all the solutions from the solutions queue and worker ready queue when a new epoch starts. This prevents the node from being stuck on iteratively processing old solutions that are no longer relevant but is still held in memory.

Note: This is only done for validator nodes who are advancing via consensus. Syncing validators will not run this cleanup because they should not be processing incoming solutions in the first place.

An alternative approachs:

  1. Pass in a particular epoch hash to retain (so we don't erroneously clear the solutions that are actually valid).
  2. Clear the solutions on a timeout system.
vicsn commented 1 month ago

Pass in a particular epoch hash to retain (so we don't erroneously clear the solutions that are actually valid).

I think its not a big deal, but given that it's a very small additional diff, I'm weakly in favor of passing in the epoch hash to retain.

A brief analysis - we're talking about:

  1. a validator is at epoch N
  2. a validator receives a solution for epoch N+1
  3. the validator advances to epoch N+1, we want to avoid pruning the solution before it is processed

For the solution to have epoch N + 1, at least one validator must have already advanced, and other validators may also advance before provers find the first solution. In other words, retaining the early solutions is a small extra incentive to let miners hit all validators equally with solutions.

raychu86 commented 1 month ago

@vicsn

We only need to filter out the solutions_queue because the Ready queue only contains solutions that are already verified and the new ones for epoch N+1 would not have been verified.

Ideally we could do something like:

self.solutions_queue.lock().retain(|_, solution| solution.id() == self.ledger.latest_epoch_hash()?);

However, the LruCache object does not implement retain, so it get a bit messier. @ljedrz Is there an easy and efficient way to do this?

ljedrz commented 1 month ago

@raychu86 I don't see any way of doing it directly in the LruCache, as it doesn't even expose an Entry API; we'd have to iterate over the cache, collect the entries we want to remove into some temporary collection and then pop them from the cache on a one-by-one basis.