AleoNet / snarkOS

A Decentralized Operating System for ZK Applications
http://snarkos.org
Apache License 2.0

Limit parallel processing for clients #3358

Open vicsn opened 2 weeks ago

vicsn commented 2 weeks ago

Motivation

Clients run out of memory because they have no limit on how many transactions or solutions they verify in parallel. This PR proposes to queue them (just like the validator does in Consensus) and limit how much parallel verification we do.
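A minimal sketch of the queueing idea, using only the Rust standard library (this is illustrative, not the PR's actual implementation; the `Transaction` type and `verify` function are placeholders): incoming items go into a channel, and a fixed pool of workers drains it, so at most `MAX_PARALLEL` verifications ever run concurrently while everything else waits in the queue.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Placeholder for a transaction awaiting verification (illustrative only).
struct Transaction(u64);

/// Stand-in for the expensive verification step.
fn verify(tx: &Transaction) -> bool {
    tx.0 % 2 == 0
}

fn main() {
    const MAX_PARALLEL: usize = 4; // cap on concurrent verifications

    let (tx_send, tx_recv) = mpsc::channel::<Transaction>();
    let tx_recv = Arc::new(Mutex::new(tx_recv));
    let (res_send, res_recv) = mpsc::channel::<bool>();

    // Fixed worker pool: parallelism is bounded by the number of workers;
    // excess transactions simply queue up in the channel.
    let mut workers = Vec::new();
    for _ in 0..MAX_PARALLEL {
        let rx = Arc::clone(&tx_recv);
        let out = res_send.clone();
        workers.push(thread::spawn(move || loop {
            // Lock only to dequeue; verification runs outside the lock.
            let tx = match rx.lock().unwrap().recv() {
                Ok(tx) => tx,
                Err(_) => break, // queue closed, worker exits
            };
            out.send(verify(&tx)).unwrap();
        }));
    }
    drop(res_send);

    // Enqueue 100 transactions; only 4 are verified at any one time.
    for i in 0..100 {
        tx_send.send(Transaction(i)).unwrap();
    }
    drop(tx_send);

    let valid = res_recv.iter().filter(|&ok| ok).count();
    for w in workers {
        w.join().unwrap();
    }
    println!("valid: {valid}"); // prints "valid: 50"
}
```

A real node would bound the queue itself as well (dropping or back-pressuring peers when it fills), but the core safety property is the same: memory use is proportional to the worker count, not to the arrival rate.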

We can do a lot of clever things to increase processing speed, such as checking how many constraints the incoming transactions have, or awaiting on a channel to start verifying promptly, but the focus for now is simplicity and safety.

Even though it was recently suggested clients should have at least 128 GiB of memory, the current implementation uses "only" up to 30 GiB for transaction verification. The right number is up for debate.
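To make the trade-off concrete, a back-of-the-envelope calculation: the 30 GiB budget is from the PR, but the per-transaction worst case below is an assumed figure, not a measured one.

```rust
/// How many verifications can safely run in parallel under a fixed
/// memory budget, given a worst-case cost per verification.
fn max_parallel(budget_gib: u64, worst_case_per_tx_gib: u64) -> u64 {
    budget_gib / worst_case_per_tx_gib
}

fn main() {
    // 30 GiB budget (from the PR), assumed 2 GiB worst case per transaction.
    println!("{}", max_parallel(30, 2)); // prints 15
}
```

Note the sensitivity: halving the assumed worst case doubles the achievable parallelism, which is why pinning down the real worst-case verification footprint matters for picking the default.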

Test Plan

- CI passed
- Ran a local network and shot some executions at it
- In canary, more serious concurrent traffic can be shot at the network

Related PRs

Potentially closes: https://github.com/AleoNet/snarkOS/issues/3341
Replaces: https://github.com/AleoNet/snarkOS/pull/2970

vicsn commented 2 weeks ago

An open design question is how much memory should be reserved for worst case transaction + solution verification.

Summoning @evanmarshall @zosorock @Meshiest. Provable previously recommended that clients have 128 GiB of RAM, but I understand that you and others want to run with less. So my questions to you:

  1. What do you think is a sufficient default amount of RAM which should be assumed to be available for transaction/solution verification?
  2. How badly do you want an additional feature flag which lets you increase the amount of RAM used for transaction/solution verification?
Meshiest commented 2 weeks ago

My node-operation experience is not very diversified, so I don't have a good estimate for (1), and I personally haven't run into anything that would need the feature flag from (2).

The main limiter we've observed in client sync is core count. A 16-core machine was barely able to keep up with block production on canary after the 15 TPS tests. Upgrading a client VM from 16 to 32 cores massively increased sync speed. Our servers with more than 64 cores were powerhouses when syncing. RAM seemed less important in our tests, though we weren't using large programs or verifying any solutions.

vicsn commented 2 weeks ago

@raychu86 sorry, I added a separate execution queue; I couldn't let a simple 200x improvement in max throughput slide (assuming the compute is available): 9049f3e34

> In theory we could also use the sysinfo crate to fetch the total memory of the machine that's running the node.

Yes, I thought about it, but I think this dynamic behaviour would complicate infra planning too much, so there should rather be a `--available-ram` flag or something similar if users have very diverse preferences.
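For illustration, a flag like the one proposed could be parsed with nothing but the standard library (the `--available-ram` name comes from the discussion above; it is hypothetical and does not exist in snarkOS, which in practice would wire this through its CLI framework):

```rust
/// Parse a hypothetical `--available-ram <GiB>` flag from CLI arguments.
/// Returns None if the flag is absent or its value is not a number.
fn parse_available_ram(args: &[String]) -> Option<u64> {
    let pos = args.iter().position(|a| a == "--available-ram")?;
    args.get(pos + 1)?.parse().ok()
}

fn main() {
    // Default budget if the operator passes nothing (the 30 GiB discussed above).
    const DEFAULT_GIB: u64 = 30;
    let args: Vec<String> = std::env::args().collect();
    let budget = parse_available_ram(&args).unwrap_or(DEFAULT_GIB);
    println!("verification memory budget: {budget} GiB");
}
```

A static flag like this keeps infra planning predictable: the operator states the budget once, rather than the node re-deriving it from whatever the machine happens to report at startup.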