Client: Explore Multithreading Opportunities

holgerd77 commented 1 month ago

Today I re-discovered all the problem sets and complexities coming up when thinking about how/where to multi-thread the client, and I wanted to give this some write-up, so that it's not necessary to re-discover again and again and can concentrate on broader or selective solutions.

Technological Options

So there are basically three options on where to build upon: Web Workers are a somewhat long (2009) existing web standard. Node.js does not support Web Workers (bun e.g. supports a subset), there is a web-worker package which might be able to somewhat mitigate.

Node.js comes with a Worker Threads API, I think somewhat close to, but not compatible with Web Workers. Bun supports worker threads. Browsers does not.

There is a very very heavily (7Mio Downloads per week) used library workerpool, where the feature description sounds very promising (very small, works in Node + browser).

My current tendency would be to go and try with workerpool. As far as I got it I think though that the general design of these libraries + the structural changes needed for multithreading is for all solutions pretty similar. So I think it is not too dangerous respectively too much of a loss to choose 1 solution + be not 100% sure and eventually switch later. 90% of the work will nevertheless be able to just stay as it is.

(Eventual) Technical Challenges / Limitations

Very Limited Data Exchange

There are strong limitations on what type of data can be exchanged between workers/threads and one basically needs to limit to basic/primitive data types (string) or some built-in JS low-level types (ArrayBuffer), also objects without functions. Passing functions or objects with functions is not allowed, see Node.js docs here or the respective direct Structured Clone Algorithm docs (basically everything which can be structured-cloned can be used).

This has strong implications on the design space. Sharing Common? Not allowed. Passing a tx object? Not allowed. Sharing client config? Not allowed (in the current form).

Useful only for CPU-intense Tasks (not Networking e.g.)

Using threads seems only be useful for CPU-intense tasks, I've read this now from several sources, here is one stack overflow answer especially on networking:

This is not a highly supported answer, at least marked as correct though (couldn't find a better source on this specific one quickly). So for now I would assume respectively it seems that e.g. taking out our networking layer/devp2p will likely not have the desired effect.

Multi-threaded DB Access

Ok, I also wanted to write something limitational here, but this might be the more positive part: I think I misread on previous rounds and did not make a proper process/thread distinction (I initially thought concurrent DB access from different workers is not possible). Seems so I am wrong and LevelDB e.g. is thread-safe.

Potential Starting Points

Go Smaller

One way to approach might be to think a lot smaller on first round, and try to micro-threadify very singular tasks where a low amount of data is passed.For the workerpool libary there is dedicated functionality part where functions can be offloaded dynamically. So this might be suited here.

A particularly good point to start here are likely the crytographic functions (so: the ones which run long), like signature verification, hashing, KZG, ...

Extract EVM Execution

While this might be a possible thing, this is a significantly bigger task and needs some several preparatory tasks/refactoring to get this handlable. Basically the "data-intertwistings" between the vmexecution.ts code and the rest of the client must be significantly and sustainably reduced and simplified, likely in several refactoring rounds.

Before we start on this we should likely go through this step-by-step and write down a separate "sub" issue on this respectively check for feasibility and how practical this remains, and identify the separate tasks necessary. Otherwise it might very well be possible that we ran into some dead end on step 9 from 12 and already have some significant amount of ressources wasted.

So, basically one needs to look closely into VMExecution, see what data is necessary there to both be passed in and get out and see if this can be practically changed to simple data not containing any functions.

Some steps which are likely possible:

Encapsulating the client config data in a dict separate from the broader Config class, pass only that
Pass only the "core" Common settings (and maybe also: combine this in Common as a combined dict), like chain, hf, EIP and use that

Note 1: These kind of goals actually also are strongly aligned with our tree shaking goals, so making the libraries lighter in itself. So if we go (in extreme case) "pure data" for Common (the thing passed around is basically only the 5 important different data points with no functions), the whole thing itself will get a lot lighter, which will help (e.g. performance) in a lot of places and also will help here. Same goes (in the extreme case) e.g. for txs and blocks.

Note 2: Since the EVM execution is such a heavy thing to separate (with potential very strong effects), it might (will likely) be worth to add some new additional steps (e.g. re-serialize a block to pass over the worker if necessary), since the stuff saved will very much outweight the additional costs.

holgerd77 commented 1 month ago

Ok, here is a - a the end working (was complicated enough) - workerpool test with a simple keccak256 call, increasing the execution time from 80 to 120 ms: #3697 😂.

jochem-brouwer commented 1 month ago

Hi Holger, thanks for writing this down, I actually wanted to take a deep dive into this shortly but this is already a great start :heart_eyes:

Some notes which I encountered reading your post:

In Node.js everything runs in parallel, except your code. What this means is that all I/O code that you write in Node.js is non-blocking, while (conversely) all non-I/O code that you write in Node.js is blocking.

(from https://www.npmjs.com/package/workerpool)

This thus means that writing a worker pool to offload networking requests is doing double work (nodejs already does this), but, processing the responses of this network pool does not (since the responses end up in the same pool as they were requested in, i.e. currently our only core). So, if we have devp2p requests which take a lot of computation time to process, we could offload this in worker pools (I currently do not think that processing any devp2p request is CPU-intensive, it is rather that we have to request-and-respond to a big count of requests!)

Sharing config and common

While we indeed cannot share config/common since these have methods, I don't think sharing the config/common backend themselves will be hard, since this is "just" copying over the parameters (and then loading them into the right class so we can also use the methods on top of the parameters/data)

Starting point

I think the best starting point to actually see progress here is indeed extracting VMExecution. I would think of this as a different viewpoint here to make it "simpler" to reason about. I would suggest to think of actually adding a thread here which would then activate the VMExectution. Here is the approach I would take:

Do not edit VMExecution as much, except the methods which call into VM.runBlock are now used as "offloaders" to the worker thread(s?) to signal those to start doing the job. These methods are:

  /**
   * Executes the block, runs the necessary verification on it,
   * and persists the block and the associate state into the database.
   * The key difference is it won't do the canonical chain updating.
   * It relies on the additional {@link VMExecution.setHead} call to finalize
   * the entire procedure.
   * @param receipts If we built this block, pass the receipts to not need to run the block again
   * @param optional param if runWithoutSetHead should block for execution
   * @param optional param if runWithoutSetHead should skip putting block into chain
   * @returns if the block was executed or not, throws on block execution failure
   */
  async runWithoutSetHead(
    opts: RunBlockOpts & { parentBlock?: Block },
    receipts?: TxReceipt[],
    blocking: boolean = false,
    skipBlockchain: boolean = false,
  ): Promise<boolean> {

  /**
   * Runs the VM execution
   * @param loop Whether to continue iterating until vm head equals chain head (default: true)
   * @returns number of blocks executed
   */
  async run(loop = true, runOnlyBatched = false): Promise<number> {

  /**
   * Execute a range of blocks on a copy of the VM
   * without changing any chain or client state
   *
   * Possible input formats:
   *
   * - Single block, '5'
   * - Range of blocks, '5-10'
   */
  async executeBlocks(first: number, last: number, txHashes: string[]) {

I would approach this as this: do NOT edit the logic of these methods at all, however, ONLY when calling into VM.runBlock (only 3 occurences thus in VMExecution listed in the methods above) offload the params into the worker thread. I think that the returned value of VM.runBlock can be stringified (so we can share those among the processes), so the worker process would run the block, stringify the result, and then send it back to the main client thread where it will then be converted to the correct object in the main thread.

This means that the worker pool would need to setup the VM with the right params (things which come to mind here is: the Common, and connecting to the DB). I think the hardest part here would be to ensure that it is connected to the DB (but, it as linked in the main post we can use LevelDB in multiple thread, maybe using multileveldown). If we do this correctly, then if the worker thread gets a call (i.e. runBlock is called from the client main thread) it should be able to run the VM out of the box.

I think this would be a great starting point and this would likely also pay off, since this offloads executing blocks in the VM directly to another thread, i.e. each time a runBlock is called it will not (sequentially) full to Promise event queue. I think we can directly see effects on devnets, if the tx activity there is somewhat low-ish then I think we should be able to stay synced!!

holgerd77 commented 1 month ago

Hi Jochem, thanks for directly picking this up and thinking this further along!

I would agree that extracting VMExecution would bring the most benefit and might be the most approachable thing right now. But just be warned: if one scrolls through the code a bit in some slow mode, one (re-)discovers that there are a lot of outer dependencies, references to other objects (ReceiptManager, Preimages, Config, Chain,....) and these again have other connections within the client. I would think when we start with this task directly things are getting so complex pretty fast that it becomes too complex too handle and there is a strong likelihood that we fail to do. I would compare this a bit with the VM test runner situation, where we tried to refactor and it was ad-hoc just not possible. So I would rather think there are 5-10 separate refactoring tasks before needed, to get this code under control again. But I also totally do not want to talk you out of that and you are super-invited to proof me wrong 😁, and in the worse case (that you realize its not possible in one go) we will nevertheless have collected a lot of experiences along the road which we can further build upon.

holgerd77 commented 1 month ago

So, as you can "read through the lines" (haha), I've gotten a bit pessimistic on this (again, were there before) in general, but again: worth to further explore.

I for my side will concentrate a bit on lower level continued performance/load optimizations in the next time, think there is still some room there, e.g. optimizing the devp2p load a bit better with things like maybe slowing down discovery (this is pretty "hammery") when more peers are found, with some more differentiated mechanism, things like that.

Also I'm still impressed by these "super slow down" (so basically: in the client runs (e.g. this case Gajinder showed) execution is just not getting "a bit slower" - by some few seconds or so - but literally completely "breaks together" more or less). I wonder if it is possible to fully debug/explore what is really happening there, this might also be something which is more like a "bug" and which can be fixed in some way.

But all nothing to put "against multi-threading" or the tasks from here, simply mentioning a bit what I am planning. 🙂

ethereumjs / ethereumjs-monorepo