Parallelize Memory Trace Processing

Issue

Approach

Register and RAM processing now run in parallel, which speeds up the memory_trace_processing segment by roughly 30% on an M2 Pro. The segment is still bottlenecked by the RAM task, which takes longer to complete than the register task.
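To make the shape of the change concrete, here is a minimal sketch of running the register and RAM work as two concurrent tasks. It is not the code from this PR: `MemoryOp`, `REGISTER_COUNT`, `process_region`, and `process_trace` are hypothetical names, the per-region work is a placeholder sum, and `std::thread::scope` is used only to keep the example dependency-free. The actual change may use a different mechanism.

```rust
use std::thread;

// Hypothetical trace entry: an address plus the value written there.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct MemoryOp {
    pub address: u64,
    pub value: u64,
}

// Assumed layout: addresses below this bound are registers, the rest is RAM.
pub const REGISTER_COUNT: u64 = 32;

// Placeholder for the real per-region work: a wrapping sum of the values.
fn process_region(ops: &[MemoryOp]) -> u64 {
    ops.iter().fold(0u64, |acc, op| acc.wrapping_add(op.value))
}

// Splits the trace into register and RAM ops and processes both concurrently.
// Returns (register_result, ram_result).
pub fn process_trace(trace: &[MemoryOp]) -> (u64, u64) {
    let (registers, ram): (Vec<MemoryOp>, Vec<MemoryOp>) = trace
        .iter()
        .copied()
        .partition(|op| op.address < REGISTER_COUNT);

    thread::scope(|s| {
        // Register work runs on a spawned thread...
        let register_task = s.spawn(|| process_region(&registers));
        // ...while the RAM work runs on the current thread.
        let ram_result = process_region(&ram);
        let register_result = register_task.join().expect("register task panicked");
        (register_result, ram_result)
    })
}
```

With only two tasks, wall-clock time is bounded by the slower one, which is why the RAM task remains the bottleneck.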

Ideally, the next step would be to subdivide the RAM task by address range so that the work spreads across more cores. However, testing with fib_e2e and sha3_e2e revealed that RAM accesses are unevenly distributed across the address space. If we split the address space at >= 32 into N fixed-size blocks, for example, and assign one core per block, then in practice only 2-3 cores would do any work while the rest sit idle.
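The imbalance can be checked by counting how many accesses fall into each fixed-size block. The sketch below is illustrative only: `RAM_START`, `accesses_per_block`, and `busy_blocks` are hypothetical helpers, and the upper bound `ram_end` is an assumed parameter rather than a value taken from the codebase.

```rust
// Assumed layout: RAM addresses start at this bound (registers sit below it).
pub const RAM_START: u64 = 32;

// Counts how many RAM accesses land in each of `num_blocks` equal-sized
// address blocks covering [RAM_START, ram_end). Addresses outside that
// range are ignored.
pub fn accesses_per_block(addresses: &[u64], ram_end: u64, num_blocks: usize) -> Vec<usize> {
    assert!(num_blocks > 0 && ram_end > RAM_START);
    let span = ram_end - RAM_START;
    // Ceiling division so the blocks cover the whole range.
    let block_size = (span + num_blocks as u64 - 1) / num_blocks as u64;
    let mut counts = vec![0usize; num_blocks];
    for &addr in addresses {
        if addr >= RAM_START && addr < ram_end {
            counts[((addr - RAM_START) / block_size) as usize] += 1;
        }
    }
    counts
}

// Number of blocks (and hence cores) that would receive any work at all.
pub fn busy_blocks(counts: &[usize]) -> usize {
    counts.iter().filter(|&&c| c > 0).count()
}
```

Running `accesses_per_block` over the addresses of a real trace and comparing `busy_blocks` with N shows how many cores a fixed-block split would actually keep busy.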

a16z / jolt

Parallelize Memory Trace Processing #338
