Issue
https://github.com/a16z/jolt/issues/292
Approach
Parallelized the processing of registers and RAM, which speeds up the memory_trace_processing segment by about 30% on an M2 Pro. However, overall performance is still largely bottlenecked by the RAM task, which takes longer to complete than the register task.
Ideally, the next step would be to subdivide the RAM task into smaller subtasks by address range and distribute them across multiple cores. However, testing with fib_e2e and sha3_e2e revealed that RAM accesses are unevenly distributed across the address space. If we split the RAM address space (addresses >= 32) into N fixed-size blocks, for example, and assign one CPU per block, then in practice only 2-3 cores would be working while the rest sit idle.
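The idea can be illustrated with a minimal, standalone sketch. This is not the code from this PR: the names `split_trace`, `process`, `process_in_parallel`, and `block_histogram` are hypothetical, the trace is reduced to a list of accessed addresses, and `RAM_START = 32` is an assumption taken from the ">=32" boundary mentioned above. It uses only `std::thread::scope`, where real code might use a thread pool instead.

```rust
use std::thread;

/// Assumed boundary: addresses below this are registers, the rest are RAM.
const RAM_START: u64 = 32;

/// Splits a trace of accessed addresses into (register, RAM) accesses.
fn split_trace(trace: &[u64]) -> (Vec<u64>, Vec<u64>) {
    trace.iter().copied().partition(|&addr| addr < RAM_START)
}

/// Stand-in for the real per-access work: here it only counts accesses.
fn process(accesses: &[u64]) -> usize {
    accesses.len()
}

/// Runs the register task on a spawned thread and the RAM task on the
/// current thread. The wall time is bounded by the slower of the two tasks.
fn process_in_parallel(registers: &[u64], ram: &[u64]) -> (usize, usize) {
    thread::scope(|s| {
        let reg_handle = s.spawn(|| process(registers));
        let ram_result = process(ram);
        let reg_result = reg_handle.join().expect("register task panicked");
        (reg_result, ram_result)
    })
}

/// Counts RAM accesses per fixed-size block of [RAM_START, ram_end).
/// A skewed histogram means most per-block workers would sit idle.
fn block_histogram(ram: &[u64], ram_end: u64, blocks: usize) -> Vec<usize> {
    assert!(blocks > 0 && ram_end > RAM_START);
    let span = ram_end - RAM_START;
    // Round up so that `blocks` blocks always cover the whole span.
    let block_size = (span + blocks as u64 - 1) / blocks as u64;
    let mut counts = vec![0usize; blocks];
    for &addr in ram {
        if addr >= RAM_START && addr < ram_end {
            counts[((addr - RAM_START) / block_size) as usize] += 1;
        }
    }
    counts
}
```

Running `block_histogram` over the RAM accesses of a real trace would show how many fixed-size blocks actually receive work, which is the measurement that argues against a one-CPU-per-block split here.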