jsign closed this 11 months ago
@kevaundray, I've tested this in Kaustinen and everything looks fine, so I don't think I can get a better signal before making this ready for review.
I guess the ultimate signal will be when more clients (ones not using go-ipa) use Kaustinen, but that can take a while. And we'll eventually find out anyway.
This PR improves the performance of the proof generator and verifier. There were many changes that I'll describe below.
### What are the changes?
Optimizations that improved the performance of both the prover and verifier:

- Transcript labels were passed as `string`s, which eventually needed to escape to the heap since the `Hasher` requires `[]byte`. Also, the labels are fixed (e.g. `C`, `w`, `t`, `L`, `R`, and domain separators), so it doesn't even make sense for them to escape once per transcript write (e.g. for 100k points, we'd be escaping constant strings to the heap). This is easily fixed by defining `[]byte` constants and not requiring `string`s, since the hasher doesn't mutate their values.

This massively reduced the number of allocations, creating less memory garbage and thus less GC pressure for the clients. This is also good for second-order effects (i.e. the GC runs less often and thus burns less CPU).
Optimizations that improved the prover (apart from the above ones):

- Most of the points to be opened are in projective form with `Z != 1`. This is the case since there were operations at the Verkle Tree data-structure layer (e.g. updating commitments after tree writes). When including these points in the FS transcript, we need to transform them to affine coordinates. Despite already batching all the inverses (e.g. for transforming 10k openings, we do one inversion and not 10k), the Montgomery trick still requires a decent number of multiplications. These are all independent of each other, so this work is now parallelized.

### Prover benchmarks
(Note: 128k openings is more than double the worst-case scenario estimations, I think? We'll have a better idea after some Kaustinen inspection, probably. In any case, I included this case as a wild upper bound.)
Here, I show benchmarks (before/after) for the prover in two setups:
AMD Ryzen 7 3800XT prover:
Notes:

- `time/op` got a massive speedup, as expected, due to parallelization and the FS-buffering (plus probably fewer allocs and less GC pressure).
- `alloc (MB)/op` (i.e. total memory allocated) got a bump due to parallelization, since each goroutine requires memory at the same time. In relative terms it looks like a decent bump, but in absolute terms I don't think it's a big deal. Clients could limit this if we let them configure the amount of parallelism, but that's a tradeoff between speed and memory usage. Note that the above CPU has 16 virtual cores (e.g. see the Rock5B benchmark below; since it has 8 cores instead of 16, it uses less memory).
- `alloc (count)/op` got a massive reduction. We're doing fewer allocations, meaning the GC must clean up less garbage. Note that we're generating less garbage but using more memory. This is fine in the sense that GC overhead is related to the number of allocations, not their size.

Note: we could push further on reducing allocations, which is always an interesting dance in Go... but that could mean introducing extra complexity, which I'm not convinced is justified for what we might gain. So let's try to get some signal that it's worth doing first.
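For context on where these columns come from: Go's built-in benchmark harness reports all three directly. A minimal sketch (`proveOpenings` is a hypothetical stand-in for the real prover, not go-ipa's API):

```go
package main

import (
	"fmt"
	"testing"
)

// proveOpenings is a hypothetical stand-in for the real prover; it just
// allocates a little so the allocation counters have something to report.
func proveOpenings() []byte {
	return make([]byte, 32)
}

// BenchmarkProver shows how the time/op and alloc/op columns are produced:
// b.ReportAllocs() makes `go test -bench` print allocs/op and B/op next to
// ns/op, which tools like benchstat then compare before/after.
func BenchmarkProver(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = proveOpenings()
	}
}

func main() {
	// testing.Benchmark lets us run the benchmark outside `go test`.
	res := testing.Benchmark(BenchmarkProver)
	fmt.Println(res.String(), res.MemString())
}
```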
Rock5B prover:
### Verifier benchmarks
AMD Ryzen 7 3800XT verifier:
Rock5B verifier:
### Tangent: verifier vs prover speed?
I had a personal feeling that the verifier isn't that much faster than the prover. It's faster by a double-digit percentage, but not dramatically so. For the Rock5B the difference is a bit bigger: e.g. for 64k openings, the Rock5B prover takes 890ms and the verifier 617ms (1.44x). On my CPU, the prover takes 160ms and the verifier 119ms (1.34x).
Taking a further look at, for example, the 100k-openings case: more than half of that time is spent in an MSM of length 100k (we need this to compute `E`, i.e. the linear combination of the `C`s with powers of `r`). This is parallelized by gnark-crypto, but it's still a 100k MSM, which is quite massive, so I guess it makes sense. A big part of the rest is appending 100k elements to the FS transcript (i.e. for each opening we have to append `C`, `y`, and `z`, so that's `32 bytes * 3 * 100k`, which is a decent amount of stuff to serialize and hash).

The prover is quite fast since, apart from all the tricks and now parallelization, it can do most of the work in evaluation form thanks to the "grouping by evaluation point" that we do, so no MSM is required there. (It still has to append the 100k `(C, y, z)` triples too, so that part is the same for both.)
Anyway, this is just a comment in case this was surprising to some other reader. The verifier's "IPA verification" part is very fast and constant (as expected, <10ms), so most of the overhead comes from the multiproof part, which depends on the number of openings.
One last note: the Go standard library implementation of sha256 doesn't leverage SIMD instructions for sha256 even if available in the CPU. I think this is planned for the next version of Go. I did a test with a Go library that does this, and the FS part gets a decent speedup; but quite honestly, I'd prefer to stick with the standard library's sha256, since it's quite a delicate dependency to swap just to save a dozen ms (tested on my CPU).
### TODOs
I'll keep this as a draft until: