crate-crypto / go-ipa

A Go implementation of cryptographic primitives for Verkle Trees
https://verkle.dev
Apache License 2.0
32 stars 14 forks source link

multiproof: improve performance of prover and verifier #63

Closed jsign closed 11 months ago

jsign commented 11 months ago

This PR improves the performance of the proof generator and verifier. There were many changes that I'll describe below.

Which are the changes?

Optimizations that improved the performance of both the prover and verifier:

The first point massively reduced the number of allocations, creating less memory garbage and thus less GC pressure for the clients. This is good for secondary order effects (i.e: GC runs less often and thus burns less CPU).

Optimizations that improved the prover (apart from the above ones):

Prover benchmarks

(Note: 128k openings are more than double the worst-case scenario estimations, I think? We'll have more idea after some Kaustinen inspection, probably. In any case, I included this case as a wild upper bound)

Here, I show benchmarks (before/after) for the prover in two setups:

AMD Ryzen 7 3800XT prover:

name                                   old time/op    new time/op    delta
ProofGeneration/numopenings=2000-16      64.1ms ± 2%    54.3ms ± 1%   -15.38%  (p=0.000 n=10+10)
ProofGeneration/numopenings=16000-16      227ms ± 1%      81ms ± 2%   -64.14%  (p=0.000 n=9+10)
ProofGeneration/numopenings=32000-16      524ms ± 1%     108ms ± 1%   -79.43%  (p=0.000 n=9+9)
ProofGeneration/numopenings=64000-16      1.48s ± 4%     0.16s ± 2%   -89.21%  (p=0.000 n=10+9)
ProofGeneration/numopenings=128000-16     4.79s ± 3%     0.26s ± 1%   -94.55%  (p=0.000 n=10+9)

name                                   old alloc/op   new alloc/op   delta
ProofGeneration/numopenings=2000-16      5.50MB ± 0%   16.61MB ± 0%  +201.83%  (p=0.000 n=10+9)
ProofGeneration/numopenings=16000-16     8.76MB ± 0%   41.81MB ± 0%  +377.47%  (p=0.000 n=10+10)
ProofGeneration/numopenings=32000-16     12.8MB ± 0%    48.0MB ± 0%  +276.44%  (p=0.000 n=10+10)
ProofGeneration/numopenings=64000-16     21.5MB ± 0%    59.7MB ± 0%  +177.16%  (p=0.000 n=10+9)
ProofGeneration/numopenings=128000-16    38.0MB ± 0%    83.3MB ± 0%  +119.26%  (p=0.000 n=10+8)

name                                   old allocs/op  new allocs/op  delta
ProofGeneration/numopenings=2000-16       17.3k ± 0%      6.6k ± 0%   -61.76%  (p=0.000 n=10+10)
ProofGeneration/numopenings=16000-16       101k ± 0%        9k ± 0%   -91.03%  (p=0.000 n=10+7)
ProofGeneration/numopenings=32000-16       197k ± 0%        9k ± 0%   -95.36%  (p=0.000 n=10+10)
ProofGeneration/numopenings=64000-16       389k ± 0%        9k ± 0%   -97.65%  (p=0.000 n=10+9)
ProofGeneration/numopenings=128000-16      773k ± 0%        9k ± 0%   -98.82%  (p=0.000 n=10+9)

Notes:

Note: We could push further to reducing allocations, which is always an interesting dance in Go... that could mean introducing extra complexity, which I'm not convinced is justified for what we might gain. So let's try to get some signal that is worth doing.

Rock5B prover:

name                                  old time/op    new time/op    delta
ProofGeneration/numopenings=2000-8       367ms ± 4%     359ms ± 6%      ~     (p=0.156 n=9+10)
ProofGeneration/numopenings=16000-8      1.03s ± 2%     0.45s ± 4%   -56.23%  (p=0.000 n=10+10)
ProofGeneration/numopenings=32000-8      1.99s ± 1%     0.62s ± 3%   -68.78%  (p=0.000 n=10+10)
ProofGeneration/numopenings=64000-8      4.59s ± 1%     0.89s ± 4%   -80.58%  (p=0.000 n=10+10)
ProofGeneration/numopenings=128000-8     12.5s ± 0%      1.5s ± 2%   -87.85%  (p=0.000 n=10+10)

name                                  old alloc/op   new alloc/op   delta
ProofGeneration/numopenings=2000-8      5.52MB ± 0%   14.33MB ± 0%  +159.74%  (p=0.000 n=8+10)
ProofGeneration/numopenings=16000-8     9.18MB ± 0%   25.61MB ± 0%  +179.12%  (p=0.000 n=10+9)
ProofGeneration/numopenings=32000-8     13.3MB ± 0%    31.7MB ± 0%  +138.45%  (p=0.000 n=10+10)
ProofGeneration/numopenings=64000-8     21.5MB ± 0%    43.6MB ± 0%  +102.80%  (p=0.000 n=10+9)
ProofGeneration/numopenings=128000-8    38.0MB ± 0%    69.6MB ± 0%   +83.46%  (p=0.000 n=10+10)

name                                  old allocs/op  new allocs/op  delta
ProofGeneration/numopenings=2000-8       17.2k ± 0%      6.2k ± 0%   -63.96%  (p=0.000 n=10+10)
ProofGeneration/numopenings=16000-8       101k ± 0%        7k ± 0%   -93.14%  (p=0.000 n=10+9)
ProofGeneration/numopenings=32000-8       197k ± 0%        7k ± 0%   -96.47%  (p=0.000 n=10+10)
ProofGeneration/numopenings=64000-8       389k ± 0%        7k ± 0%   -98.21%  (p=0.000 n=10+10)
ProofGeneration/numopenings=128000-8      773k ± 0%        7k ± 0%   -99.10%  (p=0.000 n=10+10)

Verifier benchmarks

AMD Ryzen 7 3800XT verifier:

name                                     old time/op    new time/op    delta
ProofVerification/numopenings=2000-16      13.3ms ± 2%     8.9ms ± 2%  -33.10%  (p=0.000 n=9+10)
ProofVerification/numopenings=16000-16     59.8ms ± 2%    37.2ms ± 2%  -37.88%  (p=0.000 n=10+10)
ProofVerification/numopenings=32000-16      111ms ± 1%      67ms ± 2%  -39.90%  (p=0.000 n=10+10)
ProofVerification/numopenings=64000-16      208ms ± 2%     119ms ± 2%  -42.69%  (p=0.000 n=10+9)
ProofVerification/numopenings=128000-16     392ms ± 1%     214ms ± 1%  -45.45%  (p=0.000 n=9+10)

name                                     old alloc/op   new alloc/op   delta
ProofVerification/numopenings=2000-16      1.22MB ± 0%    1.50MB ± 0%  +23.17%  (p=0.000 n=10+10)
ProofVerification/numopenings=16000-16     7.84MB ± 0%   10.11MB ± 0%  +28.97%  (p=0.000 n=9+10)
ProofVerification/numopenings=32000-16     15.4MB ± 0%    19.9MB ± 0%  +29.54%  (p=0.000 n=10+10)
ProofVerification/numopenings=64000-16     30.5MB ± 0%    39.6MB ± 0%  +29.79%  (p=0.000 n=6+10)
ProofVerification/numopenings=128000-16    60.8MB ± 0%    79.0MB ± 0%  +29.92%  (p=0.000 n=10+10)

name                                     old allocs/op  new allocs/op  delta
ProofVerification/numopenings=2000-16       13.1k ± 0%      1.0k ± 0%  -92.21%  (p=0.000 n=10+10)
ProofVerification/numopenings=16000-16      97.1k ± 0%      1.0k ± 0%  -98.96%  (p=0.000 n=10+10)
ProofVerification/numopenings=32000-16       193k ± 0%        1k ± 0%  -99.48%  (p=0.000 n=10+10)
ProofVerification/numopenings=64000-16       385k ± 0%        1k ± 0%  -99.74%  (p=0.000 n=6+10)
ProofVerification/numopenings=128000-16      769k ± 0%        1k ± 0%  -99.87%  (p=0.000 n=10+10)

Rock5B verifier:

name                                    old time/op    new time/op    delta
ProofVerification/numopenings=2000-8      79.4ms ± 8%    57.4ms ±10%  -27.76%  (p=0.000 n=10+10)
ProofVerification/numopenings=16000-8      275ms ± 5%     210ms ± 7%  -23.46%  (p=0.000 n=10+10)
ProofVerification/numopenings=32000-8      502ms ± 7%     348ms ± 4%  -30.75%  (p=0.000 n=10+10)
ProofVerification/numopenings=64000-8      907ms ± 3%     617ms ± 4%  -31.97%  (p=0.000 n=10+10)
ProofVerification/numopenings=128000-8     1.77s ± 2%     1.18s ± 2%  -33.47%  (p=0.000 n=8+10)

name                                    old alloc/op   new alloc/op   delta
ProofVerification/numopenings=2000-8      1.21MB ± 0%    1.50MB ± 0%  +23.21%  (p=0.000 n=9+9)
ProofVerification/numopenings=16000-8     7.84MB ± 0%   10.11MB ± 0%  +28.98%  (p=0.000 n=10+10)
ProofVerification/numopenings=32000-8     15.4MB ± 0%    19.9MB ± 0%  +29.55%  (p=0.000 n=9+10)
ProofVerification/numopenings=64000-8     30.5MB ± 0%    39.6MB ± 0%  +29.80%  (p=0.000 n=10+10)
ProofVerification/numopenings=128000-8    60.8MB ± 0%    79.0MB ± 0%  +29.92%  (p=0.000 n=10+10)

name                                    old allocs/op  new allocs/op  delta
ProofVerification/numopenings=2000-8       13.1k ± 0%      1.0k ± 0%  -92.44%  (p=0.000 n=10+9)
ProofVerification/numopenings=16000-8      97.0k ± 0%      1.0k ± 0%  -98.99%  (p=0.000 n=10+10)
ProofVerification/numopenings=32000-8       193k ± 0%        1k ± 0%  -99.49%  (p=0.000 n=9+10)
ProofVerification/numopenings=64000-8       385k ± 0%        1k ± 0%  -99.75%  (p=0.000 n=10+10)
ProofVerification/numopenings=128000-8      769k ± 0%        1k ± 1%  -99.87%  (p=0.000 n=10+10)

Tangent: verifier vs prover speed?

I'd a personal feeling that the verifier isn't that faster than the prover. It's faster by double-digit %, but not incredibly faster. For the Rock5B case, the difference is a bit bigger. e.g: for 64k the Rock5B prover takes 890ms and verifier 617ms (1.44x). For my CPU, the prover takes 160ms and verifier 119ms (1.34x).

Taking a further look at, for example, the 100k openings case, more than half of that time is spent in an MSM of length 100k (we need this to compute E, i.e: the linear combination of Cs with powers of r). This is parallelized by gnark-crypto, but still is a 100k MSM which is quite massive, so I guess it makes sense. A big part of the rest is appending 100k elements to the FS-transcript (i.e: for each opening, we have to append C, y and z, so thats 32 bytes * 3 * 100K, which is a decent amount of stuff to serialize and hash).

The prover is quite fast since, apart from all the tricks and now parallelization, it can do most of the stuff in the evaluation form with the "grouping by evaluation point" that we do. So no MSM is required there. (Still have to append 100k (C+y+z), too, so that's the same for both).

Anyway, this is just a comment if this was surprising for some other reader. The verifier "IPA verification" part is very fast and constant (as expected, <10ms), so most of the overhead comes from the Multiproof part that is dependent on the number of openings.

The last note is that the Go standard library implementation of sha256, doesn't leverage SIMD instructions for sha256 if available in the CPU. I think this is planned in the next version of Go. I did a test with a Go library that does that, and the FS-part gets a decent speedup; but quite honestly, I'd prefer to stick to the Go standard library of sha256 since it's quite a delicate dependency to just save a dozen of ms (tested in my CPU).

TODOs

I'll keep this as a draft until:

jsign commented 11 months ago

@kevaundray, I've tested this in Kaustinen and all is looking fine, so I guess I can't get a better signal to make this ready for review.

I guess the ultimate signal will be when more clients (not using go-ipa) will use Kaustinen, but that can take a while. And, we'll eventually find out anyway.