jsign closed this 11 months ago
@kevaundray, I've tested this in Kaustinen and everything looks fine, so I don't think I can get a better signal before making this ready for review.
I guess the ultimate signal will be when more clients (ones not using go-ipa) use Kaustinen, but that can take a while. And we'll eventually find out anyway.
This PR improves the performance of the proof generator and verifier. There were many changes that I'll describe below.
### What are the changes?
Optimizations that improved the performance of both the prover and verifier:

- Transcript labels were passed as `string`s, which eventually needed to escape to the heap since the `Hasher` requires `[]byte`. Also, the labels are fixed (e.g. `C`, `w`, `t`, `L`, `R`, and domain separators), so it doesn't even make sense for them to escape once per transcript write (e.g. for 100k points, we'd be escaping constant strings to the heap). This is easily fixed by defining `[]byte` constants and not requiring `string`s, since the hasher doesn't mutate their values.

This massively reduced the number of allocations, creating less memory garbage and thus less GC pressure for the clients. This is also good for second-order effects (i.e. the GC runs less often and thus burns less CPU).
Optimizations that improved the prover (apart from the above ones):

- Most of the points to be opened are in projective form with `Z != 1`. This is the case since there were operations at the Verkle Tree data-structure layer (e.g. updating commitments after tree writes). When including these points in the FS transcript, we need to transform them to affine coordinates. Despite already batching all the inverses (e.g. for transforming 10k openings, we do one inversion and not 10k), the Montgomery trick still requires a decent number of multiplications. These are all independent of each other, so this work is now parallelized.

### Prover benchmarks
(Note: 128k openings is more than double the worst-case scenario estimations, I think? We'll have a better idea after some Kaustinen inspection, probably. In any case, I included this case as a wild upper bound.)
Here, I show benchmarks (before/after) for the prover in two setups:
AMD Ryzen 7 3800XT prover:
Notes:

- `time/op` got a massive speedup, as expected, due to parallelization and the FS-buffering (plus probably fewer allocs and less GC pressure).
- `alloc (MB)/op` (i.e. total memory allocated) got a bump due to parallelization, since each goroutine requires memory at the same time. In relative terms it looks like a decent bump, but in absolute terms I don't think it's a big deal. Clients could limit this if we let them configure the amount of parallelism, but that's a tradeoff between speed and memory usage. Note that the above CPU has 16 virtual cores (e.g. see the Rock5B benchmark below; since it has 8 cores instead of 16, it uses less memory).
- `alloc (count)/op` got a massive reduction. We're doing fewer allocations, meaning the GC must clean up less garbage. Note that we're generating less garbage but using more memory. This is fine in the sense that GC overhead is related to the number of allocations, not their size.

Note: we could push further on reducing allocations, which is always an interesting dance in Go... but that could mean introducing extra complexity, which I'm not convinced is justified for what we might gain. So let's try to get some signal that it's worth doing first.
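For context on where these columns come from: Go's built-in benchmark harness reports all three directly. A minimal sketch (`proveOpenings` is a hypothetical stand-in for the real prover, not go-ipa's API):

```go
package main

import (
	"fmt"
	"testing"
)

// proveOpenings is a hypothetical stand-in for the real prover; it just
// allocates a little so the allocation counters have something to report.
func proveOpenings() []byte {
	return make([]byte, 32)
}

// BenchmarkProver shows how the time/op and alloc/op columns are produced:
// b.ReportAllocs() makes `go test -bench` print allocs/op and B/op next to
// ns/op, which tools like benchstat then compare before/after.
func BenchmarkProver(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = proveOpenings()
	}
}

func main() {
	// testing.Benchmark lets us run the benchmark outside `go test`.
	res := testing.Benchmark(BenchmarkProver)
	fmt.Println(res.String(), res.MemString())
}
```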
Rock5B prover:
### Verifier benchmarks
AMD Ryzen 7 3800XT verifier:
Rock5B verifier:
### Tangent: verifier vs prover speed?
I had a personal feeling that the verifier isn't that much faster than the prover. It's faster by a double-digit percentage, but not dramatically so. For the Rock5B the difference is a bit bigger: e.g. for 64k openings, the Rock5B prover takes 890ms and the verifier 617ms (1.44x). On my CPU, the prover takes 160ms and the verifier 119ms (1.34x).
Taking a further look at, for example, the 100k-openings case: more than half of that time is spent in an MSM of length 100k (we need this to compute `E`, i.e. the linear combination of the `C`s with powers of `r`). This is parallelized by gnark-crypto, but it's still a 100k MSM, which is quite massive, so I guess it makes sense. A big part of the rest is appending 100k elements to the FS transcript (i.e. for each opening we have to append `C`, `y`, and `z`, so that's `32 bytes * 3 * 100k`, which is a decent amount of stuff to serialize and hash).

The prover is quite fast since, apart from all the tricks and now parallelization, it can do most of the work in evaluation form thanks to the "grouping by evaluation point" that we do, so no MSM is required there. (It still has to append the 100k `(C, y, z)` triples too, so that part is the same for both.)
Anyway, this is just a comment in case this was surprising to some other reader. The verifier's "IPA verification" part is very fast and constant (as expected, <10ms), so most of the overhead comes from the multiproof part, which depends on the number of openings.
One last note: the Go standard library implementation of sha256 doesn't leverage SIMD instructions for sha256 even if available in the CPU. I think this is planned for the next version of Go. I did a test with a Go library that does this, and the FS part gets a decent speedup; but quite honestly, I'd prefer to stick with the standard library's sha256, since it's quite a delicate dependency to swap just to save a dozen ms (tested on my CPU).
### TODOs
I'll keep this as a draft until: