Closed alxiong closed 6 months ago
I was flooded with timer output when profiling the advz dispersal:
cargo test --release -p jf-primitives --features test-srs,icicle,kzg-print-trace,parallel,gpu-vid -- disperse_timer --ignored --nocapture
It's calling `batch_commit`, which launches many `commit`s in parallel, and each of them starts a timer in a nested manner because the `kzg-print-trace` feature is on. Should we remove the inner timers in `commit`, and add cfg feature flags to the `batch_commit` timers?
nvm, we shouldn't have `kzg-print-trace` on.
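For reference, the idea floated in the question above (no per-`commit` timer inside the batch, one feature-gated timer around the batch) could look roughly like the sketch below. Everything here is a hypothetical stand-in: `timed`, `commit_inner`, and the toy `u64` "commitment" are not the jellyfish API, the loop is sequential rather than parallel, and the real code uses timer macros rather than `Instant`.

```rust
use std::time::Instant;

/// Hypothetical helper: runs `f` and prints the elapsed time only when the
/// `kzg-print-trace` feature is enabled at compile time.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    if cfg!(feature = "kzg-print-trace") {
        let start = Instant::now();
        let out = f();
        println!("End: {label} ... {:?}", start.elapsed());
        out
    } else {
        f()
    }
}

/// Toy stand-in for a single commitment; it carries no timer of its own.
fn commit_inner(poly: &[u64]) -> u64 {
    poly.iter()
        .fold(0u64, |acc, c| acc.wrapping_mul(31).wrapping_add(*c))
}

/// Public single commit: timed, because it is a top-level call.
fn commit(poly: &[u64]) -> u64 {
    timed("commit", || commit_inner(poly))
}

/// Batch commit: one outer timer, and the inner calls bypass the per-commit
/// timer so the trace is not flooded with nested entries.
fn batch_commit(polys: &[Vec<u64>]) -> Vec<u64> {
    timed("batch commit", || {
        polys.iter().map(|p| commit_inner(p)).collect()
    })
}
```

With the feature off, `timed` reduces to a plain call, so the batch path prints nothing.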
My local run results are listed below. Batch commit performance is still not perfect. Will check later
Start: KZG10::Setup with prover degree 256 and verifier degree 1
··Start: Generating powers of G
··End: Generating powers of G ..................................................3.119ms
End: KZG10::Setup with prover degree 256 and verifier degree 1 .................4.762ms
Start: VID disperse 33554432 payload bytes to 512 nodes
··Start: encode payload bytes into polynomials
··End: encode payload bytes into polynomials ...................................67.697ms
··Start: compute all storage node evals for 4229 polynomials with 256 coefficients
··End: compute all storage node evals for 4229 polynomials with 256 coefficients 147.929ms
··Start: compute merkle root of all storage node evals
··End: compute merkle root of all storage node evals ...........................74.827ms
··Start: compute 4229 KZG commitments
··End: compute 4229 KZG commitments ............................................138.304ms
··Start: compute aggregate proofs for 512 storage nodes
····Start: compute h_poly
····End: compute h_poly ........................................................53.948ms
····Start: gen eval proofs with parallel_factor 2 and num_points 512
····End: gen eval proofs with parallel_factor 2 and num_points 512 .............50.675ms
··End: compute aggregate proofs for 512 storage nodes ..........................106.559ms
··Start: assemble shares for dispersal
··End: assemble shares for dispersal ...........................................37.821ms
End: VID disperse 33554432 payload bytes to 512 nodes ..........................603.895ms
Start: VID disperse 33554432 payload bytes to 512 nodes
··Start: encode payload bytes into polynomials
··End: encode payload bytes into polynomials ...................................28.887ms
··Start: compute all storage node evals for 4229 polynomials with 256 coefficients
··End: compute all storage node evals for 4229 polynomials with 256 coefficients 56.072ms
··Start: compute merkle root of all storage node evals
··End: compute merkle root of all storage node evals ...........................66.781ms
··Start: compute 4229 KZG commitments
····Start: batch commit 4229 polynomials
····End: batch commit 4229 polynomials .........................................1.532s
··End: compute 4229 KZG commitments ............................................1.532s
··Start: compute aggregate proofs for 512 storage nodes
····Start: compute h_poly
····End: compute h_poly ........................................................58.700ms
····Start: gen eval proofs with parallel_factor 2 and num_points 512
····End: gen eval proofs with parallel_factor 2 and num_points 512 .............49.792ms
··End: compute aggregate proofs for 512 storage nodes ..........................110.447ms
··Start: assemble shares for dispersal
··End: assemble shares for dispersal ...........................................14.715ms
End: VID disperse 33554432 payload bytes to 512 nodes ..........................1.836s
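In the second run above, the batch commit accounts for 1.532s of the 1.836s total, roughly 83% of the dispersal. A small parser for the `End:` lines makes that kind of breakdown easy to reproduce. This is only a sketch: `TimerEnd`, `parse_end_line`, and `share_of_total` are hypothetical names, and the line format is assumed from the trace pasted in this thread (two `·` per nesting level, a run of dots or a space before the value, and units `ns`, `µs`, `ms`, or `s`).

```rust
/// One parsed `End:` line: nesting depth, label, and elapsed milliseconds.
#[derive(Debug, PartialEq)]
struct TimerEnd {
    depth: usize,
    label: String,
    millis: f64,
}

/// Parses a single trace line; returns `None` for `Start:` lines or anything
/// that does not match the assumed format.
fn parse_end_line(line: &str) -> Option<TimerEnd> {
    let body = line.trim_start_matches('·');
    // Two '·' characters per nesting level in the pasted trace.
    let depth = line[..line.len() - body.len()].chars().count() / 2;
    let rest = body.strip_prefix("End:")?.trim();
    // Check the two-character units before the bare "s".
    let (value, scale) = if let Some(v) = rest.strip_suffix("ms") {
        (v, 1.0)
    } else if let Some(v) = rest.strip_suffix("µs") {
        (v, 1e-3)
    } else if let Some(v) = rest.strip_suffix("ns") {
        (v, 1e-6)
    } else if let Some(v) = rest.strip_suffix('s') {
        (v, 1e3)
    } else {
        return None;
    };
    // Everything before the trailing run of digits and dots is the label.
    let label_part = value.trim_end_matches(|c: char| c.is_ascii_digit() || c == '.');
    // The trailing run is filler dots followed by the number itself.
    let number = value[label_part.len()..].trim_start_matches('.');
    let millis = number.parse::<f64>().ok()? * scale;
    Some(TimerEnd {
        depth,
        label: label_part.trim().to_string(),
        millis,
    })
}

/// Fraction of the `total` step's time spent in `step`, matching each by
/// label prefix and taking the first hit.
fn share_of_total(log: &str, step: &str, total: &str) -> Option<f64> {
    let ends: Vec<TimerEnd> = log.lines().filter_map(parse_end_line).collect();
    let find = |prefix: &str| {
        ends.iter()
            .find(|e| e.label.starts_with(prefix))
            .map(|e| e.millis)
    };
    Some(find(step)? / find(total)?)
}
```

Calling `share_of_total(log, "batch commit", "VID disperse")` on the second run's lines gives the fraction quoted above.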
Description
closes: #521
Major changes include:
- Rename `struct Advz` to `struct AdvzInternal`, including all of its implementations
- `type Advz` (which is a direct replacement of the original struct with the same interface and generic param) and `type AdvzGPU`, which contains more icicle-related type declarations
- `trait MaybeGPU` to allow specialized trait bounds on `kzg_batch_commit()`
- `gpu-vid`
- Because of `struct HostOrSlice`, which affects how we write `GPUCommit::commit_on_gpu`, we changed the API of `VidScheme::commit_only/disperse` to accept `&mut self` instead of `&self`
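The overall shape of these changes can be illustrated with a heavily simplified sketch. It is not the PR's code: the real `AdvzInternal` carries pairing and hash generics, while here the only type parameter is a hypothetical backend (`CpuBackend` / `GpuBackend`), the "commitment" is a toy `u64` fold, and `launches` is an invented field that stands in for whatever GPU-side state forces the mutable borrow.

```rust
/// Toy commitment standing in for a real KZG commit.
fn toy_commit(poly: &[u64]) -> u64 {
    poly.iter()
        .fold(0u64, |acc, c| acc.wrapping_mul(31).wrapping_add(*c))
}

/// Simplified stand-in for the PR's `MaybeGPU`: lets the batch commit be
/// specialized per backend. It takes `&mut self` because the GPU side
/// needs to mutate its state.
trait MaybeGPU {
    fn kzg_batch_commit(&mut self, polys: &[Vec<u64>]) -> Vec<u64>;
}

/// Hypothetical CPU backend: stateless, so the mutable borrow is unused.
struct CpuBackend;

/// Hypothetical GPU backend: `launches` stands in for mutable device state.
struct GpuBackend {
    launches: usize,
}

impl MaybeGPU for CpuBackend {
    fn kzg_batch_commit(&mut self, polys: &[Vec<u64>]) -> Vec<u64> {
        polys.iter().map(|p| toy_commit(p)).collect()
    }
}

impl MaybeGPU for GpuBackend {
    fn kzg_batch_commit(&mut self, polys: &[Vec<u64>]) -> Vec<u64> {
        self.launches += 1; // the mutation that forces `&mut self`
        polys.iter().map(|p| toy_commit(p)).collect()
    }
}

/// One shared implementation, parameterized by backend.
struct AdvzInternal<B: MaybeGPU> {
    backend: B,
}

/// The aliases keep the original name usable for the CPU version.
type Advz = AdvzInternal<CpuBackend>;
type AdvzGPU = AdvzInternal<GpuBackend>;

impl<B: MaybeGPU> AdvzInternal<B> {
    /// `&mut self` on the shared API, as in the changed `commit_only`.
    fn commit_only(&mut self, polys: &[Vec<u64>]) -> Vec<u64> {
        self.backend.kzg_batch_commit(polys)
    }
}
```

Callers that previously held an immutable `Advz` now need `let mut`, even though only the `AdvzGPU` path actually writes to its backend.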
Note that for the CPU version, even though we pass in a mutable reference, we never mutate it in the code; only the GPU version requires a mutable reference.
Benchmark
TL;DR: `Vid::commit_only()` improved by ~6.4x, `Vid::disperse()` improved by ~5x.
Note: this speedup is not entirely accurate since my AWS instance has only 4 cores, so the CPU-parallelized part is not very fast. If we run our code on an 8~16 core CPU + 1 GPU, I'm fairly confident our disperse of 33MB takes ~1 sec.
In the following sections, the two groups are separated by a line break: the first group is GPU, the second is CPU. You can probably tell by their numbers as well.
k=256, n=512, m=1, |B| = 33MB
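As a sanity check on these parameters, the trace posted for issue #521 reports 4229 polynomials with 256 coefficients for a 33554432-byte payload. That count is reproduced by the arithmetic below if each field element packs 31 payload bytes. The 31-byte packing is an assumption made for this sketch (it is not stated in this thread), and `num_polys` is a hypothetical helper, not a function from the codebase.

```rust
/// Ceiling division for positive integers.
fn ceil_div(a: usize, b: usize) -> usize {
    (a + b - 1) / b
}

/// Number of degree-(k-1) polynomials needed for a payload, assuming
/// `bytes_per_elem` payload bytes are packed into each field element
/// (31 here is an assumption, not taken from the source).
fn num_polys(payload_bytes: usize, k: usize, bytes_per_elem: usize) -> usize {
    let elems = ceil_div(payload_bytes, bytes_per_elem);
    ceil_div(elems, k)
}
```

For example, `num_polys(33554432, 256, 31)` evaluates to 4229, matching the trace.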
k=32, n=128, m=4, |B| = 33MB
❗ I believe this setup/regime is closer to what we will actually use. With limited CPU cores, our `disperse` takes 1.9 sec.
Before we can merge this PR, please make sure that all the following items have been checked off. If any of the checklist items are not applicable, please leave them but write a little note why.
- `Pending` section in `CHANGELOG.md`
- `Files changed` in the GitHub PR explorer