feat!: GPU accelerated VID disperse

alxiong commented 6 months ago

Description

closes: #521

Major changes include:

Rename original struct Advz to struct AdvzInternal, including all of its implementations
Introduce two type alias type Advz (which is a direct replacement of the original struct with the same interface and generic param) and type AdvzGPU which contains more icicle-related type declaration
Introduce trait MaybeGPU to allow specialized trait bounds on kzg_batch_commit()
- GPU version is enabled by feature flag gpu-vid
- ⚠️ unfortunately, due to suboptimal API design of ICICLE's struct HostOrSlice which affects how we write GPUCommit::commit_on_gpu, we changed the API of VidScheme::commit_only/disperse(&mut self) to accept &mut self instead of the &self. Note that for the CPU version, even though we pass in a mutable reference, we never mutate it in the code! it's only the GPU version that requires a mutable reference.
add profile code for GPU version of the vid

Benchmark

TL;DR: Vid::commit_only() improved by ~6.4x, Vid::disperse() improve by ~5x.

note: this speedup is not entirely accurate since my AWS instance has only 4 cores, thus the CPU parallelized part is not very fast. If we run our code on a 8~16 core CPU + 1 GPU, I'm fairly confident our disperse of 33MB is ~1 sec.

In the following section, I add a line break the first group is GPU, the second is CPU. you should probably tell by their numbers as well.

`k=256, n=512, m=1, |B| = 33MB`

running 1 test
Start:   KZG10::Setup with prover degree 256 and verifier degree 1
··Start:   Generating powers of G
··End:     Generating powers of G ..................................................3.592ms
End:     KZG10::Setup with prover degree 256 and verifier degree 1 .................4.949ms
Start:   encode payload bytes into polynomials
End:     encode payload bytes into polynomials .....................................186.731ms
Start:   batch poly commit
End:     batch poly commit .........................................................402.031ms

Start:   encode payload bytes into polynomials
End:     encode payload bytes into polynomials .....................................165.483ms
Start:   batch poly commit
··Start:   batch commit 4229 polynomials
··End:     batch commit 4229 polynomials ...........................................8.006s
End:     batch poly commit .........................................................8.006s
test vid::advz::tests::commit_only_timer ... ok

running 1 test
Start:   KZG10::Setup with prover degree 256 and verifier degree 1
··Start:   Generating powers of G
··End:     Generating powers of G ..................................................3.726ms
End:     KZG10::Setup with prover degree 256 and verifier degree 1 .................5.150ms
Start:   VID disperse 33554432 payload bytes to 512 nodes
··Start:   encode payload bytes into polynomials
··End:     encode payload bytes into polynomials ...................................172.910ms
··Start:   compute all storage node evals for 4229 polynomials with 256 coefficients
··End:     compute all storage node evals for 4229 polynomials with 256 coefficients 255.951ms
··Start:   compute merkle root of all storage node evals
··End:     compute merkle root of all storage node evals ...........................113.826ms
··Start:   compute 4229 KZG commitments
··End:     compute 4229 KZG commitments ............................................401.086ms
··Start:   compute aggregate proofs for 512 storage nodes
····Start:   compute h_poly
····End:     compute h_poly ........................................................288.535ms
····Start:   gen eval proofs with parallel_factor 2 and num_points 512
····End:     gen eval proofs with parallel_factor 2 and num_points 512 .............96.601ms
··End:     compute aggregate proofs for 512 storage nodes ..........................388.802ms
··Start:   assemble shares for dispersal
··End:     assemble shares for dispersal ...........................................67.736ms
End:     VID disperse 33554432 payload bytes to 512 nodes ..........................1.436s

Start:   VID disperse 33554432 payload bytes to 512 nodes
··Start:   encode payload bytes into polynomials
··End:     encode payload bytes into polynomials ...................................166.105ms
··Start:   compute all storage node evals for 4229 polynomials with 256 coefficients
··End:     compute all storage node evals for 4229 polynomials with 256 coefficients 198.006ms
··Start:   compute merkle root of all storage node evals
··End:     compute merkle root of all storage node evals ...........................94.184ms
··Start:   compute 4229 KZG commitments
····Start:   batch commit 4229 polynomials
····End:     batch commit 4229 polynomials .........................................8.028s
··End:     compute 4229 KZG commitments ............................................8.028s
··Start:   compute aggregate proofs for 512 storage nodes
····Start:   compute h_poly
····End:     compute h_poly ........................................................287.139ms
····Start:   gen eval proofs with parallel_factor 2 and num_points 512
····End:     gen eval proofs with parallel_factor 2 and num_points 512 .............96.493ms
··End:     compute aggregate proofs for 512 storage nodes ..........................387.300ms
··Start:   assemble shares for dispersal
··End:     assemble shares for dispersal ...........................................17.721ms
End:     VID disperse 33554432 payload bytes to 512 nodes ..........................8.928s
test vid::advz::tests::disperse_timer ... ok

`k=32, n=128, m=4, |B| = 33MB`

❗ I believe this setup/regime is closer to what we will actually use. With limited CPU cores, our disperse takes 1.9 sec.

running 1 test
Start:   KZG10::Setup with prover degree 128 and verifier degree 1
··Start:   Generating powers of G
··End:     Generating powers of G ..................................................2.469ms
End:     KZG10::Setup with prover degree 128 and verifier degree 1 .................3.784ms
Start:   encode payload bytes into polynomials
End:     encode payload bytes into polynomials .....................................176.997ms
Start:   batch poly commit
End:     batch poly commit .........................................................736.935ms

Start:   encode payload bytes into polynomials
End:     encode payload bytes into polynomials .....................................168.720ms
Start:   batch poly commit
··Start:   batch commit 8457 polynomials
··End:     batch commit 8457 polynomials ...........................................9.347s
End:     batch poly commit .........................................................9.347s
test vid::advz::tests::commit_only_timer ... ok

running 1 test
Start:   KZG10::Setup with prover degree 128 and verifier degree 1
··Start:   Generating powers of G
··End:     Generating powers of G ..................................................2.739ms
End:     KZG10::Setup with prover degree 128 and verifier degree 1 .................4.159ms
Start:   VID disperse 33554432 payload bytes to 128 nodes
··Start:   encode payload bytes into polynomials
··End:     encode payload bytes into polynomials ...................................179.144ms
··Start:   compute all storage node evals for 8457 polynomials with 128 coefficients
··End:     compute all storage node evals for 8457 polynomials with 128 coefficients 521.122ms
··Start:   compute merkle root of all storage node evals
··End:     compute merkle root of all storage node evals ...........................226.218ms
··Start:   compute 8457 KZG commitments
··End:     compute 8457 KZG commitments ............................................736.033ms
··Start:   compute aggregate proofs for 128 storage nodes
····Start:   compute h_poly
····End:     compute h_poly ........................................................86.436ms
····Start:   gen eval proofs with parallel_factor 4 and num_points 512
····End:     gen eval proofs with parallel_factor 4 and num_points 512 .............78.720ms
··End:     compute aggregate proofs for 128 storage nodes ..........................168.839ms
··Start:   assemble shares for dispersal
··End:     assemble shares for dispersal ...........................................81.293ms
End:     VID disperse 33554432 payload bytes to 128 nodes ..........................1.949s

Start:   VID disperse 33554432 payload bytes to 128 nodes
··Start:   encode payload bytes into polynomials
··End:     encode payload bytes into polynomials ...................................161.359ms
··Start:   compute all storage node evals for 8457 polynomials with 128 coefficients
··End:     compute all storage node evals for 8457 polynomials with 128 coefficients 405.542ms
··Start:   compute merkle root of all storage node evals
··End:     compute merkle root of all storage node evals ...........................178.632ms
··Start:   compute 8457 KZG commitments
····Start:   batch commit 8457 polynomials
····End:     batch commit 8457 polynomials .........................................9.348s
··End:     compute 8457 KZG commitments ............................................9.348s
··Start:   compute aggregate proofs for 128 storage nodes
····Start:   compute h_poly
····End:     compute h_poly ........................................................86.483ms
····Start:   gen eval proofs with parallel_factor 4 and num_points 512
····End:     gen eval proofs with parallel_factor 4 and num_points 512 .............78.709ms
··End:     compute aggregate proofs for 128 storage nodes ..........................168.857ms
··Start:   assemble shares for dispersal
··End:     assemble shares for dispersal ...........................................45.354ms
End:     VID disperse 33554432 payload bytes to 128 nodes ..........................10.344s
test vid::advz::tests::disperse_timer ... ok

Before we can merge this PR, please make sure that all the following items have been checked off. If any of the checklist items are not applicable, please leave them but write a little note why.

[x] Targeted PR against correct branch (main)
[x] Linked to GitHub issue with discussion and accepted design OR have an explanation in the PR that describes this work.
[x] ~Wrote unit tests~
[x] Updated relevant documentation in the code
[ ] Added a relevant changelog entry to the Pending section in CHANGELOG.md
[ ] Re-reviewed Files changed in the GitHub PR explorer

mrain commented 6 months ago

I was flooded by profiling the advz dispersal cargo test --release -p jf-primitives --features test-srs,icicle,kzg-print-trace,parallel,gpu-vid -- disperse_timer --ignored --nocapture

It's calling batch_commit which launches many commits in parallel, and each of them will start a timer in a nested manner b/c kzg-print-trace feature is on. Should we remove the inner timers in commit, and add cfg feature flags in batch_commit timers?

nvm we shouldn't have kzg-print-trace on

mrain commented 6 months ago

My local run results are listed below. Batch commit performance is still not perfect. Will check later

Start:   KZG10::Setup with prover degree 256 and verifier degree 1
··Start:   Generating powers of G
··End:     Generating powers of G ..................................................3.119ms
End:     KZG10::Setup with prover degree 256 and verifier degree 1 .................4.762ms
Start:   VID disperse 33554432 payload bytes to 512 nodes
··Start:   encode payload bytes into polynomials
··End:     encode payload bytes into polynomials ...................................67.697ms
··Start:   compute all storage node evals for 4229 polynomials with 256 coefficients
··End:     compute all storage node evals for 4229 polynomials with 256 coefficients 147.929ms
··Start:   compute merkle root of all storage node evals
··End:     compute merkle root of all storage node evals ...........................74.827ms
··Start:   compute 4229 KZG commitments
··End:     compute 4229 KZG commitments ............................................138.304ms
··Start:   compute aggregate proofs for 512 storage nodes
····Start:   compute h_poly
····End:     compute h_poly ........................................................53.948ms
····Start:   gen eval proofs with parallel_factor 2 and num_points 512
····End:     gen eval proofs with parallel_factor 2 and num_points 512 .............50.675ms
··End:     compute aggregate proofs for 512 storage nodes ..........................106.559ms
··Start:   assemble shares for dispersal
··End:     assemble shares for dispersal ...........................................37.821ms
End:     VID disperse 33554432 payload bytes to 512 nodes ..........................603.895ms
Start:   VID disperse 33554432 payload bytes to 512 nodes
··Start:   encode payload bytes into polynomials
··End:     encode payload bytes into polynomials ...................................28.887ms
··Start:   compute all storage node evals for 4229 polynomials with 256 coefficients
··End:     compute all storage node evals for 4229 polynomials with 256 coefficients 56.072ms
··Start:   compute merkle root of all storage node evals
··End:     compute merkle root of all storage node evals ...........................66.781ms
··Start:   compute 4229 KZG commitments
····Start:   batch commit 4229 polynomials
····End:     batch commit 4229 polynomials .........................................1.532s
··End:     compute 4229 KZG commitments ............................................1.532s
··Start:   compute aggregate proofs for 512 storage nodes
····Start:   compute h_poly
····End:     compute h_poly ........................................................58.700ms
····Start:   gen eval proofs with parallel_factor 2 and num_points 512
····End:     gen eval proofs with parallel_factor 2 and num_points 512 .............49.792ms
··End:     compute aggregate proofs for 512 storage nodes ..........................110.447ms
··Start:   assemble shares for dispersal
··End:     assemble shares for dispersal ...........................................14.715ms
End:     VID disperse 33554432 payload bytes to 512 nodes ..........................1.836s

EspressoSystems / jellyfish