huitseeker opened this issue 11 months ago
These are the latest benchmarks from lurk-rs (note: we use lurk-rs because it directly represents the optimal performance we want to target).
```
Beginning proof... (rc = 100)
nova::RecursiveSNARK::prove_step 1.491457s ├───────────────────────────────────────────────────────────────┤
<MultiFrame as StepCircuit>::synthesize 421.076ms ├────────────────┤
<_ as Group>::vartime_multiscalar_mul 437.97ms ├─────────────────┤
NIFS::prove 600.238ms ├────────────────────────┤
AZ_1, BZ_1, CZ_1 204.777ms ├───────┤
AZ_2, BZ_2, CZ_2 84.483ms ├──┤
cross terms 22.91ms │
T 7.051ms ┆
<_ as Group>::vartime_multiscalar_mul 265.757ms ├──────────┤
Congratulations! You proved and verified a SHA256 hash calculation!
```
```
Beginning proof... (rc = 1000)
nova::RecursiveSNARK::prove_step 31.856102s ├───────────────────────────────────────────────────────────────┤
<MultiFrame as StepCircuit>::synthesize 4.161261s ├──────┤
<_ as Group>::vartime_multiscalar_mul 3.931264s ├──────┤
NIFS::prove 23.60857s ├──────────────────────────────────────────────┤
AZ_1, BZ_1, CZ_1 12.360517s ├───────────────────────┤
AZ_2, BZ_2, CZ_2 2.376828s ├───┤
cross terms 1.02214s ├┤
T 437.983ms │
<_ as Group>::vartime_multiscalar_mul 6.796674s ├────────────┤
Congratulations! You proved and verified a SHA256 hash calculation!
```
Building on what @huitseeker has already pointed out about `AZ_1, BZ_1, CZ_1` and `AZ_2, BZ_2, CZ_2`, and the sparse matrix considerations: compare the `rc=100` and `rc=1000` runs. `rc` is just a parameter in Lurk; circuit size increases linearly with `rc`. The `AZ_1, BZ_1, CZ_1` and `AZ_2, BZ_2, CZ_2` terms take up a significantly larger share of the computation when the circuit is large. In fact, the computation at `rc=1000` is dominated by the folding step, even though in theory it should be dominated by the MSM. The combined `AZ_1, BZ_1, CZ_1` and `AZ_2, BZ_2, CZ_2` time increases by ~50x (from ~300ms to ~14s), disproportionate to the 10x increase in circuit size.

To reproduce these runs, clone the lurk-rs repo, check out https://github.com/lurk-lab/lurk-rs/tree/spmvm-benchmarks, and then run `RUST_LOG=info cargo run --release --example one_iteration`.
The pipeline could be modified to parallelize the work between the GPU and the CPU: the commitment to $W$ (via the GPU) and the computation of $T$ (via the CPU) can happen in parallel. Only the commitment to $T$ still needs to wait for the computation of $T$ to finish.
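As a sketch of that overlap, here is one hypothetical arrangement using scoped threads. `commit_w` and `compute_t` are dummy stand-ins (not Arecibo functions) for the real MSM and cross-term routines, and `u64` stands in for field/group elements:

```rust
use std::thread;

// Placeholder types; in the real prover these are field elements and group points.
type Scalar = u64;
type Commitment = u64;

// Dummy stand-in for the (GPU-accelerated) MSM committing to a vector.
fn commit_w(w: &[Scalar]) -> Commitment {
    w.iter().sum()
}

// Dummy stand-in for the (CPU) cross-term computation.
fn compute_t(z1: &[Scalar], z2: &[Scalar]) -> Vec<Scalar> {
    z1.iter().zip(z2).map(|(a, b)| a * b).collect()
}

fn prove_step_parallel(w: &[Scalar], z1: &[Scalar], z2: &[Scalar]) -> (Commitment, Commitment) {
    // The two independent pieces run concurrently:
    // the "GPU lane" commits to W while the "CPU lane" computes T.
    let (comm_w, t) = thread::scope(|s| {
        let handle = s.spawn(|| commit_w(w)); // GPU lane
        let t = compute_t(z1, z2);            // CPU lane
        (handle.join().unwrap(), t)
    });
    // Only commit(T) must wait for T to be available.
    let comm_t = commit_w(&t);
    (comm_w, comm_t)
}
```

The point of the structure is that the synchronization barrier sits just before `commit_w(&t)`, not before the commitment to $W$.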
Background:
Both the Nova project and our fork, Arecibo, consider the performance of folding to be critical. During our recent analysis, we noticed that the commitment to the cross-terms is a significant component of folding performance. This observation holds true even when, for large enough operations, the costs are primarily dominated by an MSM operation, as predicted by theoretical analysis.
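For context, the cross-term being committed is, following the Nova paper (with $(u_i, Z_i)$ the two relaxed R1CS instance-witness pairs being folded and $\circ$ the entrywise product):

$$
T = A Z_1 \circ B Z_2 + A Z_2 \circ B Z_1 - u_1 \cdot (C Z_2) - u_2 \cdot (C Z_1)
$$

Computing $T$ requires the matrix-vector products $A Z_i$, $B Z_i$, $C Z_i$, and committing to it requires an MSM whose size scales with the number of constraints, which is why both pieces grow with circuit size.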
Findings:
The texray graphs showcasing our investigation's results are the benchmark traces above.
@winston-h-zhang to provide more information on reproduction here
The commitment to cross-terms is evident at the following location in the codebase: arecibo/src/nifs.rs#L52C1-L53
The primary computational complexity in this function arises when computing a matrix-vector product between the R1CS matrices and an incoming instance-witness pair. This is visible at: arecibo/src/r1cs/mod.rs#L285-L301
The matrix-vector product terms are represented as AZ_1, BZ_1, CZ_1 and AZ_2, BZ_2, CZ_2 in the texray graph.
To optimize the operation's speed, we transitioned the matrix representation from COO to CSR given that they are represented in sparse form.
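A minimal sketch of the COO-to-CSR idea, using `u64` in place of field elements (illustrative only, not Arecibo's actual types): CSR stores a row-pointer array plus column indices and values, so a matrix-vector product walks each row's entries contiguously instead of scanning `(row, col, val)` triples.

```rust
/// CSR sparse matrix with u64 entries standing in for field elements.
struct Csr {
    row_ptr: Vec<usize>, // row_ptr[i]..row_ptr[i+1] indexes row i's entries
    col_idx: Vec<usize>,
    vals: Vec<u64>,
}

impl Csr {
    /// Build from COO triples, assumed sorted by row.
    fn from_coo(n_rows: usize, coo: &[(usize, usize, u64)]) -> Self {
        let mut row_ptr = vec![0usize; n_rows + 1];
        // Count entries per row, then prefix-sum into row pointers.
        for &(r, _, _) in coo {
            row_ptr[r + 1] += 1;
        }
        for i in 0..n_rows {
            row_ptr[i + 1] += row_ptr[i];
        }
        Csr {
            row_ptr,
            col_idx: coo.iter().map(|&(_, c, _)| c).collect(),
            vals: coo.iter().map(|&(_, _, v)| v).collect(),
        }
    }

    /// Sparse matrix-vector product, e.g. computing A·z.
    fn mul_vec(&self, z: &[u64]) -> Vec<u64> {
        (0..self.row_ptr.len() - 1)
            .map(|i| {
                (self.row_ptr[i]..self.row_ptr[i + 1])
                    .map(|k| self.vals[k] * z[self.col_idx[k]])
                    .sum()
            })
            .collect()
    }
}
```

The cache-friendly, row-contiguous traversal in `mul_vec` is the payoff of the CSR layout for the `AZ`, `BZ`, `CZ` products.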
Challenges:
While the MSM operation can be GPU-accelerated (as seen in the pasta-msm project), the field multiplications involved in the matrix-vector product are currently not.
Proposed Solution:
It's imperative to accelerate these field multiplications to achieve optimal performance for the folding operation.
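One reason these field multiplications are a good acceleration target: the per-row work in the matrix-vector products is fully independent, which is exactly what a GPU kernel (or, as a CPU-side stopgap, thread-level parallelism) exploits. A hedged sketch of the row-chunked split, again with `u64` stand-ins for field elements and a simplified entrywise product as the per-element work:

```rust
use std::thread;

// Split an entrywise product y[i] = a[i] * z[i] across threads.
// In the real setting the per-row work is a sparse dot product over field
// elements; the independence across rows is what makes GPU offload attractive.
fn parallel_mul(a: &[u64], z: &[u64], n_threads: usize) -> Vec<u64> {
    let n = a.len();
    let chunk = ((n + n_threads - 1) / n_threads).max(1);
    let mut out = vec![0u64; n];
    thread::scope(|s| {
        for (a_c, (z_c, o_c)) in a
            .chunks(chunk)
            .zip(z.chunks(chunk).zip(out.chunks_mut(chunk)))
        {
            // Each thread owns a disjoint chunk of the output.
            s.spawn(move || {
                for ((o, x), y) in o_c.iter_mut().zip(a_c).zip(z_c) {
                    *o = x * y;
                }
            });
        }
    });
    out
}
```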