huitseeker opened this issue 11 months ago
These are the latest benchmarks from lurk-rs (note: we use lurk-rs because it directly represents the optimal performance we want to target).
```
Beginning proof... (rc = 100)
nova::RecursiveSNARK::prove_step 1.491457s ├───────────────────────────────────────────────────────────────┤
<MultiFrame as StepCircuit>::synthesize 421.076ms ├────────────────┤
<_ as Group>::vartime_multiscalar_mul 437.97ms ├─────────────────┤
NIFS::prove 600.238ms ├────────────────────────┤
AZ_1, BZ_1, CZ_1 204.777ms ├───────┤
AZ_2, BZ_2, CZ_2 84.483ms ├──┤
cross terms 22.91ms │
T 7.051ms ┆
<_ as Group>::vartime_multiscalar_mul 265.757ms ├──────────┤
Congratulations! You proved and verified a SHA256 hash calculation!
```
```
Beginning proof... (rc = 1000)
nova::RecursiveSNARK::prove_step 31.856102s ├───────────────────────────────────────────────────────────────┤
<MultiFrame as StepCircuit>::synthesize 4.161261s ├──────┤
<_ as Group>::vartime_multiscalar_mul 3.931264s ├──────┤
NIFS::prove 23.60857s ├──────────────────────────────────────────────┤
AZ_1, BZ_1, CZ_1 12.360517s ├───────────────────────┤
AZ_2, BZ_2, CZ_2 2.376828s ├───┤
cross terms 1.02214s ├┤
T 437.983ms │
<_ as Group>::vartime_multiscalar_mul 6.796674s ├────────────┤
Congratulations! You proved and verified a SHA256 hash calculation!
```
Building on what @huitseeker has already pointed out about `AZ_1, BZ_1, CZ_1` and `AZ_2, BZ_2, CZ_2`, and the sparse matrix considerations: compare the `rc=100` and `rc=1000` runs. `rc` is just a parameter in Lurk; circuit size increases linearly with `rc`. The `AZ_1, BZ_1, CZ_1` and `AZ_2, BZ_2, CZ_2` terms take up a significantly larger share of the computation when the circuit is large. In fact, the computation at `rc=1000` is dominated by the folding step, even though in theory it should be dominated by the MSM. The combined `AZ_1, BZ_1, CZ_1` and `AZ_2, BZ_2, CZ_2` time increases by ~50x (from ~300ms to ~14s), disproportionate to the 10x increase in circuit size.

To reproduce these runs, clone the lurk-rs repo, check out https://github.com/lurk-lab/lurk-rs/tree/spmvm-benchmarks, and then run `RUST_LOG=info cargo run --release --example one_iteration`.
The pipeline could be modified to parallelize the work between the GPU and the CPU: the commitment to $W$ (via the GPU) and the computation of $T$ (via the CPU) can happen in parallel. Only the commitment to $T$ still needs to wait for the computation of $T$ to finish.
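As a sketch of that overlap, here is one hypothetical arrangement using scoped threads. `commit_w` and `compute_t` are dummy stand-ins (not Arecibo functions) for the real MSM and cross-term routines, and `u64` stands in for field/group elements:

```rust
use std::thread;

// Placeholder types; in the real prover these are field elements and group points.
type Scalar = u64;
type Commitment = u64;

// Dummy stand-in for the (GPU-accelerated) MSM committing to a vector.
fn commit_w(w: &[Scalar]) -> Commitment {
    w.iter().sum()
}

// Dummy stand-in for the (CPU) cross-term computation.
fn compute_t(z1: &[Scalar], z2: &[Scalar]) -> Vec<Scalar> {
    z1.iter().zip(z2).map(|(a, b)| a * b).collect()
}

fn prove_step_parallel(w: &[Scalar], z1: &[Scalar], z2: &[Scalar]) -> (Commitment, Commitment) {
    // The two independent pieces run concurrently:
    // the "GPU lane" commits to W while the "CPU lane" computes T.
    let (comm_w, t) = thread::scope(|s| {
        let handle = s.spawn(|| commit_w(w)); // GPU lane
        let t = compute_t(z1, z2);            // CPU lane
        (handle.join().unwrap(), t)
    });
    // Only commit(T) must wait for T to be available.
    let comm_t = commit_w(&t);
    (comm_w, comm_t)
}
```

The point of the structure is that the synchronization barrier sits just before `commit_w(&t)`, not before the commitment to $W$.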
Background:
Both the Nova project and our fork, Arecibo, consider the performance of folding to be critical. During our recent analysis, we noticed that the commitment to the cross-terms is a significant component of folding performance. This observation holds true even when, for large enough operations, the costs are primarily dominated by an MSM operation, as predicted by theoretical analysis.
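For context, the cross-term being committed is, following the Nova paper (with $(u_i, Z_i)$ the two relaxed R1CS instance-witness pairs being folded and $\circ$ the entrywise product):

$$
T = A Z_1 \circ B Z_2 + A Z_2 \circ B Z_1 - u_1 \cdot (C Z_2) - u_2 \cdot (C Z_1)
$$

Computing $T$ requires the matrix-vector products $A Z_i$, $B Z_i$, $C Z_i$, and committing to it requires an MSM whose size scales with the number of constraints, which is why both pieces grow with circuit size.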
Findings:
The texray graphs showcasing our investigation's results are the benchmark traces above.
@winston-h-zhang to provide more information on reproduction here
The commitment to cross-terms is evident at the following location in the codebase: arecibo/src/nifs.rs#L52C1-L53
The primary computational complexity in this function arises when computing a matrix-vector product between the R1CS matrices and an incoming instance-witness pair. This is visible at: arecibo/src/r1cs/mod.rs#L285-L301
The matrix-vector product terms are represented as AZ_1, BZ_1, CZ_1 and AZ_2, BZ_2, CZ_2 in the texray graph.
To optimize the operation's speed, we transitioned the matrix representation from COO to CSR given that they are represented in sparse form.
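A minimal sketch of the COO-to-CSR idea, using `u64` in place of field elements (illustrative only, not Arecibo's actual types): CSR stores a row-pointer array plus column indices and values, so a matrix-vector product walks each row's entries contiguously instead of scanning `(row, col, val)` triples.

```rust
/// CSR sparse matrix with u64 entries standing in for field elements.
struct Csr {
    row_ptr: Vec<usize>, // row_ptr[i]..row_ptr[i+1] indexes row i's entries
    col_idx: Vec<usize>,
    vals: Vec<u64>,
}

impl Csr {
    /// Build from COO triples, assumed sorted by row.
    fn from_coo(n_rows: usize, coo: &[(usize, usize, u64)]) -> Self {
        let mut row_ptr = vec![0usize; n_rows + 1];
        // Count entries per row, then prefix-sum into row pointers.
        for &(r, _, _) in coo {
            row_ptr[r + 1] += 1;
        }
        for i in 0..n_rows {
            row_ptr[i + 1] += row_ptr[i];
        }
        Csr {
            row_ptr,
            col_idx: coo.iter().map(|&(_, c, _)| c).collect(),
            vals: coo.iter().map(|&(_, _, v)| v).collect(),
        }
    }

    /// Sparse matrix-vector product, e.g. computing A·z.
    fn mul_vec(&self, z: &[u64]) -> Vec<u64> {
        (0..self.row_ptr.len() - 1)
            .map(|i| {
                (self.row_ptr[i]..self.row_ptr[i + 1])
                    .map(|k| self.vals[k] * z[self.col_idx[k]])
                    .sum()
            })
            .collect()
    }
}
```

The cache-friendly, row-contiguous traversal in `mul_vec` is the payoff of the CSR layout for the `AZ`, `BZ`, `CZ` products.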
Challenges:
While the MSM operation can be GPU-accelerated (as seen in the pasta-msm project), the field multiplications involved in the matrix-vector product are currently not.
Proposed Solution:
It's imperative to accelerate these field multiplications to achieve optimal performance for the folding operation.
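One reason these field multiplications are a good acceleration target: the per-row work in the matrix-vector products is fully independent, which is exactly what a GPU kernel (or, as a CPU-side stopgap, thread-level parallelism) exploits. A hedged sketch of the row-chunked split, again with `u64` stand-ins for field elements and a simplified entrywise product as the per-element work:

```rust
use std::thread;

// Split an entrywise product y[i] = a[i] * z[i] across threads.
// In the real setting the per-row work is a sparse dot product over field
// elements; the independence across rows is what makes GPU offload attractive.
fn parallel_mul(a: &[u64], z: &[u64], n_threads: usize) -> Vec<u64> {
    let n = a.len();
    let chunk = ((n + n_threads - 1) / n_threads).max(1);
    let mut out = vec![0u64; n];
    thread::scope(|s| {
        for (a_c, (z_c, o_c)) in a
            .chunks(chunk)
            .zip(z.chunks(chunk).zip(out.chunks_mut(chunk)))
        {
            // Each thread owns a disjoint chunk of the output.
            s.spawn(move || {
                for ((o, x), y) in o_c.iter_mut().zip(a_c).zip(z_c) {
                    *o = x * y;
                }
            });
        }
    });
    out
}
```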