argumentcomputer / arecibo

An advanced fork of Nova (contact:@huitseeker)
https://lurk-lang.org/
MIT License

Acceleration of SpMVM in folding #75

Open huitseeker opened 11 months ago

huitseeker commented 11 months ago

Background:

Both the Nova project and our fork, Arecibo, consider folding performance critical. During our recent analysis, we noticed that the commitment to cross-terms is a significant component of the folding cost. This holds even for operations large enough that, as theoretical analysis predicts, the cost is primarily dominated by an MSM.

Findings:

The texray graph showcasing our investigation's results is as follows:

RecursiveSNARK::prove_step                       5s     ├───────────────────────────────────────────────────────────────┤
    <MultiFrame as StepCircuit>::synthesize      1s     ├────────────┤
    <_ as Group>::vartime_multiscalar_mul      947ms                  ├─────────┤
    NIFS::prove                                  3s                              ├─────────────────────────────────────┤
        AZ_1, BZ_1, CZ_1                         1s                              ├─────────────┤
        AZ_2, BZ_2, CZ_2                       767ms                                             ├───────┤
        cross terms                            263ms                                                      ├─┤
        T                                      202ms                                                         ├┤
        <_ as Group>::vartime_multiscalar_mul  674ms                                                           ├──────┤
    <_ as Group>::vartime_multiscalar_mul        5ms                                                                    ┆

@winston-h-zhang to provide more information on reproduction here
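
For readers mapping the trace labels to the math, here is the cross-term as defined for folding two relaxed R1CS instances in the Nova paper. This is a reference sketch: the notation ($Z_i$ for the full assignment of instance $i$, $u_i$ for its relaxation scalar, $\circ$ for the entry-wise product) is assumed here, and the correspondence with the `AZ_i, BZ_i, CZ_i`, `cross terms` and `T` spans is inferred from their names rather than confirmed in this thread.

$$
T = AZ_1 \circ BZ_2 + AZ_2 \circ BZ_1 - u_1 \cdot CZ_2 - u_2 \cdot CZ_1
$$

Each of $AZ_i$, $BZ_i$, $CZ_i$ is a sparse matrix-vector product over the scalar field, so $T$ needs six such products before the final MSM that commits to it.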

Challenges:

While the MSM operation can be GPU-accelerated (as seen in the pasta-msm project), the field multiplications involved in the matrix-vector product are currently not.

Proposed Solution:

We should accelerate the field multiplications in the sparse matrix-vector products (SpMVM), as is already done for the MSM, so that they stop limiting the performance of the folding operation.
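
To make the target concrete, here is a minimal sketch of a sparse matrix-vector product over a prime field, with the matrix in compressed sparse row (CSR) form. Everything in it is an assumption for illustration: the `CsrMatrix` type and its field names are hypothetical, and a toy 61-bit prime stands in for the real curve scalar field. It is not the Arecibo implementation.

```rust
/// Toy prime modulus (2^61 - 1), standing in for the real scalar field.
const P: u64 = (1u64 << 61) - 1;

/// Field multiplication on reduced inputs, widened to u128 to avoid overflow.
fn mul_mod(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Field addition on reduced inputs.
fn add_mod(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % P as u128) as u64
}

/// Hypothetical sparse matrix in CSR form.
struct CsrMatrix {
    /// Non-zero values, stored row by row.
    data: Vec<u64>,
    /// Column index of each entry in `data`.
    indices: Vec<usize>,
    /// `indptr[i]..indptr[i + 1]` is the range of row `i` inside `data`.
    indptr: Vec<usize>,
}

impl CsrMatrix {
    /// Computes `M * z`: one field multiplication per non-zero entry.
    fn multiply_vec(&self, z: &[u64]) -> Vec<u64> {
        self.indptr
            .windows(2)
            .map(|w| {
                let (start, end) = (w[0], w[1]);
                self.data[start..end]
                    .iter()
                    .zip(&self.indices[start..end])
                    .fold(0u64, |acc, (&val, &col)| add_mod(acc, mul_mod(val, z[col])))
            })
            .collect()
    }
}
```

Each output row depends only on its own slice of `data` and `indices`, so the rows can be computed independently. That independence is what makes this loop a candidate for multi-threading on the CPU or for a GPU kernel.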

winston-h-zhang commented 11 months ago

These are the latest benchmarks from lurk-rs (note: we use lurk-rs because it directly represents the optimal performance we want to target).

Texray Graphs

Beginning proof... (rc = 100)
  nova::RecursiveSNARK::prove_step           1.491457s ├───────────────────────────────────────────────────────────────┤
    <MultiFrame as StepCircuit>::synthesize  421.076ms  ├────────────────┤
    <_ as Group>::vartime_multiscalar_mul     437.97ms                    ├─────────────────┤
    NIFS::prove                              600.238ms                                       ├────────────────────────┤
      AZ_1, BZ_1, CZ_1                       204.777ms                                       ├───────┤
      AZ_2, BZ_2, CZ_2                        84.483ms                                                ├──┤
      cross terms                              22.91ms                                                    │
      T                                        7.051ms                                                     ┆
      <_ as Group>::vartime_multiscalar_mul  265.757ms                                                     ├──────────┤
Congratulations! You proved and verified a SHA256 hash calculation!

Beginning proof... (rc = 1000)
  nova::RecursiveSNARK::prove_step          31.856102s ├───────────────────────────────────────────────────────────────┤
    <MultiFrame as StepCircuit>::synthesize  4.161261s ├──────┤
    <_ as Group>::vartime_multiscalar_mul    3.931264s          ├──────┤
    NIFS::prove                              23.60857s                  ├──────────────────────────────────────────────┤
      AZ_1, BZ_1, CZ_1                      12.360517s                  ├───────────────────────┤
      AZ_2, BZ_2, CZ_2                       2.376828s                                           ├───┤
      cross terms                             1.02214s                                                ├┤
      T                                      437.983ms                                                  │
      <_ as Group>::vartime_multiscalar_mul  6.796674s                                                   ├────────────┤
Congratulations! You proved and verified a SHA256 hash calculation!

Findings

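These runs build on what @huitseeker has already pointed out: the sparse matrix-vector products (AZ_1, BZ_1, CZ_1 and AZ_2, BZ_2, CZ_2) take a large share of NIFS::prove, roughly 0.29 s of 0.60 s at rc = 100 and 14.7 s of 23.6 s at rc = 1000.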

Reproduce

To reproduce these runs, clone the lurk-rs repo and check out the spmvm-benchmarks branch (https://github.com/lurk-lab/lurk-rs/tree/spmvm-benchmarks). Then run: RUST_LOG=info cargo run --release --example one_iteration

adr1anh commented 9 months ago

The pipeline could be modified to parallelize the work between GPU and CPU.

The commitment to $W$ via the GPU and the computation of $T$ via the CPU can happen in parallel. The commitment to $T$ still needs to wait for the computation of $T$ to finish.
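
A minimal sketch of that ordering, using scoped threads from the Rust standard library. The functions `commit` and `compute_cross_term` are hypothetical placeholders (a cheap hash and an entry-wise square) rather than a real MSM or the real cross-term computation, and a plain CPU thread stands in for the GPU. Only the dependency structure is meant to carry over.

```rust
use std::thread;

/// Placeholder for a commitment (an MSM in the real prover, possibly on the GPU).
fn commit(v: &[u64]) -> u64 {
    v.iter()
        .fold(0u64, |acc, &x| acc.wrapping_mul(31).wrapping_add(x))
}

/// Placeholder for the CPU-side computation of the cross-term T.
fn compute_cross_term(w: &[u64]) -> Vec<u64> {
    w.iter().map(|&x| x.wrapping_mul(x)).collect()
}

/// Returns (commitment to W, commitment to T).
fn fold_step(w: &[u64]) -> (u64, u64) {
    let (comm_w, t) = thread::scope(|s| {
        // Start the commitment to W on a separate worker.
        let comm_w_handle = s.spawn(|| commit(w));
        // Meanwhile, compute T on the current thread.
        let t = compute_cross_term(w);
        (comm_w_handle.join().expect("commitment worker panicked"), t)
    });
    // The commitment to T can only start once T is available.
    let comm_t = commit(&t);
    (comm_w, comm_t)
}
```

With this arrangement the wall-clock cost of the first two stages is the maximum of the two rather than their sum. The commitment to $T$ remains on the critical path.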