dalek-cryptography / curve25519-dalek

A pure-Rust implementation of group operations on Ristretto and Curve25519

Interest in GPU-accelerated scalar multiplications? #506

Open rickwebiii opened 1 year ago

rickwebiii commented 1 year ago

A ZKP system we're implementing internally uses a modified Bulletproofs inner product proof. We want to make it faster, so we implemented scalar multiplication (SM) in Metal (OpenCL/Vulkan coming soon). This operation accounts for 75% of our proof's runtime. Is there any interest in merging these changes back upstream?

Note that we've sped up computing many independent scalar multiplications in parallel, not a multi-scalar multiplication (MSM), though we may do that in the future as well.
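For concreteness, here is a rough sketch of that distinction in terms of the existing public curve25519-dalek API (the function names are mine, and this is plain sequential CPU code, not the GPU path):

```rust
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::scalar::Scalar;
use curve25519_dalek::traits::VartimeMultiscalarMul;

/// Many independent products k_i * P_i -- the operation this issue
/// parallelizes on the GPU (each output is a separate point).
fn batched_scalar_mul(scalars: &[Scalar], points: &[EdwardsPoint]) -> Vec<EdwardsPoint> {
    scalars.iter().zip(points.iter()).map(|(k, p)| k * p).collect()
}

/// A single multi-scalar multiplication, the sum over i of k_i * P_i
/// (one output point) -- not covered by this work yet.
fn msm(scalars: &[Scalar], points: &[EdwardsPoint]) -> EdwardsPoint {
    EdwardsPoint::vartime_multiscalar_mul(scalars.iter(), points.iter())
}
```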

The basic idea

Performance

So far we've only run this on Mac hardware (which is what Metal supports) and have pretty good results. On an entry-level 2020 M1 MacBook Air, the integrated GPU achieves 431k SM/s, while using every core of the CPU gets you only 150k SM/s (64-bit backend). For comparison, our measurements indicate that the 64-bit backend on aarch64 is slightly faster than the AVX2 backend running on a c6 AWS x86_64 instance (single-core performance).

We've also run this benchmark on a co-worker's M2 Max with a 30-core GPU, which performs 1.8M SM/s. A low-power laptop GPU beats a 64-core AWS instance (AVX2 backend) by 1.8x.
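The benchmark harness isn't shown in this issue; a minimal CPU-side sketch of how an SM/s figure like the ones above could be measured (rayon, rand, and the batch size are my assumptions) might look like:

```rust
use std::time::Instant;

use curve25519_dalek::constants::ED25519_BASEPOINT_POINT;
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::scalar::Scalar;
use rand::rngs::OsRng;
use rayon::prelude::*;

fn main() {
    // Batch size is arbitrary; just large enough to keep every core busy.
    let n = 100_000;
    let scalars: Vec<Scalar> = (0..n).map(|_| Scalar::random(&mut OsRng)).collect();
    let points: Vec<EdwardsPoint> = vec![ED25519_BASEPOINT_POINT; n];

    let start = Instant::now();
    // One independent scalar multiplication per element, spread across cores.
    let _products: Vec<EdwardsPoint> = scalars
        .par_iter()
        .zip(points.par_iter())
        .map(|(k, p)| k * p)
        .collect();
    let elapsed = start.elapsed().as_secs_f64();

    println!("{:.0} SM/s across all cores", n as f64 / elapsed);
}
```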

Work remaining

Why you might not want this

At the very least, these GPU backends should be enabled under a feature. While the GPU shaders are transliterated from Rust code, they aren't Rust and don't have the safety guarantees.
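A cargo feature gate along these lines could keep the GPU path strictly opt-in (the feature and module names here are hypothetical, not part of the crate):

```rust
// Hypothetical feature gating: the Metal backend and its entry point only
// compile when the (made-up) `gpu-metal` feature is enabled; without it,
// nothing about the crate changes.
#[cfg(feature = "gpu-metal")]
mod metal_backend;

#[cfg(feature = "gpu-metal")]
pub use metal_backend::batch_scalar_mul;
```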

If it's not appropriate to integrate this work into curve25519-dalek, we need an API for accessing the currently internal Field members on EdwardsPoint and an API for converting/accessing Scalars as Scalar29.

tarcieri commented 1 year ago

> Is there any interest in merging these changes back upstream?

The main thing that seems tricky about upstreaming this is the lack of a CI story. Is there any way to test it in an environment like GitHub Actions without access to a hardware GPU?

> If it's not appropriate to integrate this work into curve25519-dalek, we need an API for accessing the currently internal Field members on EdwardsPoint and an API for converting/accessing Scalars as Scalar29.

These types are both deliberately kept out of the public API as they're easily misused internals. However, we have discussed exposing at least FieldElement under a special feature like hazmat.
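As a sketch of what that could look like (the `hazmat` feature name comes from the comment above; the module path and re-export are illustrative, not an actual crate API):

```rust
// Illustrative only: gate the internal field type behind an explicitly
// scary-sounding feature so it never appears in the default public API.
#[cfg(feature = "hazmat")]
pub mod hazmat {
    /// Re-export of the backend field element type; easy to misuse.
    pub use crate::field::FieldElement;
}
```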

rickwebiii commented 1 year ago

> The main thing that seems tricky about upstreaming this is the lack of a CI story. Is there any way to test it in an environment like GitHub Actions without access to a hardware GPU?

If you use a self-hosted runner, you can have whatever hardware you want, but you have to pay the maintenance costs that come with that. We have a work item on our end to address this for our use case, so maybe we could try it out and let you know how annoying (or not) it is.

> These types are both deliberately kept out of the public API as they're easily misused internals. However, we have discussed exposing at least FieldElement under a special feature like hazmat.

As a user, I've appreciated not having to deal with these internals by accident. A hazmat feature is a clean way to handle this.

tarcieri commented 1 year ago

> If you use a self-hosted runner, you can have whatever hardware you want, but you have to pay the maintenance costs that come with that.

I don't think there is a budget for or interest in maintaining a hosted runner.

rickwebiii commented 1 year ago

After doing a bit of research, it looks like it might be possible to use the Mesa lavapipe driver to run the compute kernels on the CPU. This needs a bunch of investigation, but I'll post an update if I get a test of this working.

burdges commented 2 months ago

At present dalek is single-threaded, but maybe multi-threaded CPU-based MSMs make sense here?

At https://zka.lc/ try large and small MSMs for curve25519-dalek and arkworks' curve25519. Arkworks has much lower latency than curve25519-dalek for the large MSM because arkworks is typically multi-threaded, while curve25519-dalek is much faster for the smaller MSMs because curve25519-dalek is actually optimized.
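For reference, a naive chunk-and-reduce sketch of a multi-threaded CPU MSM built on top of the existing single-threaded primitive (rayon and the chunk size are my assumptions; a real implementation would more likely parallelize the bucket work inside a Pippenger-style MSM):

```rust
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::scalar::Scalar;
use curve25519_dalek::traits::{Identity, VartimeMultiscalarMul};
use rayon::prelude::*;

/// Split the inputs into chunks, run the existing single-threaded vartime
/// MSM on each chunk in parallel, then add up the partial results.
fn par_msm(scalars: &[Scalar], points: &[EdwardsPoint]) -> EdwardsPoint {
    const CHUNK: usize = 1 << 12; // arbitrary; would need tuning
    scalars
        .par_chunks(CHUNK)
        .zip(points.par_chunks(CHUNK))
        .map(|(s, p)| EdwardsPoint::vartime_multiscalar_mul(s.iter(), p.iter()))
        .reduce(EdwardsPoint::identity, |a, b| &a + &b)
}
```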

As an aside, I've argued that zka.lc should show/support single-threaded benchmarks, but ZK proof people never discuss single-threaded performance benchmarks.