dalek-cryptography / curve25519-dalek

A pure-Rust implementation of group operations on Ristretto and Curve25519

Interest in GPU-accelerated scalar multiplications? #506

Open rickwebiii opened 1 year ago

rickwebiii commented 1 year ago

A ZKP system we're implementing internally uses a modified Bulletproofs inner product proof. We want to make it faster, so we implemented scalar multiplication (SM) in Metal (OpenCL/Vulkan coming soon). This operation accounts for 75% of our proof's runtime. Is there any interest in merging these changes back upstream?

Note that we've sped up computing many independent scalar multiplications in parallel, not a multi-scalar multiplication (MSM), though we may do that in the future as well.
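For concreteness, here is a rough sketch of that distinction in terms of the existing public curve25519-dalek API (the function names are mine, and this is plain sequential CPU code, not the GPU path):

```rust
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::scalar::Scalar;
use curve25519_dalek::traits::VartimeMultiscalarMul;

/// Many independent products k_i * P_i -- the operation this issue
/// parallelizes on the GPU (each output is a separate point).
fn batched_scalar_mul(scalars: &[Scalar], points: &[EdwardsPoint]) -> Vec<EdwardsPoint> {
    scalars.iter().zip(points.iter()).map(|(k, p)| k * p).collect()
}

/// A single multi-scalar multiplication, the sum over i of k_i * P_i
/// (one output point) -- not covered by this work yet.
fn msm(scalars: &[Scalar], points: &[EdwardsPoint]) -> EdwardsPoint {
    EdwardsPoint::vartime_multiscalar_mul(scalars.iter(), points.iter())
}
```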

The basic idea

Performance

So far we've only run this on Mac hardware (which is what Metal supports) and have pretty good results. On an entry-level 2020 M1 MacBook Air, the integrated GPU achieves 431k SM/s, while using every core of the CPU gets you only 150k SM/s (64-bit backend). For comparison, our measurements indicate that the 64-bit backend on aarch64 is slightly faster than the AVX2 backend running on a c6 AWS x86_64 instance (single-core performance).

We've also run this benchmark on a co-worker's M2 Max with a 30-core GPU, which performs 1.8M SM/s. A low-power laptop GPU beats a 64-core AWS instance (AVX2 backend) by 1.8x.
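The benchmark harness isn't shown in this issue; a minimal CPU-side sketch of how an SM/s figure like the ones above could be measured (rayon, rand, and the batch size are my assumptions) might look like:

```rust
use std::time::Instant;

use curve25519_dalek::constants::ED25519_BASEPOINT_POINT;
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::scalar::Scalar;
use rand::rngs::OsRng;
use rayon::prelude::*;

fn main() {
    // Batch size is arbitrary; just large enough to keep every core busy.
    let n = 100_000;
    let scalars: Vec<Scalar> = (0..n).map(|_| Scalar::random(&mut OsRng)).collect();
    let points: Vec<EdwardsPoint> = vec![ED25519_BASEPOINT_POINT; n];

    let start = Instant::now();
    // One independent scalar multiplication per element, spread across cores.
    let _products: Vec<EdwardsPoint> = scalars
        .par_iter()
        .zip(points.par_iter())
        .map(|(k, p)| k * p)
        .collect();
    let elapsed = start.elapsed().as_secs_f64();

    println!("{:.0} SM/s across all cores", n as f64 / elapsed);
}
```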

Work remaining

Why you might not want this

At the very least, these GPU backends should be enabled under a feature. While the GPU shaders are transliterated from Rust code, they aren't Rust and don't have the safety guarantees.
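A cargo feature gate along these lines could keep the GPU path strictly opt-in (the feature and module names here are hypothetical, not part of the crate):

```rust
// Hypothetical feature gating: the Metal backend and its entry point only
// compile when the (made-up) `gpu-metal` feature is enabled; without it,
// nothing about the crate changes.
#[cfg(feature = "gpu-metal")]
mod metal_backend;

#[cfg(feature = "gpu-metal")]
pub use metal_backend::batch_scalar_mul;
```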

If it's not appropriate to integrate this work into curve25519-dalek, we need an API for accessing the currently internal Field members on EdwardsPoint and an API for converting/accessing Scalars as Scalar29.

tarcieri commented 1 year ago

> Is there any interest in merging these changes back upstream?

The main thing that seems tricky about upstreaming this is the lack of a CI story. Is there any way to test it in an environment like GitHub Actions without access to a hardware GPU?

> If it's not appropriate to integrate this work into curve25519-dalek, we need an API for accessing the currently internal Field members on EdwardsPoint and an API for converting/accessing Scalars as Scalar29.

These types are both deliberately kept out of the public API as they're easily misused internals. However, we have discussed exposing at least FieldElement under a special feature like hazmat.
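As a sketch of what that could look like (the `hazmat` feature name comes from the comment above; the module path and re-export are illustrative, not an actual crate API):

```rust
// Illustrative only: gate the internal field type behind an explicitly
// scary-sounding feature so it never appears in the default public API.
#[cfg(feature = "hazmat")]
pub mod hazmat {
    /// Re-export of the backend field element type; easy to misuse.
    pub use crate::field::FieldElement;
}
```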

rickwebiii commented 1 year ago

> The main thing that seems tricky about upstreaming this is the lack of a CI story. Is there any way to test it in an environment like GitHub Actions without access to a hardware GPU?

If you use a self-hosted runner, you can have whatever hardware you want, but you have to pay the maintenance costs that come with that. We have a work item on our end to address this for our use case, so maybe we could try it out and let you know how annoying (or not) it is.

> These types are both deliberately kept out of the public API as they're easily misused internals. However, we have discussed exposing at least FieldElement under a special feature like hazmat.

As a user, I've appreciated not having to deal with these internals by accident. A hazmat feature is a clean way to handle this.

tarcieri commented 1 year ago

> If you use a self-hosted runner, you can have whatever hardware you want, but you have to pay the maintenance costs that come with that.

I don't think there is a budget for or interest in maintaining a hosted runner.

rickwebiii commented 1 year ago

After doing a bit of research, it looks like it might be possible to use the Mesa lavapipe driver to run the compute kernels on the CPU. This needs a bunch of investigation, but I'll post an update if I get a test of this working.

burdges commented 2 months ago

At present dalek is single-threaded, but maybe multi-threaded CPU-based MSMs make sense here?

At https://zka.lc/ try large and small MSMs for curve25519-dalek and arkworks' curve25519. Arkworks has much lower latency than curve25519-dalek for the large MSM because arkworks is typically multi-threaded, while curve25519-dalek is much faster for the smaller MSMs because curve25519-dalek is actually optimized.
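For reference, a naive chunk-and-reduce sketch of a multi-threaded CPU MSM built on top of the existing single-threaded primitive (rayon and the chunk size are my assumptions; a real implementation would more likely parallelize the bucket work inside a Pippenger-style MSM):

```rust
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::scalar::Scalar;
use curve25519_dalek::traits::{Identity, VartimeMultiscalarMul};
use rayon::prelude::*;

/// Split the inputs into chunks, run the existing single-threaded vartime
/// MSM on each chunk in parallel, then add up the partial results.
fn par_msm(scalars: &[Scalar], points: &[EdwardsPoint]) -> EdwardsPoint {
    const CHUNK: usize = 1 << 12; // arbitrary; would need tuning
    scalars
        .par_chunks(CHUNK)
        .zip(points.par_chunks(CHUNK))
        .map(|(s, p)| EdwardsPoint::vartime_multiscalar_mul(s.iter(), p.iter()))
        .reduce(EdwardsPoint::identity, |a, b| &a + &b)
}
```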

As an aside, I've argued that zka.lc should show/support single-threaded benchmarks, but ZK proof people never discuss single-threaded performance benchmarks.