rickwebiii opened this issue 1 year ago (status: Open)
> Is there any interest in merging these changes back upstream?
The main thing that seems tricky about upstreaming this is the lack of a CI story. Is there any way to test it in an environment like GitHub Actions without access to a hardware GPU?
> If it's not appropriate to integrate this work into curve25519-dalek, we need an API for accessing the currently internal Field members on EdwardsPoint and an API for converting/accessing Scalars as Scalar29.
These types are both deliberately kept out of the public API as they're easily misused internals. However, we have discussed exposing at least FieldElement under a special feature like hazmat.
> The main thing that seems tricky about upstreaming this is the lack of a CI story. Is there any way to test it in an environment like GitHub Actions without access to a hardware GPU?
If you use a hosted runner, you can have whatever hardware you want, but you have to pay the maintenance costs that come with that. We have a work item on our end to address this for our use case so maybe we could try it out and let you know how annoying (or not) this is.
> These types are both deliberately kept out of the public API as they're easily misused internals. However, we have discussed exposing at least FieldElement under a special feature like hazmat.
As a user, I appreciated not accidentally dealing with these things. A hazmat feature is a clean way to deal with this.
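For concreteness, a hazmat-style gate might look roughly like the sketch below; the feature name, module paths, and re-exports are illustrative guesses, not curve25519-dalek's actual layout.

```rust
// Cargo.toml (illustrative):
// [features]
// hazmat = []

// lib.rs: only compile the escape hatch when the feature is enabled.
#[cfg(feature = "hazmat")]
pub mod hazmat {
    // Hypothetical re-exports of otherwise-internal types, so an external
    // GPU backend crate could pack/unpack them itself.
    pub use crate::backend::serial::u64::field::FieldElement51;
    pub use crate::backend::serial::u32::scalar::Scalar29;
}
```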
> If you use a hosted runner, you can have whatever hardware you want, but you have to pay the maintenance costs that come with that.
I don't think there is a budget for or interest in maintaining a hosted runner.
After doing a bit of research, it might be possible to use the mesa lavapipe driver to run the compute kernels on the CPU. This needs a bunch of investigation, but I'll post an update if I get a test of this working.
At present dalek is single-threaded, but maybe multi-threaded CPU-based MSMs make sense here?
At https://zka.lc/ try large and small MSMs for curve25519-dalek and arkworks' curve25519. Arkworks has much lower latency than curve25519-dalek for the large MSM because arkworks is typically multi-threaded, while curve25519-dalek is much faster for the smaller MSMs because curve25519-dalek is actually optimized.
As an aside, I've argued zka.lc should show/support single-threaded benchmarks, but ZK proof people never discuss single-threaded performance benchmarks.
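For reference, a minimal sketch of what a multi-threaded CPU MSM could look like on top of curve25519-dalek's existing single-threaded vartime_multiscalar_mul, splitting the input across rayon threads and summing the partial results (the chunking strategy here is illustrative):

```rust
use curve25519_dalek::ristretto::RistrettoPoint;
use curve25519_dalek::scalar::Scalar;
use curve25519_dalek::traits::{Identity, VartimeMultiscalarMul};
use rayon::prelude::*;

/// Split the MSM into one chunk per thread, run the serial vartime MSM on
/// each chunk in parallel, then add the partial sums together.
fn parallel_msm(scalars: &[Scalar], points: &[RistrettoPoint]) -> RistrettoPoint {
    let threads = rayon::current_num_threads();
    let chunk = ((scalars.len() + threads - 1) / threads).max(1);
    scalars
        .par_chunks(chunk)
        .zip(points.par_chunks(chunk))
        .map(|(s, p)| RistrettoPoint::vartime_multiscalar_mul(s.iter(), p.iter()))
        .reduce(RistrettoPoint::identity, |a, b| a + b)
}
```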
A ZKP system we're implementing internally uses a modified Bulletproofs inner product proof. We want to make it faster, so we implemented scalar multiplication (SM) in Metal (OpenCL/Vulkan coming soon). This operation accounts for 75% of our proof's runtime. Is there any interest in merging these changes back upstream?
Note: we've sped up computing many independent scalar multiplications in parallel, not a multi-scalar multiplication (MSM), though we may do that in the future as well.
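To make that distinction concrete, here is a small CPU-side illustration in plain curve25519-dalek (not the GPU code): the batch form returns one point per (scalar, point) pair, while an MSM folds the same products into a single point.

```rust
use curve25519_dalek::ristretto::RistrettoPoint;
use curve25519_dalek::scalar::Scalar;
use curve25519_dalek::traits::Identity;

// Batch scalar multiplication: n independent products, n output points
// (the kind of operation being sped up here).
fn batch_scalar_mul(scalars: &[Scalar], points: &[RistrettoPoint]) -> Vec<RistrettoPoint> {
    scalars.iter().zip(points).map(|(s, p)| s * p).collect()
}

// Multi-scalar multiplication (MSM): the same products summed into one point.
fn msm(scalars: &[Scalar], points: &[RistrettoPoint]) -> RistrettoPoint {
    scalars
        .iter()
        .zip(points)
        .fold(RistrettoPoint::identity(), |acc, (s, p)| acc + s * p)
}
```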
The basic idea
We added ScalarVec and RistrettoPointVec types. These hold an array of Scalar or EdwardsPoint values respectively, with the data packed so the GPU can perform coalesced loads and stores. I.e. in a vector with n elements, the jth limb of the ith point is at arr[j * n + i]. The RistrettoPoint and Scalar types feature additional methods for packing and unpacking from device memory into registers.
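As a CPU-side illustration of that layout (not the actual ScalarVec/RistrettoPointVec API), a limb-major packing could look like the following, where limb j of element i lives at data[j * n + i] so adjacent GPU threads reading the same limb index touch adjacent words:

```rust
/// Illustrative limb-major packing: `data[j * n + i]` holds limb `j` of element `i`.
struct PackedLimbs {
    n: usize,       // number of packed elements
    data: Vec<u32>, // LIMBS * n words, limb-major
}

const LIMBS: usize = 8; // e.g. a 256-bit value split into 8 x 32-bit limbs

impl PackedLimbs {
    fn pack(elements: &[[u32; LIMBS]]) -> Self {
        let n = elements.len();
        let mut data = vec![0u32; LIMBS * n];
        for (i, elem) in elements.iter().enumerate() {
            for (j, limb) in elem.iter().enumerate() {
                data[j * n + i] = *limb; // limb j of element i
            }
        }
        PackedLimbs { n, data }
    }

    fn unpack(&self, i: usize) -> [u32; LIMBS] {
        let mut out = [0u32; LIMBS];
        for (j, limb) in out.iter_mut().enumerate() {
            *limb = self.data[j * self.n + i];
        }
        out
    }
}
```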
Performance

We've so far only run this on Mac hardware (what Metal supports) and have pretty good results. On an entry-level 2020 M1 MacBook Air, the integrated GPU achieves 431k SM/s, while using every core of the CPU gets you only 150k SM/s (64-bit backend). For comparison, our measurements indicate that the 64-bit backend on aarch64 is slightly faster than running on a c6 AWS x86_64 instance with the AVX2 backend (single-core performance).
We've also run this benchmark on a co-worker's M2 Max with a 30-core GPU, which can perform 1.8M SM/s. A low-power laptop GPU beats a 64-core AWS instance by 1.8x (AVX2 backend).
Work remaining
FieldElement* and Scalar* types for different backends, so users can use any bit width and backend on the CPU.

Why you might not want this
At the very least, these GPU backends should be enabled under a feature. While the GPU shaders are transliterated from Rust code, they aren't Rust and don't have the safety guarantees.
If it's not appropriate to integrate this work into curve25519-dalek, we need an API for accessing the currently internal Field members on EdwardsPoint and an API for converting/accessing Scalars as Scalar29.