Poor Scalar performance

dalek-cryptography / curve25519-dalek

A pure-Rust implementation of group operations on Ristretto and Curve25519

Other

894 stars 460 forks source link

Poor Scalar performance #507

Closed kayabaNerve closed 1 year ago

kayabaNerve commented 1 year ago

dalek scalars are about 10x as inperformant as k256 Scalars, for both addition and multiplication, according to some recent benchmarking I had performed. While that isn't a fair comparison, one being curve25519 and one being secp256k1, it is a striking difference that shows something is likely going on with dalek's impl. The following block of code was pointed out to me as the likely culprit: https://github.com/dalek-cryptography/curve25519-dalek/blob/b2d0f0ecd7808f194e67a2fcd4c47dd5be8a0697/src/scalar.rs#L367-L375 Please note I have not flamegraphed this to confirm.

Would it be possible to expose an addition which does not hedge against the potential use of an API in existence for x25519, greatly accelerating Ed25519 operations? Potentially a distinct ReducedScalar type entirely?

tarcieri commented 1 year ago

I'm curious if there are actually good use cases for Scalar::from_bits to be public, or if they're covered by Scalar::from_bits_clamped.

Also curious what overhead using bytes as the internal "packed" representation of a scalar might be adding (if any) or if it impacts LLVM inlining (k256 uses an array of limbs, similar to UnpackedScalar).

kayabaNerve commented 1 year ago

Beyond x25519, which I can't comment on, I think the remaining concern may be how the Group API requires the modulus as a Scalar (not just via char_le_bits), which requires unreduced Scalars as a possibility.

tarcieri commented 1 year ago

I think the remaining concern may be how the Group API requires the modulus as a Scalar

It doesn't though? For Scalar it'd be the Field and PrimeField traits, and of those there's only PrimeField::MODULUS, which is a freeform string.

not just via char_le_bits

(note: there's that on the PrimeFieldBits trait as well)

But beyond that there's nothing that requires an unreduced Scalar in the API, in either ff or group

kayabaNerve commented 1 year ago

... you are right. I've even implemented 0.13 into my own libs. I have no idea why I thought it did that. Sorry.

kayabaNerve commented 1 year ago

BASEPOINT_ORDER also currently exists as a way to access an unreduced Scalar.

tarcieri commented 1 year ago

Indeed. Wonder if that could be made private at least.

rozbb commented 1 year ago

Clamping makes the given scalar a multiple of 8, so we'd definitely still need to expose from_bits for ff. But maybe from_bits should do a reduction? I think it used to but it doesn't anymore. @tarcieri do you recall?

kayabaNerve commented 1 year ago

Considering the order of the points actually generatable from G is 2^252 + ..., and the Scalar type is documented as All arithmetic on Scalars is done modulo \( \ell \)., I'd call for Scalars to be canonical (or reduced) to l when entered unless there's breaking behavior caused by x25519.

If x25519 does have breaking behavior, I'd note that anything expecting those bits to not be cleared (or reduced out) could be shimmed by a function which clears the bits, performs the scalarmul, and then for the set bits, adds the relevant point from https://docs.rs/curve25519-dalek/latest/curve25519_dalek/constants/constant.EIGHT_TORSION.html. Accordingly, while an API breaking change, it'd still be possible to provide a safe, always % l, Scalar API that still acknowledges/supports all use cases.

tarcieri commented 1 year ago

Was this addressed by #519?

kayabaNerve commented 1 year ago

I'd assume so. @rozbb noted the performance benefits of their PR.

rozbb commented 1 year ago

Ok I'll close for now. @kayabaNerve if you end up running some benchmarks against other scalar impls and find that ours is noticeably worse, feel free to re-open. Thanks!