relative performance of Mongomery and Edwards scalarmul

oconnor663 commented 9 months ago

Apologies for filing an issue that's more of a question :) My very basic understanding of the performance relationship between Curve25519 and Edwards25519 is that the former is faster when you're scalarmult'ing an arbitrary point, like you do in the second half of Diffie-Hellman. But when I benchmark scalarmult in this library (i5-1145G7 Linux laptop), it looks like the Edwards curve is faster:

use criterion::{criterion_group, criterion_main, Criterion};

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("curve25519_dalek Montgomery scalarmul", |b| {
        let point = curve25519_dalek::montgomery::MontgomeryPoint::mul_base_clamped(rand::random());
        b.iter(|| point.mul_clamped(rand::random()))
    });

    c.bench_function("curve25519_dalek Edwards scalarmul", |b| {
        let point = curve25519_dalek::edwards::EdwardsPoint::mul_base_clamped(rand::random());
        b.iter(|| point.mul_clamped(rand::random()))
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

$ cargo bench
   Compiling scratch v0.1.0 (/tmp/tmp.x9omrxNKmT/scratch)
    Finished bench [optimized] target(s) in 1.17s
     Running unittests src/main.rs (target/release/deps/scratch-67fe8039c398f6ac)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/bench.rs (target/release/deps/bench-3a6cb2fec14f5ab6)
curve25519_dalek Montgomery scalarmul
                        time:   [41.320 µs 41.428 µs 41.561 µs]
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

curve25519_dalek Edwards scalarmul
                        time:   [23.729 µs 23.757 µs 23.789 µs]
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe

There's a big chance that my assumption going in was wrong, or that I'm not benchmarking what I think I'm benchmarking. Am I right that this is surprising? Does this suggest that doing Diffie-Hellman on Edwards25519 directly (not converting to the Mongtomery form first) would actually be faster than regular X25519?

tarcieri commented 9 months ago

curve25519-dalek implements fixed-based Montgomery multiplication in terms of Edwards multiplication:

https://github.com/dalek-cryptography/curve25519-dalek/blob/8e0cef5/curve25519-dalek/src/montgomery.rs#L123

Notably this allows it to leverage the precomputed basepoint tables for the Edwards form, so we don't need to maintain two different copies of the basepoint tables for the two different curve forms.

You're free to use either form for ECDH, though for interop purposes you may find it easier to use X25519.

oconnor663 commented 9 months ago

Fascinating, thanks for the lightning fast reply.

oconnor663 commented 9 months ago

Actually, I'm not sure we're talking about the same thing. What I'm trying to benchmark above is arbitrary/variable base scalarmult, via mul_clamped, which I think bottoms out at mul_bits_be (which by complete coincidence was introduced in the commit you linked to?). The calls to mul_base_clamped aren't part of the benchmarking loop. Am I right to think that that does not go through the Edwards form?

tarcieri commented 9 months ago

Aah yes, sorry I misread your benchmark. mul_base_clamped goes through mul_base where indeed mul_clamped goes through mul_bits_be. So fixed-base multiplication converts to Edwards form, whereas variable-base multiplication uses the Montgomery ladder, in case that's unclear.

oconnor663 commented 9 months ago

Yes that's what it looks like to me, which brings me back to the performance question: Isn't the Mongtomery ladder supposed to be faster than scalar multiplication on the Edwards curve? Is it weird that Edwards seems faster in this benchmark?

tarcieri commented 9 months ago

Edwards arithmetic in curve25519-dalek has multiple backends including highly optimized architecture-specific SIMD backends including ones for AVX2 and AVX-512, whereas the Montgomery backend is relatively simplistic.

oconnor663 commented 9 months ago

Got it, that makes sense. And yeah, I do notice a decent speedup (~17µs instead of ~23) when I benchmark nightly and pick up the AVX-512 optimizations. Thanks again!

dalek-cryptography / curve25519-dalek

relative performance of Mongomery and Edwards scalarmul #588