Open stevefan1999-personal opened 1 month ago
I made a minimally viable benchmark program here in https://github.com/stevefan1999-personal/ed25519-gen-speed-test. I used the fastest random number generator possible to get rid of its effect.
On my 3700X PC, this generates:
7 keys per second on average
268020 keys per second on average
284770 keys per second on average
264184 keys per second on average
252750 keys per second on average
245968 keys per second on average
241423 keys per second on average
237701 keys per second on average
234541 keys per second on average
232072 keys per second on average
229985 keys per second on average
228280 keys per second on average
I reran the key generation benchmark with "fast" featured enabled back on for ed25519-dalek, and performance improved 3 times, and this is the hot path tree diagram that I got
Take a look at the MultiscalarMul
trait: https://docs.rs/curve25519-dalek/4.1.3/curve25519_dalek/edwards/struct.EdwardsPoint.html#impl-MultiscalarMul-for-EdwardsPoint
Take a look at the
MultiscalarMul
trait: https://docs.rs/curve25519-dalek/4.1.3/curve25519_dalek/edwards/struct.EdwardsPoint.html#impl-MultiscalarMul-for-EdwardsPoint
May I get some papers or lecture notes on how to use that?
It takes an iterator over Scalar
s and an iterator over EdwardsPoint
s and batch multiplies the scalars by the points, so you can generate a batch of candidate scalars and multiply several of them by the base point at once
It takes an iterator over
Scalar
s and an iterator overEdwardsPoint
s and batch multiplies the scalars by the points, so you can generate a batch of candidate scalars and multiply several of them by the base point at once
If I read that correctly, I can use multiscalar_mul to do something that means scalar * constants::ED25519_BASEPOINT_TABLE
but similar to a dot product operation instead, right?
Okay, so I came up with this:
let mut rng = Rng::new();
let scalars: Vec<ExpandedSecretKey> = (0..16)
.map(|_| {
let mut secret_key: SecretKey = SecretKey::default();
rng.fill(&mut secret_key);
ExpandedSecretKey::from(&secret_key)
})
.collect();
EdwardsPoint::multiscalar_mul(
scalars.iter().map(|x| x.scalar),
scalars.iter().map(|_| constants::ED25519_BASEPOINT_POINT),
);
However, it returns an Edwards point, so how do I separate the multiplied (scalar * ed25519 base point) to their own respective points?
Aah sorry, that's the wrong API shape for your problem
Aah sorry, that's the wrong API shape for your problem
Ahh I think it is close. I just need to generate N scalars from entropy, and get the [for some scalar in scalars (scalar * ed25519 base point)] without having it dot product summed, but do so in an optimized, data-parallel way. This way I don't have to blow up the CPU cache while doing SIMD in lockstep.
Also, I tried to simplify the public key generation, is this correct?
let mut secret_key = SecretKey::default();
csprng.fill_bytes(&mut secret_key);
let pub_key =
EdwardsPoint::mul_base(&Scalar::from_bytes_mod_order(clamp_integer(
Sha512::default().chain_update(secret_key).finalize()[..32]
.try_into()
.unwrap(),
)))
.compress()
.to_bytes();
I'm trying to generate vanity wallet addresses, which are originally in ed25519, so I start by generating random bytes from a PRNG enough to cast into a secret key and then generate key pairs, but I ran into performance issue so I want to debug it:
On my 3700X PC, and running in full 16 threads, it only runs 300K keypairs/second with a precise monotonic clock. However, the C vanity address generator on my same PC, with same parameters can generate 30M keypairs/second with random data, so I think there could be something wrong here
So I started digging, which means we start from here: https://github.com/dalek-cryptography/curve25519-dalek/blob/d5ef57a3c2c4570720cafe8a112f5dd3ff56905e/ed25519-dalek/src/signing.rs#L102-L108
And then here: https://github.com/dalek-cryptography/curve25519-dalek/blob/d5ef57a3c2c4570720cafe8a112f5dd3ff56905e/ed25519-dalek/src/signing.rs#L788-L794
And then here: https://github.com/dalek-cryptography/curve25519-dalek/blob/d5ef57a3c2c4570720cafe8a112f5dd3ff56905e/ed25519-dalek/src/hazmat.rs#L57-L76
And then here: https://github.com/dalek-cryptography/curve25519-dalek/blob/d5ef57a3c2c4570720cafe8a112f5dd3ff56905e/curve25519-dalek/src/scalar.rs#L237-L246
And eventually here: https://github.com/dalek-cryptography/curve25519-dalek/blob/d5ef57a3c2c4570720cafe8a112f5dd3ff56905e/curve25519-dalek/src/scalar.rs#L1127-L1128
I noticed that despite I'm using AVX512, the flame graph shows that those
UnpackedScalar::mul_internal
andUnpackedScalar::montgomery_reduce
are still using generic serial implementation and took a large group of execution time, I'm not sure if this can be sped-up and improved? Or maybe I should look for something else for address generation?I noticed that the wallet vanity address generator in C uses SUPERCOP's version, and their ed25519 key generation is much faster as well.