dalek-cryptography / curve25519-dalek

A pure-Rust implementation of group operations on Ristretto and Curve25519
Other
876 stars 443 forks source link

Help out for batched random key generation? #705

Open stevefan1999-personal opened 1 day ago

stevefan1999-personal commented 1 day ago

I'm trying to generate vanity wallet addresses, which are originally in ed25519, so I start by generating random bytes from a PRNG enough to cast into a secret key and then generate key pairs, but I ran into performance issue so I want to debug it:

for _ in 0..cli.threads {
    let thr = {
        let counter = counter.clone();
        move || {
            let mut csprng = fastrand::Rng::new();
            let mut pair = SecretKey::default();

            loop {
                let key: SigningKey = SigningKey::from_bytes(&pair);
                counter.fetch_add(1, Ordering::Relaxed);
            }
        }
    };

    thread::spawn(thr);
}

On my 3700X PC, and running in full 16 threads, it only runs 300K keypairs/second with a precise monotonic clock. However, the C vanity address generator on my same PC, with same parameters can generate 30M keypairs/second with random data, so I think there could be something wrong here

So I started digging, which means we start from here: https://github.com/dalek-cryptography/curve25519-dalek/blob/d5ef57a3c2c4570720cafe8a112f5dd3ff56905e/ed25519-dalek/src/signing.rs#L102-L108

And then here: https://github.com/dalek-cryptography/curve25519-dalek/blob/d5ef57a3c2c4570720cafe8a112f5dd3ff56905e/ed25519-dalek/src/signing.rs#L788-L794

And then here: https://github.com/dalek-cryptography/curve25519-dalek/blob/d5ef57a3c2c4570720cafe8a112f5dd3ff56905e/ed25519-dalek/src/hazmat.rs#L57-L76

And then here: https://github.com/dalek-cryptography/curve25519-dalek/blob/d5ef57a3c2c4570720cafe8a112f5dd3ff56905e/curve25519-dalek/src/scalar.rs#L237-L246

And eventually here: https://github.com/dalek-cryptography/curve25519-dalek/blob/d5ef57a3c2c4570720cafe8a112f5dd3ff56905e/curve25519-dalek/src/scalar.rs#L1127-L1128

I noticed that despite I'm using AVX512, the flame graph shows that those UnpackedScalar::mul_internal and UnpackedScalar::montgomery_reduce are still using generic serial implementation and took a large group of execution time, I'm not sure if this can be sped-up and improved? Or maybe I should look for something else for address generation?

I noticed that the wallet vanity address generator in C uses SUPERCOP's version, and their ed25519 key generation is much faster as well.

stevefan1999-personal commented 1 day ago

I made a minimally viable benchmark program here in https://github.com/stevefan1999-personal/ed25519-gen-speed-test. I used the fastest random number generator possible to get rid of its effect.

On my 3700X PC, this generates:

7 keys per second on average
268020 keys per second on average
284770 keys per second on average
264184 keys per second on average
252750 keys per second on average
245968 keys per second on average
241423 keys per second on average
237701 keys per second on average
234541 keys per second on average
232072 keys per second on average
229985 keys per second on average
228280 keys per second on average
stevefan1999-personal commented 1 day ago

I reran the key generation benchmark with "fast" featured enabled back on for ed25519-dalek, and performance improved 3 times, and this is the hot path tree diagram that I got image

tarcieri commented 1 day ago

Take a look at the MultiscalarMul trait: https://docs.rs/curve25519-dalek/4.1.3/curve25519_dalek/edwards/struct.EdwardsPoint.html#impl-MultiscalarMul-for-EdwardsPoint

stevefan1999-personal commented 1 day ago

Take a look at the MultiscalarMul trait: https://docs.rs/curve25519-dalek/4.1.3/curve25519_dalek/edwards/struct.EdwardsPoint.html#impl-MultiscalarMul-for-EdwardsPoint

May I get some papers or lecture notes on how to use that?

tarcieri commented 1 day ago

It takes an iterator over Scalars and an iterator over EdwardsPoints and batch multiplies the scalars by the points, so you can generate a batch of candidate scalars and multiply several of them by the base point at once

stevefan1999-personal commented 1 day ago

It takes an iterator over Scalars and an iterator over EdwardsPoints and batch multiplies the scalars by the points, so you can generate a batch of candidate scalars and multiply several of them by the base point at once

If I read that correctly, I can use multiscalar_mul to do something that means scalar * constants::ED25519_BASEPOINT_TABLE but similar to a dot product operation instead, right?

stevefan1999-personal commented 1 day ago

Okay, so I came up with this:

    let mut rng = Rng::new();

    let scalars: Vec<ExpandedSecretKey> = (0..16)
        .map(|_| {
            let mut secret_key: SecretKey = SecretKey::default();
            rng.fill(&mut secret_key);
            ExpandedSecretKey::from(&secret_key)
        })
        .collect();

    EdwardsPoint::multiscalar_mul(
        scalars.iter().map(|x| x.scalar),
        scalars.iter().map(|_| constants::ED25519_BASEPOINT_POINT),
    );

However, it returns an Edwards point, so how do I separate the multiplied (scalar * ed25519 base point) to their own respective points?

tarcieri commented 1 day ago

Aah sorry, that's the wrong API shape for your problem

stevefan1999-personal commented 1 day ago

Aah sorry, that's the wrong API shape for your problem

Ahh I think it is close. I just need to generate N scalars from entropy, and get the [for some scalar in scalars (scalar * ed25519 base point)] without having it dot product summed, but do so in an optimized, data-parallel way. This way I don't have to blow up the CPU cache while doing SIMD in lockstep.

Also, I tried to simplify the public key generation, is this correct?

                    let mut secret_key = SecretKey::default();
                    csprng.fill_bytes(&mut secret_key);
                    let pub_key =
                        EdwardsPoint::mul_base(&Scalar::from_bytes_mod_order(clamp_integer(
                            Sha512::default().chain_update(secret_key).finalize()[..32]
                                .try_into()
                                .unwrap(),
                        )))
                        .compress()
                        .to_bytes();