debris / tiny-keccak

An implementation of Keccak derived functions specified in FIPS-202, SP800-185 and KangarooTwelve
Creative Commons Zero v1.0 Universal
193 stars 49 forks source link

Optimisations and convert to use crunchy #22

Closed eira-fransham closed 6 years ago

eira-fransham commented 6 years ago

Basically I just converted a bunch of mutable variables to be constants and made the array zeroing simpler/faster, mostly by using the macros in crunchy. Unrolling for i in 0..24 outer loop leads to a noticeable increase in compilation time but also massively increases the speed (probably because it allows more optimisations to be done on the array[i][...] accesses). I was looking at this repo because I wanted to convert it to use simd but it turns out there was some low-hanging optimisation fruit that doesn't require nightly.

Benchcmp results (I ran the benches 3 times each before and after this PR so there are 3 copies of each benching function):

 name                             pre.bench ns/iter  post.bench ns/iter  diff ns/iter   diff %  speedup 
 bench_sha3_256_input_4096_bytes  26,662 (153 MB/s)  20,100 (203 MB/s)         -6,562  -24.61%   x 1.33 
 bench_sha3_256_input_4096_bytes  26,203 (156 MB/s)  19,995 (204 MB/s)         -6,208  -23.69%   x 1.31 
 bench_sha3_256_input_4096_bytes  26,322 (155 MB/s)  20,157 (203 MB/s)         -6,165  -23.42%   x 1.31 
 keccakf_u64                      663 (37 MB/s)      496 (50 MB/s)               -167  -25.19%   x 1.34 
 keccakf_u64                      650 (38 MB/s)      491 (50 MB/s)               -159  -24.46%   x 1.32 
 keccakf_u64                      699 (35 MB/s)      486 (51 MB/s)               -213  -30.47%   x 1.44 

One unresolved question is whether the loop at line 75 should be replaced with something like:

for x in 0..5 {
    let mut out = 0;

    unroll! {
        for y_count in 0..5 {
            let y = y_count * 5;
            out ^= a[x + y];
        }
    }

    arrays[i][x] = out;
}

with mem::uninitialized for the initialisation of arrays. This should be more consistently optimised since it says what we actually want to happen, but on my computer it's slower, about 5%. If someone could test this against the version that's committed here on a different computer that'd be really useful.

debris commented 6 years ago

nice! I can confirm that on my machine it's also about 20% faster! 🎉

newpavlov commented 6 years ago

@Vurich Do you mind if I'll copy your code to the sha3 crate?

eira-fransham commented 6 years ago

@newpavlov It's CC0, you can copy whatever you like.