Basically I just converted a bunch of mutable variables to constants and made the array zeroing simpler/faster, mostly by using the macros in crunchy. Unrolling the for i in 0..24 outer loop leads to a noticeable increase in compilation time but also massively increases the speed (probably because it allows more optimisations to be done on the array[i][...] accesses). I was looking at this repo because I wanted to convert it to use SIMD, but it turns out there was some low-hanging optimisation fruit that doesn't require nightly.
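To illustrate why unrolling might help with the array[i][...] accesses, here is a minimal, dependency-free sketch. It is not the code in this PR: the function names are hypothetical, it uses 3 rows instead of 24, and the unrolling is written out by hand rather than with crunchy's unroll!. The point is only that once the outer index is a literal, every access has an offset known at compile time.

```rust
/// Rolled form: the outer index `i` is a loop variable, so each
/// `rows[i][x]` access is expressed in terms of a run-time value.
/// (Hypothetical helper, for illustration only.)
fn xor_rows_rolled(rows: &[[u64; 5]; 3]) -> [u64; 5] {
    let mut acc = [0u64; 5];
    for i in 0..3 {
        for x in 0..5 {
            acc[x] ^= rows[i][x];
        }
    }
    acc
}

/// Hand-unrolled form: the outer index is a literal (0, 1, 2), so the
/// row offsets are constants and the three XORs collapse into a single
/// expression per column.
fn xor_rows_unrolled(rows: &[[u64; 5]; 3]) -> [u64; 5] {
    let mut acc = [0u64; 5];
    for x in 0..5 {
        acc[x] = rows[0][x] ^ rows[1][x] ^ rows[2][x];
    }
    acc
}
```

Both functions return the same result; whether the compiler already unrolls the first form on its own depends on the loop body and optimisation level, which is presumably why doing it explicitly made a difference here.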
Benchcmp results (I ran the benches 3 times each before and after this PR so there are 3 copies of each benching function):
One unresolved question is whether the loop at line 75 should be replaced with something like:
    for x in 0..5 {
        let mut out = 0;
        unroll! {
            for y_count in 0..5 {
                let y = y_count * 5;
                out ^= a[x + y];
            }
        }
        arrays[i][x] = out;
    }
with mem::uninitialized for the initialisation of arrays. This should be more consistently optimised since it says what we actually want to happen, but on my computer it's about 5% slower. If someone could test this against the version committed here on a different computer, that'd be really useful.
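For anyone who wants to compare the two formulations outside the crate, here is a standalone sketch. It rests on several assumptions: that the committed loop accumulates into a zero-initialised array (implied above but not shown), that the state is 25 lanes indexed as x + 5 * y, and that plain for loops stand in for unroll!. The function names are hypothetical, and the second version zero-initialises its output instead of using mem::uninitialized so the sketch stays in safe Rust.

```rust
/// Column parities, accumulating into a zeroed array.
/// Assumed shape of the committed version (hypothetical name).
fn parity_accumulate(a: &[u64; 25]) -> [u64; 5] {
    let mut c = [0u64; 5];
    for y in 0..5 {
        for x in 0..5 {
            c[x] ^= a[x + 5 * y];
        }
    }
    c
}

/// Column parities, building each entry in a local and writing it once.
/// Mirrors the proposed loop, minus `unroll!` and `mem::uninitialized`.
fn parity_write_once(a: &[u64; 25]) -> [u64; 5] {
    // Zeroed here only to keep the sketch safe; every slot is overwritten.
    let mut c = [0u64; 5];
    for x in 0..5 {
        let mut out = 0u64;
        for y_count in 0..5 {
            let y = y_count * 5;
            out ^= a[x + y];
        }
        c[x] = out;
    }
    c
}
```

The two functions should always agree on their output, so any difference in a benchmark comes down to how each form is optimised, not to what it computes.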