Open smallnamespace opened 4 years ago
Very cool. I think we can probably polish it to be more idiomatic Julia over time. I've been trying to make this one faster using SIMD intrinsics on my machine for a while now, mostly without success.
My preference would be to just use `NTuple`s with `unsafe_store!` for now. I think getting AOT compilation working consistently on the benchmarks-game machine might be a ways off. And it might not be accepted at all, depending on whether the maintainer wants to deal with the headache.
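For anyone following along, here is a minimal sketch of what storing an `NTuple` through a raw pointer could look like. The helper names `store_pair!` and `load_pair` are hypothetical and are not taken from any of the benchmark files; the sketch only illustrates the general `unsafe_store!` pattern, not the actual implementation discussed here.

```julia
# Hypothetical helpers (not from the benchmark code): move a pair of Float64
# values in or out of a Vector with a single unsafe_store! / unsafe_load.

function store_pair!(dest::Vector{Float64}, v::NTuple{2,Float64}, i::Int)
    # unsafe_store! performs no bounds checking, so check by hand.
    1 <= i <= length(dest) - 1 || throw(BoundsError(dest, i:i+1))
    GC.@preserve dest begin
        # Reinterpret the element pointer as a pointer to a 2-tuple.
        p = reinterpret(Ptr{NTuple{2,Float64}}, pointer(dest, i))
        unsafe_store!(p, v)  # writes dest[i] and dest[i+1]
    end
    return dest
end

function load_pair(src::Vector{Float64}, i::Int)
    1 <= i <= length(src) - 1 || throw(BoundsError(src, i:i+1))
    v = GC.@preserve src begin
        p = reinterpret(Ptr{NTuple{2,Float64}}, pointer(src, i))
        unsafe_load(p)       # reads src[i] and src[i+1] as a tuple
    end
    return v
end
```

The `GC.@preserve` blocks keep the array alive while the raw pointer is in use. Whether this actually compiles down to a single vector store would need to be checked with `@code_native` on the target CPU.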
Here are a couple of messy implementations that, at least on my laptop with `--cpu-target=core2` (the architecture of the actual BenchmarksGame test machine), beat the current `...simd.jl` by about 60% and 40%.

I would like some feedback before cleaning this up further (and getting too deep in this rabbit hole 🙂), in particular on whether this is helpful for showing off the language, since the code is getting far from idiomatic Julia.
A few caveats:
- [...] `StaticArrays`), so this awaits Julia AOT (#35) to show real gains; or I can switch to using `NTuple`s with unsafe stores.

Btw, the `...unroll.jl` file has a hacky macro that fully unrolls some of the inner loops. This mimics how Rust #7 achieves its speedup: `rustc` is smart enough to automatically unroll the (outer) for loops inside `advance`, e.g. `rsqrt` is seen 5 times in decompiled asm.

I didn't go all the way to unrolling the stride-2 loop, but could be persuaded to hack something up just to see how much improvement can be found.
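To illustrate the general idea of a full-unroll macro, here is a minimal sketch. It is not the macro from the `...unroll.jl` file; the name `@unroll` and the restriction to literal integer bounds are assumptions made for this example.

```julia
# Hypothetical macro (an illustration, not the one from the unroll file):
# expands `@unroll for i in lo:hi ... end`, where lo and hi are integer
# literals, into one copy of the loop body per iteration.
macro unroll(loop)
    (loop isa Expr && loop.head === :for) ||
        error("@unroll expects a for loop")
    itervar = loop.args[1].args[1]   # loop variable, e.g. :i
    range   = loop.args[1].args[2]   # range expression, e.g. :(1:3)
    body    = loop.args[2]
    (range isa Expr && range.head === :call &&
     range.args[1] === :(:) && length(range.args) == 3) ||
        error("@unroll expects a literal range lo:hi")
    lo, hi = range.args[2], range.args[3]
    (lo isa Integer && hi isa Integer) ||
        error("@unroll expects integer literal bounds")
    # Each copy of the body binds the loop variable to a constant via `let`.
    copies = [Expr(:let, Expr(:(=), itervar, i), body) for i in lo:hi]
    return esc(Expr(:block, copies...))
end

# Usage inside a function, so that `acc` is an ordinary local variable.
function sum3(x)
    acc = 0.0
    @unroll for i in 1:3
        acc += x[i]
    end
    return acc
end
```

Binding the index with `let` rather than rewriting the body keeps the macro short and leaves it to the compiler to propagate the constant. Checking with `@macroexpand` shows the three copies of the body.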
@KristofferC Thanks again for your help getting intrinsics working.