JuliaPerf / BenchmarksGame.jl

WIP: Hacky(!) but faster nbody implementations #44

Open smallnamespace opened 4 years ago

smallnamespace commented 4 years ago

Here are a couple of messy implementations that, at least on my laptop with --cpu-target=core2 (the architecture of the actual BenchmarksGame test machine), beat the current ...simd.jl by about 40% and 60% respectively, going by second-run times:

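| impl               | 1st run | 2nd run | speedup vs. simd |
|--------------------|---------|---------|------------------|
| simd               | 5.95s   | 5.75s   | -                |
| unsafe_simd        | 7.6s    | 4.15s   | 40%              |
| unsafe_simd_unroll | 7.3s    | 3.6s    | 60%              |
| Rust #7            | -       | 3.1s    | 85%              |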

Would like some feedback before cleaning this up further (and getting too deep in this rabbit hole 🙂), in particular whether this is helpful for showing off the language, since the code is getting far from idiomatic Julia.

A few caveats:

Btw, the ...unroll.jl file has a hacky macro that fully unrolls some of the inner loops. This mimics how Rust #7 achieves its speedup: rustc automatically unrolls the (outer) for loops inside advance, e.g. rsqrt appears 5 times in the disassembly.
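For readers unfamiliar with the idea, here is a minimal sketch of what a full-unroll macro can look like. It is not the macro from this PR: the name `@unroll` and the helper `sum_of_squares` are hypothetical, and the sketch assumes the loop range is a literal such as `1:4` that can be evaluated at macro-expansion time.

```julia
# Hypothetical illustration, not the macro from ...unroll.jl.
# Fully unrolls a `for` loop whose range is a literal known at expansion time.
macro unroll(loop)
    (loop isa Expr && loop.head === :for) || error("@unroll expects a for loop")
    iterspec, body = loop.args
    # `for i in 1:4` and `for i = 1:4` both parse to Expr(:(=), :i, :(1:4))
    var, range = iterspec.args
    # Evaluate the literal range while the macro is being expanded
    values = Core.eval(__module__, range)
    # Emit one copy of the body per iteration, binding the loop variable with `let`
    copies = [Expr(:let, Expr(:(=), var, v), body) for v in values]
    # A trailing `nothing` makes the block evaluate to nothing, like a loop does
    return esc(Expr(:block, copies..., nothing))
end

# The macro is used inside a function so that `acc` is an ordinary local
# that each unrolled `let` block updates in place.
function sum_of_squares()
    acc = 0
    @unroll for i in 1:4
        acc += i * i
    end
    return acc
end
```

Calling `@macroexpand` on the loop shows four `let` blocks and no loop. Whether this actually beats what the compiler already does depends on the loop body and the CPU target, so it needs measuring.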

I didn't go all the way to unrolling the stride-2 loop, but could be persuaded to hack something up just to see how much improvement can be found.

@KristofferC Thanks again for your help getting intrinsics working.

non-Jedi commented 4 years ago

Very cool. I think we can probably polish it to be more idiomatic Julia over time. I've been trying to get this one faster using SIMD intrinsics on my machine for a while now and mostly failing.

My preference would be to just use NTuples with unsafe_store! for now. I think getting AOT compilation working consistently on the benchmarks-game machine might be a ways off. And it might not be accepted at all depending on whether the maintainer wants to deal with the headache.
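One possible reading of "NTuples with unsafe_store!" is sketched below. It is an assumption about the approach, not code from this PR: the alias `Vec4` and the helper `store_vec!` are hypothetical names, and it uses a plain `NTuple{4,Float64}` for simplicity.

```julia
# Hypothetical illustration: write a 4-element tuple into a Float64 buffer
# with a single unsafe_store! call.
const Vec4 = NTuple{4,Float64}

function store_vec!(dest::Vector{Float64}, v::Vec4, offset::Int)
    # unsafe_store! does no bounds checking, so check the range by hand
    1 <= offset <= length(dest) - 3 || throw(BoundsError(dest, offset:offset+3))
    # Keep `dest` rooted while its raw pointer is in use
    GC.@preserve dest begin
        p = convert(Ptr{Vec4}, pointer(dest, offset))
        unsafe_store!(p, v)   # writes dest[offset], ..., dest[offset+3]
    end
    return dest
end

buf = zeros(8)
store_vec!(buf, (1.0, 2.0, 3.0, 4.0), 5)
```

After the call, `buf[5:8]` holds the four tuple values and `buf[1:4]` is untouched. The manual bounds check matters because an out-of-range `unsafe_store!` corrupts memory silently.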