WIP: Hacky(!) but faster nbody implementations

Here are a couple of messy implementations that, at least on my laptop with --target-cpu=core2 (the architecture of the actual BenchmarksGame test machine), beat the current ...simd.jl by about 40% and 60% respectively, going by the 2nd-run timings:

impl	1st run	2nd run	speedup vs. simd (2nd run)
simd	5.95s	5.75s	-
unsafe_simd	7.6s	4.15s	40%
unsafe_simd_unroll	7.3s	3.6s	60%
Rust#7	-	3.1s	85%

I'd like some feedback before cleaning this up further (and getting too deep into this rabbit hole 🙂), in particular on whether this is helpful for showing off the language, since the code is getting far from idiomatic Julia.

A few caveats:

The code is not idiomatic Julia because it ports gcc #4 and Rust #7, which liberally use SIMD intrinsics, lay out memory by hand, etc.
Compilation time is much longer (probably due to StaticArrays), which is why the 1st run is slower, so real gains will only show up once Julia AOT (#35) is available; alternatively, I can switch to NTuples with unsafe stores.
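
To make the NTuple alternative a bit more concrete, here is a minimal sketch of what "NTuples with unsafe stores" could look like. It is not code from this PR: `store4!` and `load4` are hypothetical helper names, and the flat `Vector{Float64}` buffer is an assumed layout chosen purely for illustration.

```julia
# Hypothetical helpers (not from the PR): move a 4-wide tuple in and out of a
# flat Float64 buffer through raw pointers, skipping bounds checks.
# The caller must guarantee that elements i..i+3 exist in `buf`.

function store4!(buf::Vector{Float64}, v::NTuple{4,Float64}, i::Integer)
    # GC.@preserve keeps `buf` rooted while we hold a raw pointer into it.
    GC.@preserve buf begin
        p = Ptr{NTuple{4,Float64}}(pointer(buf, i))
        unsafe_store!(p, v)
    end
    return buf
end

function load4(buf::Vector{Float64}, i::Integer)
    return GC.@preserve buf unsafe_load(Ptr{NTuple{4,Float64}}(pointer(buf, i)))
end

buf = zeros(8)
store4!(buf, (1.0, 2.0, 3.0, 4.0), 3)   # writes buf[3:6]
v = load4(buf, 3)                        # reads buf[3:6] back as a tuple
```

The trade-off is the usual one for `unsafe_*` calls: nothing checks that `i` is in range, so an off-by-one silently corrupts memory instead of throwing.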

Btw, the ...unroll.jl file has a hacky macro that fully unrolls some of the inner loops. This mimics how Rust #7 achieves its speedup: rustc is smart enough to automatically unroll the (outer) for loops inside advance, e.g. rsqrt appears 5 times in the disassembly.

I didn't go all the way to unrolling the stride-2 loop, but could be persuaded to hack something up just to see how much improvement can be found.
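
For readers who haven't opened the file, the general idea of macro-based full unrolling can be sketched as below. This is an illustrative stand-in, not the macro from ...unroll.jl: the name `@unroll` and the restriction to literal `lo:hi` ranges are assumptions made to keep the example short.

```julia
# Hypothetical sketch: replace `for i in lo:hi ... end` (integer-literal bounds)
# with one copy of the loop body per iteration at macro-expansion time.
macro unroll(loop)
    (loop isa Expr && loop.head === :for) || error("@unroll expects a for loop")
    itervar, range = loop.args[1].args      # `i` and `lo:hi`
    body = loop.args[2]
    (range isa Expr && range.head === :call && range.args[1] === :(:)) ||
        error("@unroll expects a literal range lo:hi")
    lo, hi = range.args[2], range.args[3]
    (lo isa Integer && hi isa Integer) ||
        error("@unroll needs integer literal bounds")
    out = Expr(:block)
    for i in lo:hi
        # `let i = <literal>` binds the loop variable for this copy of the body.
        push!(out.args, Expr(:let, Expr(:(=), itervar, i), body))
    end
    return esc(out)
end

# Usage: the loop below expands to four straight-line additions.
function sum4(x)
    s = zero(eltype(x))
    @unroll for i in 1:4
        s += x[i]
    end
    return s
end
```

Because the bounds must be literals, the trip count is known when the macro runs, so no loop remains in the lowered code. Whether that actually beats LLVM's own unrolling depends on the loop body and has to be measured.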

@KristofferC Thanks again for your help getting intrinsics working.

JuliaPerf / BenchmarksGame.jl

WIP: Hacky(!) but faster nbody implementations #44