Open aconsroe-hpe opened 2 years ago
Hi, the nbody.chpl link is not working.
Fixed the link @rivalq , thanks for pointing it out
Just a thought: in `advance()` there are 10 combinations, so we could use multithreading here, since x, y, z are independent. I have no idea which will be faster, multithreading or unrolling the loops.
@rivalq: Our assumption is that multithreading/multitasking is too heavyweight for this benchmark, since the number of bodies and the computation per body are so small. Notably, none of the top entries seem to be using multithreading, based on their CPU loads. Instead, this benchmark seems to be all about best use of vectorization, where our desired approach would be to not use direct vector instructions/intrinsics like many of the very top entries, but rather to (hopefully) communicate the computation's intent clearly enough to the back-end compiler(s) that it vectorizes on our behalf. Using this technique, we probably will not become one of the very best entries, but that's not a problem. There are some "clean" / non-heroic entries that sit around ~2x the best, where we're currently more like ~3x. The idea about unrolling was essentially in this spirit: will the vectorizing compiler do a better job if the loops are fully unrolled by converting them into param loops?
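To make the param-loop idea concrete, here is a rough sketch. The record layout, field names, and update math are assumptions loosely modeled on nbody.chpl, not the actual submission:

```chapel
// Sketch only: with numBodies as a param, `for param` loops are fully
// unrolled at compile time, so each (i, j) interaction becomes
// straight-line code the back-end compiler can try to vectorize.
param numBodies = 5;

record body {
  var pos, vel: 3*real;
  var mass: real;
}

var bodies: [0..<numBodies] body;

proc advance(dt: real) {
  for param i in 0..<numBodies-1 {
    for param j in i+1..<numBodies {
      const d = bodies[i].pos - bodies[j].pos;   // element-wise tuple math
      const distSq = d(0)*d(0) + d(1)*d(1) + d(2)*d(2);
      const mag = dt / (distSq * sqrt(distSq));
      bodies[i].vel -= d * (bodies[j].mass * mag);
      bodies[j].vel += d * (bodies[i].mass * mag);
    }
  }
  for param i in 0..<numBodies do
    bodies[i].pos += dt * bodies[i].vel;
}
```

Since both loop bounds are params, the compiler emits the 10 pair updates as straight-line code with no loop or branch overhead; whether the back-end then vectorizes them better than the rolled loops is the open question.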
Currently Chapel's n-body submission is the number 1 entry. Closing this as completed.
@jabraham17: I think we should potentially keep this open (though potentially with some updates to the OP since it was created). Even though the Chapel #3 version is the fastest of the versions that don't fall into the hand-written vector instructions / "unsafe" category, it's still 2x slower than the fastest C version there, so arguably there's still more we could do to close that gap.
Also, I slacked @mppf last week to point out that, on our single-locale performance testing, at least, his "no cube" version is now outperforming our submitted #3 version, and seemed to take a leap forward as a result of Jade's change to that configuration's CHPL_TARGET_CPU setting. There's no guarantee that this improvement will translate to the CLBG system itself, but it seems worth submitting based on that since better performance was the motivation for writing it that way.
Today, Chapel is about 3x slower than the top entry in the Computer Language Benchmarks Game's nbody problem.
Your mission, if you choose to accept it, is to try to speed it up.
I suggest using nbody.chpl as a starting point and from there, a few interesting directions to head in are:
1) Unroll the outer, inner, or both loops in `advance()` (and possibly `energy()`, though it's only called twice)
   - `for param i in 0..<numBodies`: note that `numBodies` has to be made a `param` to be used in a `for param` loop
   - Could we compute the `(i, j)` pairs (since `numBodies` is known for this benchmark) as a `param`? And then `for param` over those?

2) Rewrite the "object centric" implementation into a "data centric" one
   - Today each body is a `record body` and the bodies are stored in an array-of-structs fashion

3) Try to get some better vectorization
   - Today the `3*real` operations get vectorized, but do not utilize a full 256-bit vector instruction (we get a 2x64 and a 1x64; ideally we would get a 3x64, but the hardware needs to see a 4x64)
   - What about `4*real`? Is there a nicer user interface to write `3*real` but really get `4*real` without having to insert 0's everywhere?
   - Some of the fastest entries use an approximate reciprocal square root (`vrsqrt`) on 4x64. If you look closely though, they have to do so by first doing a `vrsqrt` on 4x32, because that is what the hardware supports, and then fix up the precision to get a better 64-bit result. Could we do something similar with `1 / sqrt(x:real(32))`?
   - What about `-ffast-math`? (or `--no-ieee-float`)
Some combination of 2 & 3 may yield compound improvements, because you could lay out all velocities as contiguous `4*real`.
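Combining ideas 2 & 3 might look roughly like this. All names and the padded layout are assumptions for illustration, not the actual submission:

```chapel
// Sketch only: a "data centric" layout where each quantity lives in
// its own array, and each 3-component vector is padded to 4*real so
// a position/velocity fills a full 4x64 (256-bit) vector register.
// The 4th component starts at 0.0 and, since d(3) is always 0.0,
// stays 0.0 through every update.
param numBodies = 5;

var pos, vel: [0..<numBodies] 4*real;
var mass: [0..<numBodies] real;

proc advance(dt: real) {
  for i in 0..<numBodies-1 {
    for j in i+1..<numBodies {
      const d = pos[i] - pos[j];   // one 4-wide tuple op per pair
      const distSq = d(0)*d(0) + d(1)*d(1) + d(2)*d(2);
      const mag = dt / (distSq * sqrt(distSq));
      vel[i] -= d * (mass[j] * mag);
      vel[j] += d * (mass[i] * mag);
    }
  }
  for i in 0..<numBodies do
    pos[i] += dt * vel[i];
}
```

The hope would be that the back-end compiler turns each 4-wide tuple operation into a single 4x64 vector instruction; whether it actually does is what would need to be measured.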
Our primary interest is increased performance but without impacting the cleanliness of the code. That being said, a version that was as fast as possible is also fair game.