Open aconsroe-hpe opened 2 years ago
Hi, the nbody.chpl link is not working.
Fixed the link @rivalq , thanks for pointing it out
Just a thought: in `advance()` there are 10 combinations, so we could use multithreading here, since x, y, z are independent. I have no idea which will be faster, multithreading or unrolling the loops.
@rivalq: Our assumption is that multithreading/multitasking is too heavyweight for this benchmark, since the number of bodies and the computation per body are so small. Notably, none of the top entries seem to be using multithreading, based on their CPU loads. Instead, this benchmark seems to be all about best use of vectorization, where our desired approach would be to not use direct vector instructions/intrinsics like many of the very top entries, but rather to (hopefully) communicate the computation's intent clearly enough to the back-end compiler(s) that it vectorizes on our behalf. Using this technique, we probably will not become one of the very best entries, but that's not a problem. There are some "clean" / non-heroic entries that sit around ~2x the best, where we're currently more like ~3x. The idea about unrolling was essentially in this spirit: will the vectorizing compiler do a better job if the loops are fully unrolled by converting them into param loops?
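To make the param-loop idea concrete, here is a rough sketch. The record layout, field names, and update math are assumptions loosely modeled on nbody.chpl, not the actual submission:

```chapel
// Sketch only: with numBodies as a param, `for param` loops are fully
// unrolled at compile time, so each (i, j) interaction becomes
// straight-line code the back-end compiler can try to vectorize.
param numBodies = 5;

record body {
  var pos, vel: 3*real;
  var mass: real;
}

var bodies: [0..<numBodies] body;

proc advance(dt: real) {
  for param i in 0..<numBodies-1 {
    for param j in i+1..<numBodies {
      const d = bodies[i].pos - bodies[j].pos;   // element-wise tuple math
      const distSq = d(0)*d(0) + d(1)*d(1) + d(2)*d(2);
      const mag = dt / (distSq * sqrt(distSq));
      bodies[i].vel -= d * (bodies[j].mass * mag);
      bodies[j].vel += d * (bodies[i].mass * mag);
    }
  }
  for param i in 0..<numBodies do
    bodies[i].pos += dt * bodies[i].vel;
}
```

Since both loop bounds are params, the compiler emits the 10 pair updates as straight-line code with no loop or branch overhead; whether the back-end then vectorizes them better than the rolled loops is the open question.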
Currently Chapel's n-body submission is the number 1 entry. Closing this as completed.
@jabraham17: I think we should potentially keep this open (though potentially with some updates to the OP since it was created). Even though the Chapel #3 version is the fastest of the versions that don't fall into the hand-written vector instructions / "unsafe" category, it's still 2x slower than the fastest C version there, so arguably there's still more we could do to close that gap.
Also, I slacked @mppf last week to point out that, on our single-locale performance testing, at least, his "no cube" version is now outperforming our submitted #3 version, and seemed to take a leap forward as a result of Jade's change to that configuration's CHPL_TARGET_CPU setting. There's no guarantee that this improvement will translate to the CLBG system itself, but it seems worth submitting based on that since better performance was the motivation for writing it that way.
Today, Chapel is about 3x slower than the top entry in the Computer Language Benchmarks Game's nbody problem.
Your mission, if you choose to accept it, is to try to speed it up.
I suggest using nbody.chpl as a starting point and from there, a few interesting directions to head in are:
1) Unroll the outer, inner, or both loops in `advance()` (and possibly `energy()`, though it's only called twice)
   - `for param i in 0..<numBodies`: note that `numBodies` has to be made a `param` to be used in a `for param` loop
   - Could we compute the `(i, j)` pairs (since `numBodies` is known for this benchmark) as a `param`? And then `for param` over those?

2) Rewrite the "object centric" implementation into a "data centric" one
   - Today each body is a `record body` and the bodies are stored in an array-of-structs fashion

3) Try to get some better vectorization
   - Today the `3*real` operations get vectorized, but do not utilize a full 256-bit vector instruction (we get a 2x64 and a 1x64; ideally we would get a 3x64, but the hardware needs to see a 4x64)
   - What about `4*real`? Is there a nicer user interface to write `3*real` but really get `4*real` without having to insert 0's everywhere?
   - Some of the fastest entries use an approximate reciprocal square root (`vrsqrt`) on 4x64. If you look closely though, they have to do so by first doing a `vrsqrt` on 4x32, because that is what the hardware supports, and then fix up the precision to get a better 64-bit result. Could we do something similar with `1 / sqrt(x:real(32))`?
   - What about `-ffast-math`? (or `--no-ieee-float`)
Some combination of 2 & 3 may yield compound improvements, because you could lay out all velocities as contiguous `4*real`.
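Combining ideas 2 & 3 might look roughly like this. All names and the padded layout are assumptions for illustration, not the actual submission:

```chapel
// Sketch only: a "data centric" layout where each quantity lives in
// its own array, and each 3-component vector is padded to 4*real so
// a position/velocity fills a full 4x64 (256-bit) vector register.
// The 4th component starts at 0.0 and, since d(3) is always 0.0,
// stays 0.0 through every update.
param numBodies = 5;

var pos, vel: [0..<numBodies] 4*real;
var mass: [0..<numBodies] real;

proc advance(dt: real) {
  for i in 0..<numBodies-1 {
    for j in i+1..<numBodies {
      const d = pos[i] - pos[j];   // one 4-wide tuple op per pair
      const distSq = d(0)*d(0) + d(1)*d(1) + d(2)*d(2);
      const mag = dt / (distSq * sqrt(distSq));
      vel[i] -= d * (mass[j] * mag);
      vel[j] += d * (mass[i] * mag);
    }
  }
  for i in 0..<numBodies do
    pos[i] += dt * vel[i];
}
```

The hope would be that the back-end compiler turns each 4-wide tuple operation into a single 4x64 vector instruction; whether it actually does is what would need to be measured.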
Our primary interest is increased performance but without impacting the cleanliness of the code. That being said, a version that was as fast as possible is also fair game.