chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Can you make our n-body benchmark faster? #19237

Open aconsroe-hpe opened 2 years ago

aconsroe-hpe commented 2 years ago

Today, Chapel is about 3x slower than the top entry in the Computer Language Benchmarks Game's n-body problem.

Your mission, if you choose to accept it, is to try to speed it up.

I suggest using nbody.chpl as a starting point and from there, a few interesting directions to head in are:

1) Unroll the outer, inner, or both loops in advance() (and possibly energy(), though it's only called twice)

2) Rewrite the "object-centric" implementation into a "data-centric" one

3) Try to get some better vectorization

Some combination of 2 and 3 may yield compound improvements, because you could lay out all of the velocities as contiguous 4*real values; see the sketch below.
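For concreteness, here is a rough, hypothetical sketch of what that combined direction could look like. This is not the benchmark code: the names (`numBodies`, `dt`, `advanceOnce`) and the toy initial conditions are made up for illustration, and it keeps 3*real tuples rather than padding to 4*real, but it shows the structure-of-arrays layout with positions and velocities stored contiguously.

```chapel
// Rough sketch only: names and toy initial conditions are illustrative,
// not the benchmark code. Each field lives in its own contiguous array of
// 3*real tuples (a structure-of-arrays layout) instead of an array of
// per-body records.
use Math;

param numBodies = 5;        // the CLBG problem uses 5 bodies
config const dt = 0.01;

var pos, vel: [0..#numBodies] 3*real;   // positions/velocities stored contiguously
var mass: [0..#numBodies] real;

// toy initial conditions so the sketch runs standalone
for i in 0..#numBodies {
  pos[i] = (i: real, 0.0, 0.0);
  mass[i] = 1.0;
}

proc advanceOnce() {
  for i in 0..#numBodies {
    for j in i+1..#numBodies {
      const d = pos[i] - pos[j];                       // elementwise tuple subtraction
      const distSq = d(0)*d(0) + d(1)*d(1) + d(2)*d(2);
      const mag = dt / (distSq * sqrt(distSq));
      // homogeneous tuple * scalar arithmetic keeps the updates compact
      vel[i] -= d * (mass[j] * mag);
      vel[j] += d * (mass[i] * mag);
    }
    pos[i] += vel[i] * dt;
  }
}

advanceOnce();
writeln(pos);
```

Padding each tuple out to 4*real, as suggested above, is the same idea with one extra component to give the back-end compiler SIMD-friendly alignment.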

Our primary interest is increased performance without impacting the cleanliness of the code. That said, a version that is as fast as possible is also fair game.

rivalq commented 2 years ago

Hi, the nbody.chpl link is not working

aconsroe-hpe commented 2 years ago

Fixed the link, @rivalq, thanks for pointing it out

rivalq commented 2 years ago

Just a thought: in advance() there are 10 combinations, and we could use multithreading here, since x, y, z are independent. I have no idea which will be faster, multithreading or unrolling the loops.

bradcray commented 2 years ago

@rivalq: Our assumption is that multithreading/multitasking is too heavyweight for this benchmark, since the number of bodies and the computation per body are so small. Notably, none of the top entries seem to be using multithreading, based on their CPU loads. Instead, this benchmark seems to be all about making the best use of vectorization. Our desired approach would be not to use direct vector instructions/intrinsics like many of the very top entries, but rather to (hopefully) communicate the computation's intent clearly enough to the back-end compiler(s) that it vectorizes on our behalf. Using this technique, we probably will not become one of the very best entries, but that's not a problem. There seem to be some "clean" / non-heroic entries that are around ~2x the best, where we're currently sitting at more like ~3x. The idea about unrolling was essentially in this spirit: will the vectorizing compiler do a better job if the loops are fully unrolled by converting them into param loops?
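To make the param-loop idea concrete, here's a tiny standalone sketch (with a placeholder computation rather than the real pairwise update): because `numBodies` is a `param`, writing the pair loops with `param` loop variables asks the Chapel compiler to fully unroll all 10 (i, j) pairs before the back-end compiler ever sees them.

```chapel
// Placeholder computation only; the real advance() body would go where
// `acc += i + j` is. The point is the loop structure: with `param` loop
// variables over param bounds, the Chapel compiler fully unrolls the
// 10 (i, j) pairs, handing the back-end a straight-line kernel.
param numBodies = 5;

var acc = 0;
for param i in 0..numBodies-1 do
  for param j in i+1..numBodies-1 do
    acc += i + j;          // compiled as 10 unrolled statements

writeln(acc);              // 40 for numBodies == 5
```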

jabraham17 commented 2 months ago

Currently, Chapel's n-body submission is the #1 entry. Closing this as completed.

bradcray commented 2 months ago

@jabraham17: I think we should potentially keep this open (though perhaps with some updates to the OP since it was created). Even though the Chapel #3 version is the fastest of the versions that don't fall into the hand-written vector instructions / "unsafe" category, it's still 2x slower than the fastest C version there, so arguably there's still more we could do to close that gap.

bradcray commented 2 months ago

Also, I slacked @mppf last week to point out that, on our single-locale performance testing at least, his "no cube" version is now outperforming our submitted #3 version, and it seemed to take a leap forward as a result of Jade's change to that configuration's CHPL_TARGET_CPU setting. There's no guarantee that this improvement will translate to the CLBG system itself, but it seems worth submitting on that basis, since better performance was the motivation for writing it that way.