In nbody some toolchains hoisting bodies_energy out of main loop

MarkHillHuawei commented 2 years ago

This issue is related to "Compiler optimization deletes the bodies_energy function call in the nbody benchmark."

https://github.com/embench/embench-iot/issues/16

Issue 16 was resolved and prevented the compiler eliminating the bodies_energy function from the main workload loop. However the fix does not prevent toolchains hoisting the function out of the loop and executing it only once instead of 100 times. When the compiler makes this optimisation the workload collapses to 100 floating point adds and results in a score 1000* that of an m4.

An example toolchain is armclang 6.14.1 (example options: --target=arm-arm-none-eabi -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 -ffp-mode=fast -Os)

A possible fix is it create a pointer to the bodies_energy function, initialise it in initialise_benchmark() and call via it in the main loop.

jeremybennett commented 2 years ago

@MarkHillHuawei Thanks for this. We are revising the list of benchmarks for Embench 2.0. Options are to fix nbody as you suggest, or to drop it altogether. Do you think this is a good benchmark to keep (when fixed) in the suite?

MarkHillHuawei commented 2 years ago

I think my preference would be to drop this one, I'm sure we can find fp benchmarks that feel a bit more relevant. More generally I think it would be better to separate out the floating point benchmarks from the integer ones and perhaps report 2 to 3 scores, an int only, combined and may be fp only. This is because many embedded cores have no FPU so performance is strongly influenced by the quality of the software FP library. This is a useful thing to be able to benchmark but better done separately to an assessment of integer pipeline performance.

BrandonFrei commented 1 year ago

Is there any reason that there is no function to update the xyz coordinates? As of now, after the first call to offset_momentum, the velocities don't change for any of the particles in any direction, meaning a compiler could hoist this out of the loop as well. Having a more complete version of the N-Body problem could be a solution to not having the compiler optimize out the inner main workload loop.

Instead of having the energy loop run 100 times, it may be more beneficial to instead have more bodies. This increases the opportunity for vectorization.

Slightly unrelated: There is a fill variable defined in the body struct, but its value doesn't seem to be used.

Roger-Shepherd commented 1 year ago

Brandon,

N-body is due to be dropped from the next embench version. There is a view that an entirely floating point benchmark such as this is inappropriate as part of a deeply embedded benchmark suite, and there is the problem you identify. It seems that the Apple Mac development tools do "hoist this out of the loop” raising various issues such as:

Timing methodology

We attempt to run each benchmark for an amount of time which (i) short enough that we can get results reasonably quickly, (ii) is longer enough that quantisation of the clock used to measure run-time isn’t an issue.

For (ii) there is an issue about how to prolong execution - we choose to increase the number of iterations, this is sort of OK (ignoring the compiler issue) and, importantly, means the same work is done, but for longer. (This could be an issue with your suggestion of increasing run time by increasing number of bodies - although the behaviour of this benchmark is so simple it’s probably OK in this this instance).

The Mac results also show the problem that occurs when some aspect of one processor’s performance characteristics are very different to another’s - in this case hardware floating point. I see speed-ups that vary between 2.35x (crc32) and 3709.38x (nbody). The run-to-run variance of crc32 is OK but nobody varies wildly due to clock quantisation. If I run body longer enough to get steady results, the run time for the other benchmarks becomes very long.

This could be overcome by using a different scaling factor per benchmark (rather than per architecture). However, even if this were thought to be sensible, it would complicate the build/run system, and would take a significant amount of work to implement.

How to present results for processors with very different performance characteristics.

The group really hasn’t addressed this because there has been an assumption that the processors being benchmarked are similar, and that different characteristics would be assessed by multiple benchmark suites - e.g. “deeply embedded”, “dsp”.

Coping with compiler optimisations

Of course, there is a questions of what one is trying to compare - processors or process-tool pairs. The working assumption in embench is that we are comparing processors, and we will deal with improving compilers by issuing new benchmark suites as, over time, compilers "break” our benchmarks.

Roger

On 13 Mar 2023, at 22:10, Brandon Freiberger @.***> wrote:

Is there any reason that there is no function to update the xyz coordinates? As of now, after the first call to offset_momentum, the velocities don't change for any of the particles in any direction, meaning a compiler could hoist this out of the loop as well. Having a more complete version of the N-Body problem could be a solution to not having the compiler optimize out the inner main workload loop.

Instead of having the energy loop run 100 times, it may be more beneficial to instead have more bodies. This increases the opportunity for vectorization.

Slightly unrelated: There is a fill variable defined in the body struct, but its value doesn't seem to be used.

— Reply to this email directly, view it on GitHub https://github.com/embench/embench-iot/issues/160#issuecomment-1467037216, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC24LZIC7EEBCO6CM6QPGPLW36LN3ANCNFSM5VYUEQBQ. You are receiving this because you are subscribed to this thread.

-- Roger Shepherd @.***

BrandonFrei commented 1 year ago

Hi Roger,

Thanks for your reply. I appreciate you breaking down nbody and explaining the reasoning behind it being removed - I had assumed that the benchmarks were trying to compare process-tool pairs, not just processors.

Take care,

Brandon

On Thu, Mar 16, 2023 at 8:05 AM Roger Shepherd @.***> wrote:

Brandon,

N-body is due to be dropped from the next embench version. There is a view that an entirely floating point benchmark such as this is inappropriate as part of a deeply embedded benchmark suite, and there is the problem you identify. It seems that the Apple Mac development tools do "hoist this out of the loop” raising various issues such as:

Timing methodology

We attempt to run each benchmark for an amount of time which (i) short enough that we can get results reasonably quickly, (ii) is longer enough that quantisation of the clock used to measure run-time isn’t an issue.

For (ii) there is an issue about how to prolong execution - we choose to increase the number of iterations, this is sort of OK (ignoring the compiler issue) and, importantly, means the same work is done, but for longer. (This could be an issue with your suggestion of increasing run time by increasing number of bodies - although the behaviour of this benchmark is so simple it’s probably OK in this this instance).

The Mac results also show the problem that occurs when some aspect of one processor’s performance characteristics are very different to another’s - in this case hardware floating point. I see speed-ups that vary between 2.35x (crc32) and 3709.38x (nbody). The run-to-run variance of crc32 is OK but nobody varies wildly due to clock quantisation. If I run body longer enough to get steady results, the run time for the other benchmarks becomes very long.

This could be overcome by using a different scaling factor per benchmark (rather than per architecture). However, even if this were thought to be sensible, it would complicate the build/run system, and would take a significant amount of work to implement.

How to present results for processors with very different performance characteristics.

The group really hasn’t addressed this because there has been an assumption that the processors being benchmarked are similar, and that different characteristics would be assessed by multiple benchmark suites - e.g. “deeply embedded”, “dsp”.

Coping with compiler optimisations

Of course, there is a questions of what one is trying to compare - processors or process-tool pairs. The working assumption in embench is that we are comparing processors, and we will deal with improving compilers by issuing new benchmark suites as, over time, compilers "break” our benchmarks.

Roger

On 13 Mar 2023, at 22:10, Brandon Freiberger @.***> wrote:

Is there any reason that there is no function to update the xyz coordinates? As of now, after the first call to offset_momentum, the velocities don't change for any of the particles in any direction, meaning a compiler could hoist this out of the loop as well. Having a more complete version of the N-Body problem could be a solution to not having the compiler optimize out the inner main workload loop.

Instead of having the energy loop run 100 times, it may be more beneficial to instead have more bodies. This increases the opportunity for vectorization.

Slightly unrelated: There is a fill variable defined in the body struct, but its value doesn't seem to be used.

— Reply to this email directly, view it on GitHub < https://github.com/embench/embench-iot/issues/160#issuecomment-1467037216>, or unsubscribe < https://github.com/notifications/unsubscribe-auth/AC24LZIC7EEBCO6CM6QPGPLW36LN3ANCNFSM5VYUEQBQ . You are receiving this because you are subscribed to this thread.

-- Roger Shepherd @.***

— Reply to this email directly, view it on GitHub https://github.com/embench/embench-iot/issues/160#issuecomment-1471832376, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGCZFVTWF4COESZYXFR7M6TW4L6XRANCNFSM5VYUEQBQ . You are receiving this because you commented.Message ID: @.***>

embench / embench-iot

In nbody some toolchains hoisting bodies_energy out of main loop #160