drujensen / fib

Performance Benchmark of top Github languages

Suggestion: align fib() function to DSB boundary (64 byte) #129

Open Kogia-sima opened 3 years ago

Kogia-sima commented 3 years ago

It seems that some benchmarks for statically compiled languages (C, Rust, Fortran, etc.) depend heavily on where the instructions of the fib() function are placed. On modern processors, which have 64-byte DSB boundaries, a small loop or recursive function may fit within a single 64-byte block of the μop cache, but whether it does depends on the code alignment.

Here are my experiments for the C benchmark:

| Memory address | Total execution time [s] |
|----------------|--------------------------|
| 0x1880         | 11.208                   |
| 0x1890         | 11.323                   |
| 0x18a0         | 13.320                   |
| 0x18b0         | 10.769                   |

This alignment issue causes different benchmark results on different platforms, compiler versions, and compiler options (e.g. https://github.com/drujensen/fib/issues/28), and even depending on which linker you use. If you stick with the fib() benchmark, you should specify the function alignment explicitly to get consistent results.
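As a minimal sketch of what explicit alignment could look like in C, assuming GCC or Clang on x86-64: the `aligned` function attribute is a compiler extension rather than standard C, and the fib() body below is illustrative and not necessarily identical to the one in this repository. `FIB_ALIGN` and `fib_entry_offset()` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: the 64-byte boundary discussed in this issue. */
#define FIB_ALIGN 64

/* GCC/Clang extension: request that the first instruction of fib()
 * be placed on a FIB_ALIGN-byte boundary. */
__attribute__((aligned(FIB_ALIGN)))
uint64_t fib(uint64_t n) {
    if (n <= 1) return n;
    return fib(n - 1) + fib(n - 2);
}

/* Offset of fib()'s entry point within its 64-byte block; 0 means aligned.
 * Converting a function pointer to an integer is implementation-defined,
 * but works as expected with GCC and Clang on x86-64. */
uintptr_t fib_entry_offset(void) {
    return (uintptr_t)&fib % FIB_ALIGN;
}

/* Aborts (unless NDEBUG is defined) if the requested alignment
 * was not honored by the toolchain. */
void check_fib_alignment(void) {
    assert(fib_entry_offset() == 0);
}
```

Calling check_fib_alignment() once at the start of main() would confirm that the toolchain honored the request. An alternative that leaves the source untouched is the -falign-functions=64 compiler option, which GCC and Clang both accept and which applies to every function in the translation unit.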

drujensen-happymoney commented 3 years ago

Hi @Kogia-sima, interesting stuff! The original goal was to show speed differences between Crystal and Ruby, but it has since grown to include more than just the top 10 languages.

I will look closer into the code alignment issue. Someone mentioned it before but didn't have a suggestion on how to address it. I think there are other issues with this benchmark when trying to compare languages like C vs Rust and I probably should add a disclaimer.

Kogia-sima commented 3 years ago

Even though other factors also affect performance, I bet code alignment is the dominant one here. For example, I see that Rust 1.42.0 and 1.50.0 produce different results with 16-byte alignment, but almost the same results with 64-byte alignment.

In the case of Rust, there is actually one more factor that affects performance: LLVM inserts nops before loops to avoid alignment issues on some processors. With this padding, the fib() function exceeds 64 bytes, which may cause DSB misses (not always, but in some situations). You can avoid this behavior by passing -C llvm-args=-x86-experimental-pref-loop-alignment=0 to rustc. With this flag specified, I see that C, C++, and Rust all show the same performance.

> The original goal was to show speed differences between Crystal and Ruby

I understand your goal, so the best solution would be to add a proper disclaimer to the README.

Kogia-sima commented 3 years ago

Here is another experiment showing that 64-byte alignment produces consistent results.

| Memory address | Total execution time [s] |
|----------------|--------------------------|
| 0x1a00         | 11.205                   |
| 0x19c0         | 11.196                   |
| 0x1980         | 11.232                   |
| 0x1940         | 11.204                   |
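To make the difference between the two sets of measurements in this thread easier to see, here is a small sketch that computes where each reported address falls within its 64-byte block. The addresses are copied from the two tables; `DSB_BLOCK`, `offset_in_block()`, `count_aligned()`, and `check_tables()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Block size taken from the discussion in this thread. */
#define DSB_BLOCK 64u

/* Addresses reported in the first and second experiments. */
static const uintptr_t first_run[]  = {0x1880, 0x1890, 0x18a0, 0x18b0};
static const uintptr_t second_run[] = {0x1a00, 0x19c0, 0x1980, 0x1940};

/* Offset of an address within its 64-byte block; 0 means 64-byte aligned. */
static unsigned offset_in_block(uintptr_t addr) {
    return (unsigned)(addr % DSB_BLOCK);
}

/* Counts how many addresses sit exactly on a 64-byte boundary. */
static size_t count_aligned(const uintptr_t *addrs, size_t n) {
    size_t aligned = 0;
    for (size_t i = 0; i < n; i++) {
        if (offset_in_block(addrs[i]) == 0) {
            aligned++;
        }
    }
    return aligned;
}

/* First experiment: offsets 0, 16, 32, 48, so only one address is aligned.
 * Second experiment: all four addresses are multiples of 64. */
void check_tables(void) {
    assert(count_aligned(first_run, 4) == 1);
    assert(count_aligned(second_run, 4) == 4);
}
```

In the first experiment the four placements sit at offsets 0, 16, 32, and 48 within a 64-byte block, while in the second experiment every placement starts at offset 0, which is where the timings agree to within a few hundredths of a second.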