Benchmarking - Githubissues

johnhringiv commented 6 years ago

I have a machine I want to LTO and was wondering if anyone would be interested in some before and after benchmarks. If so what would be useful to run?

InBetweenNames commented 6 years ago

Yeah I've seen interest in having benchmarks with a fully LTO system. Unfortunately, no one has suggested any benchmarks to actually run. I was looking into "system responsiveness" benchmarks a bit, but found no existing suites.

asaparov commented 6 years ago

I would also be really interested in general system benchmarks. I would totally try this, but I'm not sure if it's worth the time investment, so some benchmarks would be really helpful. Even general system benchmarks would be useful, like those used by Phoronix (e.g. here).

The tests should be available in the Phoronix Test Suite.

asaparov commented 6 years ago

I was curious so I decided to run some of the benchmarks: https://openbenchmarking.org/result/1807107-AR-MERGE406580

I only messed around with O3, Graphite, LTO, and some PGO for gcc+python. I didn't try more aggressive optimizations. I used gcc version 7.3.0.

Explanation of configuration names:

nolto_O2: Everything compiled with -O2, and no LTO, Graphite, or PGO. Benchmarks compiled with default flags.
lto_O3_graphite_kernel_nolto_O2: world compiled with -O3, LTO, Graphite, and PGO for gcc. Python not compiled with PGO. Kernel compiled with -O2, and no LTO, Graphite, or PGO. Benchmarks compiled with default flags.
lto_O3_graphite_kernel_nolto_O2_test_optimized: world and benchmarks compiled with -O3, LTO, Graphite, and PGO for gcc. Python not compiled with PGO. Kernel compiled with -O2, and no LTO, Graphite, or PGO.
lto_O3_graphite_kernel_nolto_O3: world compiled with -O3, LTO, Graphite, and PGO for gcc. Python not compiled with PGO. Kernel compiled with -O3 -march=native. Benchmarks compiled with default flags.
lto_O3_graphite_kernel_nolto_O3_test_optimized: world and benchmarks compiled with -O3, LTO, Graphite, and PGO for gcc. Python not compiled with PGO. Kernel compiled with -O3 -march=native.

For PyBench, we have two more configurations:

lto_O3_graphite_kernel_nolto_O3_python_pgo: world compiled with -O3, LTO, Graphite, and PGO for gcc and python. Kernel compiled with -O3 -march=native. Benchmarks compiled with default flags.
lto_O3_graphite_kernel_nolto_O3_python_pgo_test_optimized: world and benchmarks compiled with -O3, LTO, Graphite, and PGO for gcc and python. Kernel compiled with -O3 -march=native.

Some analysis of the results:

6 of the 14 benchmarks show no difference in performance.
compilebench, Timed Linux Kernel Compilation, and Apache benefit from further optimizations.
PyBench very slightly benefits from further optimizations, and it benefits considerably from python PGO.
ParaView is sometimes hurt by further optimizations.
Himeno and Stockfish benefit when the benchmark itself is further optimized, but there's no real change if only the system is further optimized.
ebizzy is very slightly hurt when the benchmark itself is further optimized, but there's no real change if only the system is further optimized.

So all in all, more benchmarks benefit from O3+LTO+Graphite than are hurt by it (4 vs 1). That difference becomes 6 vs 2 if you also include results where the benchmark itself was further optimized, which, for Gentoo, is a fairer comparison since almost everything is built from source anyway.
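The tally can be reproduced with a few lines of Python. This is only a bookkeeping sketch: the category labels are illustrative shorthand, and the six unchanged benchmarks are not named in the summary, so they appear only as a count.

```python
# Benchmarks grouped by how they responded, as summarised in the analysis list.
# The category names are illustrative labels, not Phoronix Test Suite terms.
results = {
    "system_helps": ["compilebench", "Timed Linux Kernel Compilation", "Apache", "PyBench"],
    "system_hurts": ["ParaView"],
    "benchmark_flags_help": ["Himeno", "Stockfish"],
    "benchmark_flags_hurt": ["ebizzy"],
}
unchanged = 6  # benchmarks showing no difference (not named individually)

# Counts when only the system (world) is built with the aggressive flags.
system_only = (len(results["system_helps"]), len(results["system_hurts"]))

# Counts when the benchmarks themselves are also built with those flags.
with_benchmark_flags = (
    system_only[0] + len(results["benchmark_flags_help"]),
    system_only[1] + len(results["benchmark_flags_hurt"]),
)

total = unchanged + sum(len(names) for names in results.values())

print(f"system only: {system_only[0]} vs {system_only[1]}")
print(f"including optimized benchmarks: {with_benchmark_flags[0]} vs {with_benchmark_flags[1]}")
print(f"benchmarks accounted for: {total}")
```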

I would be interested in trying more aggressive flags; I'm open to suggestions.

In my experience, rebuilding world with O3+LTO+Graphite was actually pretty easy, thanks to this overlay. But having to recompile world with every gcc update doesn't sound very fun. Thankfully, gcc updates are few and far between in Gentoo unstable (which is what I use). Compiling the kernel with more aggressive optimizations was more difficult. I ended up replacing all instances of -O2 in the Makefiles with the desired flags (does anyone know how to do this more easily?).
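For illustration, the manual replacement of -O2 described here could be sketched roughly as follows. This is a hedged example and not the procedure actually used: the helper name `replace_o2` and the sample Makefile lines are made up, and the pattern only matches `-O2` when it stands alone between whitespace, so occurrences inside parentheses or longer tokens are deliberately left untouched.

```python
import re

def replace_o2(makefile_text: str, new_flags: str) -> str:
    """Replace standalone -O2 tokens with new_flags.

    (?<!\\S) and (?!\\S) require whitespace (or the start/end of the text)
    on both sides, so a token such as -O2x is not modified.
    """
    # A function is used as the replacement so that new_flags is inserted
    # literally, with no backslash or group-reference processing.
    return re.sub(r"(?<!\S)-O2(?!\S)", lambda match: new_flags, makefile_text)

# Hypothetical Makefile fragment, for demonstration only.
sample = "KBUILD_CFLAGS += -O2\nKBUILD_CFLAGS += -Wall -O2 -pipe\n"
print(replace_o2(sample, "-O3 -march=native"))
```

Applying something like this to a real kernel tree would mean reading and rewriting each Makefile, which is exactly the tedium the question is about.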

asaparov commented 6 years ago

Actually, thinking about the results a bit more, I realize that the benefit seen in compilebench and Timed Linux Kernel Compilation could have come from PGO in gcc. So I re-ran those two tests using gcc without PGO: https://openbenchmarking.org/result/1807119-AR-1807110AR51

Surprisingly, it turns out that compilebench runs even faster after disabling PGO in gcc. Less surprisingly, Timed Linux Kernel Compilation is about the same speed as -O2 without LTO or Graphite.

InBetweenNames commented 6 years ago

Wow, this is excellent! So we're at least not seeing any real detrimental effects from having aggressive compiler optimizations on, and in some cases we're seeing a noticeable improvement. I could run the tests on my system, but I don't have an -O2 baseline to compare against at the moment. Thanks for taking the time to do this!

InBetweenNames commented 6 years ago

@asaparov by the way, forgot to mention: if you want to inject your own build flags into the kernel, you can use the KCFLAGS variable:

make -j12 KCFLAGS="-O3 -fgraphite-identity -ftree-loop-distribution -floop-nest-optimize -fipa-pta"

I was able to build my kernel with LTO, but unfortunately the proprietary modules I require don't seem to play nice with LTO, so I'm stuck with the above flags.
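If the build is driven from a script, the same invocation can be assembled programmatically. A minimal sketch, assuming a helper named `kernel_make_command` (hypothetical) and the flag list from the command above; it only builds and prints the command and does not run make.

```python
import shlex

def kernel_make_command(kcflags, jobs=12):
    """Build the argv for a kernel build that passes extra flags via KCFLAGS."""
    return ["make", f"-j{jobs}", "KCFLAGS=" + " ".join(kcflags)]

flags = [
    "-O3",
    "-fgraphite-identity",
    "-ftree-loop-distribution",
    "-floop-nest-optimize",
    "-fipa-pta",
]
cmd = kernel_make_command(flags)

# shlex.join quotes the KCFLAGS argument so the line can be pasted into a shell.
print(shlex.join(cmd))

# To actually run it from the kernel source tree (not done here):
#   import subprocess
#   subprocess.run(cmd, check=True)
```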

asaparov commented 6 years ago

Oh that works well, thanks! I had tried CFLAGS_KERNEL and KBUILD_CFLAGS earlier and they weren't working.

sjnewbury commented 5 years ago

I try to keep my nbench results moving in the right direction. I should have kept them for reference. Right now, with my current set of flags (-autopar) on a mostly idle system, I get the results below.

(Bear in mind it does have quite slow memory, because I'm on a budget and re-used my existing DDR3 RAM when I built this machine.)

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1762.7  :      45.21  :      14.85
STRING SORT         :          761.78  :     340.38  :      52.69
BITFIELD            :      6.6259e+08  :     113.66  :      23.74
FP EMULATION        :           332.5  :     159.55  :      36.82
FOURIER             :           49826  :      56.67  :      31.83
ASSIGNMENT          :          48.077  :     182.94  :      47.45
IDEA                :           14251  :     217.96  :      64.71
HUFFMAN             :          4265.6  :     118.28  :      37.77
NEURAL NET          :           86.66  :     139.21  :      58.56
LU DECOMPOSITION    :          3223.9  :     167.01  :     120.60
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 144.508
FLOATING-POINT INDEX: 109.623
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : 8 CPU AuthenticAMD AMD FX(tm)-9370 Eight-Core Processor 2124MHz
L2 Cache            : 2048 KB
OS                  : Linux 4.18.13-gentoo
C compiler          : x86_64-pc-linux-gnu-gcc
libc                : 
MEMORY INDEX        : 39.007
INTEGER INDEX       : 33.998
FLOATING-POINT INDEX: 60.801
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

What's quite interesting is how sensitive some of the benchmarks are to L1 cache residency, and others to high optimisation.
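As an aid to reading the summary lines in the output above: a reasonable reading, assumed here rather than stated in the thread, is that the two ORIGINAL BYTEMARK indexes are geometric means of the per-test Old Index column, with FOURIER, NEURAL NET, and LU DECOMPOSITION forming the floating-point group and the remaining seven tests the integer group. A quick check against the FX-9370 numbers:

```python
import math

# Old Index (Pentium 90 baseline) values copied from the FX-9370 run above.
old_index = {
    "NUMERIC SORT": 45.21,
    "STRING SORT": 340.38,
    "BITFIELD": 113.66,
    "FP EMULATION": 159.55,
    "FOURIER": 56.67,
    "ASSIGNMENT": 182.94,
    "IDEA": 217.96,
    "HUFFMAN": 118.28,
    "NEURAL NET": 139.21,
    "LU DECOMPOSITION": 167.01,
}

# Assumed grouping: these three tests feed the floating-point index,
# everything else feeds the integer index.
FP_TESTS = {"FOURIER", "NEURAL NET", "LU DECOMPOSITION"}

def geometric_mean(values):
    """Geometric mean computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

integer_index = geometric_mean(v for k, v in old_index.items() if k not in FP_TESTS)
fp_index = geometric_mean(v for k, v in old_index.items() if k in FP_TESTS)

print(f"integer index: {integer_index:.3f} (reported: 144.508)")
print(f"floating-point index: {fp_index:.3f} (reported: 109.623)")
```

Small differences from the reported values are expected, since the per-test indexes in the table are already rounded to two decimals.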

An old Athlon X2: baseline CFLAGS="-O2 -march=native" (although glibc is strongly optimized)

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :            1330  :      34.11  :      11.20
STRING SORT         :          300.38  :     134.22  :      20.77
BITFIELD            :       5.847e+08  :     100.30  :      20.95
FP EMULATION        :          291.18  :     139.72  :      32.24
FOURIER             :           27476  :      31.25  :      17.55
ASSIGNMENT          :           28.64  :     108.98  :      28.27
IDEA                :          8640.2  :     132.15  :      39.24
HUFFMAN             :          2834.7  :      78.61  :      25.10
NEURAL NET          :            46.3  :      74.38  :      31.29
LU DECOMPOSITION    :          1675.5  :      86.80  :      62.68
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 95.533
FLOATING-POINT INDEX: 58.647
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual AuthenticAMD AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ 3000MHz
L2 Cache            : 1024 KB
OS                  : Linux 4.6.2-gentoo
C compiler          : x86_64-pc-linux-gnu-gcc
libc                : 
MEMORY INDEX        : 23.085
INTEGER INDEX       : 24.421
FLOATING-POINT INDEX: 32.528
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

In combination with my set of flags, when compiled with -Os it gets a much better NUMERIC SORT result than the above AMD FX, while some other benchmarks suffer:

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          2154.9  :      55.26  :      18.15
STRING SORT         :          294.24  :     131.47  :      20.35
BITFIELD            :      7.4313e+08  :     127.47  :      26.63
FP EMULATION        :          100.86  :      48.40  :      11.17
FOURIER             :           27307  :      31.06  :      17.44
ASSIGNMENT          :          32.688  :     124.38  :      32.26
IDEA                :          7363.7  :     112.63  :      33.44
HUFFMAN             :          2140.8  :      59.36  :      18.96
NEURAL NET          :          43.025  :      69.12  :      29.07
LU DECOMPOSITION    :          1652.6  :      85.61  :      61.82
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 86.851
FLOATING-POINT INDEX: 56.851
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual AuthenticAMD AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ 3000MHz
L2 Cache            : 1024 KB
OS                  : Linux 4.6.2-gentoo
C compiler          : x86_64-pc-linux-gnu-gcc
libc                : 
MEMORY INDEX        : 25.953
INTEGER INDEX       : 18.933
FLOATING-POINT INDEX: 31.532
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

While with -Ofast it's quite different:

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1431.3  :      36.71  :      12.06
STRING SORT         :           302.9  :     135.34  :      20.95
BITFIELD            :      5.2024e+08  :      89.24  :      18.64
FP EMULATION        :          398.81  :     191.37  :      44.16
FOURIER             :           26144  :      29.73  :      16.70
ASSIGNMENT          :          33.852  :     128.81  :      33.41
IDEA                :          8664.1  :     132.52  :      39.34
HUFFMAN             :          2521.9  :      69.93  :      22.33
NEURAL NET          :          58.367  :      93.76  :      39.44
LU DECOMPOSITION    :          1631.5  :      84.52  :      61.03
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 100.182
FLOATING-POINT INDEX: 61.763
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual AuthenticAMD AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ 3000MHz
L2 Cache            : 1024 KB
OS                  : Linux 4.6.2-gentoo
C compiler          : x86_64-pc-linux-gnu-gcc
libc                : 
MEMORY INDEX        : 23.541
INTEGER INDEX       : 26.152
FLOATING-POINT INDEX: 34.256
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

At least it's faster than baseline :-) ... and yes, I really do need to update the kernel on that machine!!
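To put numbers on the comparison between the three Athlon X2 runs above, the iterations/sec columns can be divided by the baseline run. A small sketch, with the values copied from the tables above; the variable names are illustrative.

```python
# Iterations/sec on the Athlon X2, copied from the three tables above:
# (baseline -O2 -march=native, -Os run, -Ofast run)
iterations = {
    "NUMERIC SORT": (1330, 2154.9, 1431.3),
    "STRING SORT": (300.38, 294.24, 302.9),
    "BITFIELD": (5.847e08, 7.4313e08, 5.2024e08),
    "FP EMULATION": (291.18, 100.86, 398.81),
    "FOURIER": (27476, 27307, 26144),
    "ASSIGNMENT": (28.64, 32.688, 33.852),
    "IDEA": (8640.2, 7363.7, 8664.1),
    "HUFFMAN": (2834.7, 2140.8, 2521.9),
    "NEURAL NET": (46.3, 43.025, 58.367),
    "LU DECOMPOSITION": (1675.5, 1652.6, 1631.5),
}

# Ratio of each run to the baseline; values above 1.0 mean faster than baseline.
speedups = {
    name: (os_run / base, ofast_run / base)
    for name, (base, os_run, ofast_run) in iterations.items()
}

print(f"{'TEST':<18} {'-Os':>6} {'-Ofast':>7}")
for name, (os_ratio, ofast_ratio) in speedups.items():
    print(f"{name:<18} {os_ratio:>6.2f} {ofast_ratio:>7.2f}")
```

Laid out this way, the large NUMERIC SORT gain and FP EMULATION loss under -Os stand out, as does the FP EMULATION and NEURAL NET gain under -Ofast.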

For reference, the old nbench results page is here: http://web.archive.org/web/20160706230749/http://www.tux.org:80/~mayer/linux/results2.html

InBetweenNames / gentooLTO

Benchmarking #123