InBetweenNames / gentooLTO

A Gentoo Portage configuration for building with -O3, Graphite, and LTO optimizations
GNU General Public License v2.0

[Optimization] "automatic" compiler flag tuning AKA getting sciencey with CFLAGS #288

Open wolfwood opened 5 years ago

wolfwood commented 5 years ago

An implicit part of this project, especially as we look at borrowing tricks from Clear Linux, is identifying appropriate compiler flags for global use (we also have the power to tune flags per-package, but that doesn't seem to be the main goal).

There is a host of GCC flags available, and some may benefit certain programs while harming others. The idea behind autotuning is to take a collection of small benchmark programs and run a search through flag space to minimize the collective runtime of the benchmarks, i.e. find the flags that usually help and don't carry large penalties when they don't.
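To make the idea concrete, here is a toy version of that search in shell, timing one acovea benchmark under a few hand-picked flag sets (the flag sets are just examples; a real tuner explores far more of the space, far more cleverly):

  # toy flag-space search: best-of-three wall time for each candidate flag set
  BENCH=/usr/share/libacovea/benchmarks/treebench.c
  for FLAGS in "-O2 -march=native" \
               "-O3 -march=native" \
               "-O3 -march=native -flto -fuse-linker-plugin"; do
      gcc $FLAGS -lm -lrt "$BENCH" -o bench || continue
      BEST=$(for i in 1 2 3; do /usr/bin/time -f %e ./bench 2>&1 >/dev/null; done | sort -n | head -1)
      echo "$BEST  $FLAGS"
  done | sort -n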

The first such tool I encountered, many years ago, was Acovea, which used a genetic algorithm to drive its search. I was never particularly satisfied with its results, but one nice idea it had was identifying flags that were particularly influential (good or bad), since the effects of many flags are barely perceptible. I got the bug to check it out again recently, but a few limitations were grating:

So, by chance, I found a talk on using IRace, a general-purpose optimizer, to tune GCC flags.

A few other packages with similar aims are mentioned in the talk, but the author claims those approaches are suboptimal, slower, or tightly coupled to the compiler.

IRace seems to address most of my critiques of Acovea:

I used IRace for a few weeks to tune flags for a research program. I then used what I learned to set up a run tuning GCC 8.3 flags for the Acovea benchmark set. I set -O2 -march=native as the baseline, then added all the flags enabled by -O3, and then most of the flags not enabled by -O3 that aren't related to an alternate instruction scheduler, Fortran, or security features, aren't implied by -ffast-math, and aren't tunable parameters.
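For anyone unfamiliar with IRace's input, the flags get declared as categorical parameters roughly like this (an illustrative excerpt only; the full parameter file I used is attached in a later comment as gcc-O2plus.txt):

  # each line: <name>  <switch prepended to the chosen value>  <type>  (<values>)
  optimization_level  "-"                   c  (O2)
  flto                "-"                   c  (flto, fno-lto)
  ftree_vectorize     "-"                   c  (ftree-vectorize, fno-tree-vectorize)
  fvect_cost_model    "-fvect-cost-model="  c  (unlimited, dynamic, cheap)
  falign_functions    "-falign-functions="  c  (16, 24, 32)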

I had been running for about a week when GCC 9.1 came out, so I decided to kill the job and upgrade. This long-winded post is all to say: I thought y'all might be interested in the preliminary results after 10,185,729 executions:

# 2019-05-04 14:04:50 PDT: Elite configurations (first number is the configuration ID; listed from best to worst according to the mean value):                                
       optimization_level  march    funswitch_loops fvect_cost_model    ftree_vectorize flto    fgraphite_identity    floop_nest_optimize ffast_math    fcommon    fipa_pta     fsemantic_interposition falign_functions fgcse_sm    fgcse_las    fira_loop_pressure    flimit_function_alignment    flive_range_shrinkage    fsched_pressure    fsched_spec_load    funroll_loops    funroll_all_loops   fvariable_expansion_in_unroller    fgcse_after_reload finline_functions fipa_cp_clone floop_interchange    floop_unroll_and_jam    fpeel_loops    fpredictive_commoning    fsplit_loops fsplit_paths    ftree_loop_distribution ftree_loop_distribute_patterns    ftree_partial_pre    ftree_slp_vectorize ftree_loop_vectorize fwhole_program fuse_linker_plugin fweb frename_registers                                
808882                 O2 native fno-unswitch-loops        unlimited fno-tree-vectorize flto fno-graphite-identity fno-loop-nest-optimize ffast-math fno-common fno-ipa-pta     fsemantic-interposition               24 fgcse-sm fno-gcse-las fno-ira-loop-pressure fno-limit-function-alignment    flive-range-shrinkage fno-sched-pressure fno-sched-specload fno-unroll-loops fno-unroll-all-loops fno-variable-expansion-in-unroller    fgcse-after-reload finline-functions fipa-cp-clone floop-interchange    floop-unroll-and-jam fno-peel-loops    fpredictive-commoning fno-split-loops fsplit-paths    ftree-loop-distribution ftree-loop-distribute-patterns fno-tree-partial-pre fno-tree-slp-vectorize ftee-loop-vectorize           <NA> fuse-linker-plugin   NA                NA                    
794846                 O2 native fno-unswitch-loops        unlimited fno-tree-vectorize flto fno-graphite-identity fno-loop-nest-optimize ffast-math fno-common fno-ipa-pta     fsemantic-interposition               32 fgcse-sm fno-gcse-las fno-ira-loop-pressure fno-limit-function-alignment fno-live-range-shrinkage fno-sched-pressure fno-sched-specload fno-unroll-loops fno-unroll-all-loops fno-variable-expansion-in-unroller fno-gcse-after-reload finline-functions fipa-cp-clone floop-interchange    floop-unroll-and-jam fno-peel-loops    fpredictive-commoning fno-split-loops fsplit-paths    ftree-loop-distribution ftree-loop-distribute-patterns fno-tree-partial-pre fno-tree-slp-vectorize ftee-loop-vectorize           <NA> fuse-linker-plugin   NA                NA
791566                 O2 native fno-unswitch-loops        unlimited fno-tree-vectorize flto fno-graphite-identity fno-loop-nest-optimize ffast-math fno-common fno-ipa-pta     fsemantic-interposition               24 fgcse-sm fno-gcse-las fno-ira-loop-pressure fno-limit-function-alignment    flive-range-shrinkage    fsched-pressure fno-sched-specload fno-unroll-loops fno-unroll-all-loops fno-variable-expansion-in-unroller fno-gcse-after-reload finline-functions fipa-cp-clone floop-interchange    floop-unroll-and-jam fno-peel-loops fno-predictive-commoning    fsplit-loops fsplit-paths fno-tree-loop-distribution ftree-loop-distribute-patterns fno-tree-partial-pre fno-tree-slp-vectorize ftee-loop-vectorize           <NA> fuse-linker-plugin   NA                NA
794382                 O2 native fno-unswitch-loops        unlimited fno-tree-vectorize flto fno-graphite-identity fno-loop-nest-optimize ffast-math fno-common fno-ipa-pta  fno-semantic-interposition               32 fgcse-sm    fgcse-las fno-ira-loop-pressure fno-limit-function-alignment    flive-range-shrinkage fno-sched-pressure fno-sched-specload fno-unroll-loops fno-unroll-all-loops    fvariable-expansion-in-unroller fno-gcse-after-reload finline-functions fipa-cp-clone floop-interchange    floop-unroll-and-jam fno-peel-loops fno-predictive-commoning    fsplit-loops fsplit-paths    ftree-loop-distribution ftree-loop-distribute-patterns fno-tree-partial-pre fno-tree-slp-vectorize ftee-loop-vectorize           <NA> fuse-linker-plugin   NA                NA
799311                 O2 native fno-unswitch-loops        unlimited fno-tree-vectorize flto fno-graphite-identity fno-loop-nest-optimize ffast-math fno-common fno-ipa-pta     fsemantic-interposition               32 fgcse-sm fno-gcse-las fno-ira-loop-pressure fno-limit-function-alignment fno-live-range-shrinkage fno-sched-pressure fno-sched-specload fno-unroll-loops fno-unroll-all-loops fno-variable-expansion-in-unroller fno-gcse-after-reload finline-functions fipa-cp-clone floop-interchange    floop-unroll-and-jam fno-peel-loops    fpredictive-commoning fno-split-loops fsplit-paths    ftree-loop-distribution ftree-loop-distribute-patterns fno-tree-partial-pre fno-tree-slp-vectorize ftee-loop-vectorize           <NA> fuse-linker-plugin   NA                NA
809901                 O2 native fno-unswitch-loops        unlimited fno-tree-vectorize flto fno-graphite-identity fno-loop-nest-optimize ffast-math fno-common fno-ipa-pta     fsemantic-interposition               24 fgcse-sm fno-gcse-las fno-ira-loop-pressure fno-limit-function-alignment fno-live-range-shrinkage fno-sched-pressure fno-sched-specload fno-unroll-loops fno-unroll-all-loops fno-variable-expansion-in-unroller fno-gcse-after-reload finline-functions fipa-cp-clone floop-interchange fno-loop-unroll-and-jam fno-peel-loops fno-predictive-commoning fno-split-loops fsplit-paths    ftree-loop-distribution ftree-loop-distribute-patterns fno-tree-partial-pre fno-tree-slp-vectorize ftee-loop-vectorize           <NA> fuse-linker-plugin   NA                NA
809090                 O2 native fno-unswitch-loops        unlimited fno-tree-vectorize flto fno-graphite-identity fno-loop-nest-optimize ffast-math fno-common fno-ipa-pta     fsemantic-interposition               16 fgcse-sm    fgcse-las fno-ira-loop-pressure fno-limit-function-alignment fno-live-range-shrinkage fno-sched-pressure fno-sched-specload fno-unroll-loops fno-unroll-all-loops    fvariable-expansion-in-unroller fno-gcse-after-reload finline-functions fipa-cp-clone floop-interchange fno-loop-unroll-and-jam fno-peel-loops fno-predictive-commoning    fsplit-loops fsplit-paths    ftree-loop-distribution ftree-loop-distribute-patterns fno-tree-partial-pre fno-tree-slp-vectorize ftee-loop-vectorize           <NA> fuse-linker-plugin   NA                NA

I'm running on a Skylake-X with AVX-512, so a big takeaway is that -fvect-cost-model=unlimited is better than the defaults for -O2 or -O3. For these benchmarks, LTO looks like a winner, but Graphite isn't necessarily. There may be some -O3 flags that are pessimizations, but that may also be resolved by the improved inlining thresholds in 9.1. Also, -falign-functions=24 (which is a restricted version of 32, not literally "24") may be worth investigating more.

Of course, the big caveats are that these results are (micro)architecture- and benchmark-specific, I didn't get to complete the run, and I screwed up the specification of -fweb and -frename-registers so they weren't evaluated at all :)

At the same time as I've been playing with IRace, I've been using hyperfine, an execution-time benchmarking harness, to evaluate performance tuning of a research program. I'm thinking it might be possible to use it, or bench (which inspired it), to measure runtimes and then run IRace in deterministic mode, where multiple random seeds are not used for benchmark instances.
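For context, the target runner that IRace invokes for each (configuration, instance) pair is just a script that compiles the benchmark with the candidate flags, runs it, and prints the cost to minimize. A stripped-down sketch (the real thing is the runner.txt attached below; the argument order assumes IRace's usual convention of configuration id, instance id, seed, instance, then the flags):

  #!/bin/bash
  # sketch of an irace target runner: print the measured wall time as the cost
  CONFIG_ID=$1; INSTANCE_ID=$2; SEED=$3; INSTANCE=$4; shift 4
  FLAGS="$*"
  BIN=bin-${CONFIG_ID}-${INSTANCE_ID}
  # large penalty if this flag combination doesn't even compile
  gcc $FLAGS -lm -lrt "$INSTANCE" -o "$BIN" 2>/dev/null || { echo 9999; exit 0; }
  /usr/bin/time -f %e -o "$BIN.time" "./$BIN" >/dev/null 2>&1
  cat "$BIN.time"
  rm -f "$BIN" "$BIN.time"

hyperfine or bench would slot in where /usr/bin/time is here, with the mean (or min) pulled from its JSON output instead.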

A few Questions for the Community:

wolfwood commented 5 years ago

additional question: if you have 10 runtime measurements, would it make more sense to use the shortest execution time (assuming the longer runs are due to system noise) than to use the average (which seems to be what hyperfine, and maybe IRace, are doing)?
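(If the hyperfine JSON export is in play, both are easy to pull out afterwards; the field names below assume its usual --export-json schema, and results.json is just a placeholder name:)

  # mean vs. minimum per command from a hyperfine JSON export
  jq -r '.results[] | [.command, .mean, .min] | @tsv' results.json
  # or recompute the minimum from the raw per-run samples
  jq -r '.results[] | .times | min' results.json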

InBetweenNames commented 5 years ago

Wow, this is excellent!! Any chance you could publish the mean values of the runs you listed? I notice the top entry has -fsemantic-interposition enabled too which is interesting. I'm very interested in trying this out on my own system to see what the results might be!

wolfwood commented 5 years ago

at the end of an iteration, IRace only spits out a mean for the best configuration:

Best-so-far configuration:      808882    mean value:    0.2936677815

previously (reverse chronological):

Best-so-far configuration:      808882    mean value:    0.2931026786
Best-so-far configuration:      808882    mean value:    0.2933184358
Best-so-far configuration:      808882    mean value:    0.2929306488
Best-so-far configuration:      808882    mean value:    0.2929306488
Best-so-far configuration:      808882    mean value:    0.2932026876
Best-so-far configuration:      808882    mean value:    0.2934753363
Best-so-far configuration:      808882    mean value:    0.2937485971
Best-so-far configuration:      808882    mean value:    0.2932022472

but I'm honestly not sure what this mean represents: is it only over the Instances (individual benchmark + random seed) that were evaluated in the iteration, or over all that have ever been explored for a configuration? is it evenly weighted (so benchmarks with more random seeds evaluated would be overrepresented) or is it normalized somehow?

here is an updated version of the scripts I used: gcc-O2plus.txt scenario3.txt (start with like 10-100k experiments, instead of 20 million) instances.txt runner.txt default-first.txt

you'll need to chmod +x runner.txt so it's executable by irace (I had to change the extension because GitHub won't let you upload *.sh or extensionless files), and emerge -av app-benchmarks/acovea dev-lang/R sys-process/time

you can install IRace globally with sudo R and then install.packages("irace"), or, if that doesn't work, install.packages("irace_3.1.tar.gz")
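or non-interactively from a shell, something like:

  # assumes Rscript is on PATH and a CRAN mirror is reachable
  sudo Rscript -e 'install.packages("irace", repos = "https://cloud.r-project.org")'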

then you can verify that some things (but sadly not everything) will work with irace -s scenario3.txt --check, and then run with irace -s scenario3.txt. you may need to uncomment the configurationsFile directive in the scenario, especially if you want to switch to time-bounded execution (that budget is not wall-clock time but the sum of all the configuration executions, so if you use parallelism it will finish much faster than you'd think). also, you should probably adjust the parallelism.
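for reference, the knobs I'm talking about live in the scenario file and look roughly like this (an illustrative excerpt; the attached scenario3.txt is the real thing):

  parameterFile      = "gcc-O2plus.txt"
  trainInstancesFile = "instances.txt"
  targetRunner       = "./runner.txt"
  configurationsFile = "default-first.txt"  # initial configuration(s) to seed the search
  maxExperiments     = 100000               # or set maxTime for time-bounded execution
  parallel           = 16                   # keep this at or below your physical core count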

it seemed to me that my runtimes were not noticeably slower when running 16 programs in parallel on an 18-physical-core machine. if you push into hyperthreading territory, though, your results will be pretty useless, because irace doesn't always use all the parallelism available, so the amount of contention varies between runs.

to explore this, you can compile some program and then run seq 1 $N | parallel time ./someprogram and see if the output of time stays roughly the same as you increase N.

wolfwood commented 5 years ago

one more note: I saw some instance reporting a mean best of 0.0000000, so it's possible that, at least under some flags, gcc can replace one of the acovea benchmarks with a constant. if we can figure out which one, we should probably drop it.
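one way to hunt for the culprit is to time each benchmark individually under a suspect flag set and look for the ~0 s outlier (the flag set here is just an example, and I'm assuming all the acovea benchmarks live under that directory as .c files):

  for SRC in /usr/share/libacovea/benchmarks/*.c; do
      gcc -O3 -march=native -ffast-math -lm -lrt "$SRC" -o probe 2>/dev/null || continue
      printf '%s\t' "$(basename "$SRC")"
      /usr/bin/time -f %e ./probe 2>&1 >/dev/null
  done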

wolfwood commented 5 years ago

just occurred to me that evaluating the effect of parallelism is a good use for hyperfine :)

emerge -auv cargo parallel and cargo install hyperfine, then run parallel once by hand to dismiss the citation nag, then run hyperfine --parameter-scan N 1 36 --prepare 'gcc -lm -lrt -O2 -march=native /usr/share/libacovea/benchmarks/treebench.c' --export-json=parallel.json 'seq 1 {N} | parallel ./a.out'

my output looks like the following:

Benchmark #1: seq 1 1 | parallel ./a.out
  Time (mean ± σ):      1.322 s ±  0.012 s    [User: 1.292 s, System: 0.028 s]
  Range (min … max):    1.307 s …  1.342 s    10 runs

Benchmark #2: seq 1 2 | parallel ./a.out
  Time (mean ± σ):      1.323 s ±  0.013 s    [User: 2.510 s, System: 0.032 s]
  Range (min … max):    1.309 s …  1.354 s    10 runs

Benchmark #3: seq 1 3 | parallel ./a.out
  Time (mean ± σ):      1.330 s ±  0.013 s    [User: 3.744 s, System: 0.033 s]
  Range (min … max):    1.311 s …  1.353 s    10 runs

Benchmark #4: seq 1 4 | parallel ./a.out
  Time (mean ± σ):      1.331 s ±  0.012 s    [User: 4.959 s, System: 0.037 s]
  Range (min … max):    1.315 s …  1.351 s    10 runs

Benchmark #5: seq 1 5 | parallel ./a.out
  Time (mean ± σ):      1.369 s ±  0.039 s    [User: 6.240 s, System: 0.041 s]
  Range (min … max):    1.325 s …  1.448 s    10 runs

Benchmark #6: seq 1 6 | parallel ./a.out
  Time (mean ± σ):      1.354 s ±  0.015 s    [User: 7.466 s, System: 0.041 s]
  Range (min … max):    1.325 s …  1.384 s    10 runs

Benchmark #7: seq 1 7 | parallel ./a.out
  Time (mean ± σ):      1.358 s ±  0.030 s    [User: 8.690 s, System: 0.045 s]
  Range (min … max):    1.336 s …  1.437 s    10 runs

Benchmark #8: seq 1 8 | parallel ./a.out
  Time (mean ± σ):      1.365 s ±  0.023 s    [User: 9.929 s, System: 0.047 s]
  Range (min … max):    1.330 s …  1.403 s    10 runs

Benchmark #9: seq 1 9 | parallel ./a.out
  Time (mean ± σ):      1.386 s ±  0.029 s    [User: 11.178 s, System: 0.050 s]
  Range (min … max):    1.346 s …  1.445 s    10 runs

Benchmark #10: seq 1 10 | parallel ./a.out
  Time (mean ± σ):      1.368 s ±  0.033 s    [User: 12.386 s, System: 0.049 s]
  Range (min … max):    1.337 s …  1.437 s    10 runs

Benchmark #11: seq 1 11 | parallel ./a.out
  Time (mean ± σ):      1.385 s ±  0.024 s    [User: 13.642 s, System: 0.054 s]
  Range (min … max):    1.348 s …  1.428 s    10 runs

Benchmark #12: seq 1 12 | parallel ./a.out
  Time (mean ± σ):      1.378 s ±  0.025 s    [User: 14.855 s, System: 0.056 s]
  Range (min … max):    1.344 s …  1.417 s    10 runs

Benchmark #13: seq 1 13 | parallel ./a.out
  Time (mean ± σ):      1.381 s ±  0.020 s    [User: 16.100 s, System: 0.056 s]
  Range (min … max):    1.355 s …  1.420 s    10 runs                         

Benchmark #14: seq 1 14 | parallel ./a.out
  Time (mean ± σ):      1.381 s ±  0.024 s    [User: 17.288 s, System: 0.064 s]
  Range (min … max):    1.355 s …  1.421 s    10 runs                         

Benchmark #15: seq 1 15 | parallel ./a.out
  Time (mean ± σ):      1.384 s ±  0.028 s    [User: 18.541 s, System: 0.064 s]
  Range (min … max):    1.341 s …  1.424 s    10 runs                         

Benchmark #16: seq 1 16 | parallel ./a.out
  Time (mean ± σ):      1.371 s ±  0.019 s    [User: 19.725 s, System: 0.067 s]
  Range (min … max):    1.351 s …  1.399 s    10 runs                         

Benchmark #17: seq 1 17 | parallel ./a.out
  Time (mean ± σ):      1.389 s ±  0.034 s    [User: 20.982 s, System: 0.065 s]
  Range (min … max):    1.340 s …  1.449 s    10 runs                         

Benchmark #18: seq 1 18 | parallel ./a.out
  Time (mean ± σ):      1.376 s ±  0.019 s    [User: 22.178 s, System: 0.072 s]
  Range (min … max):    1.350 s …  1.403 s    10 runs                         

Benchmark #19: seq 1 19 | parallel ./a.out
  Time (mean ± σ):      1.709 s ±  0.009 s    [User: 24.137 s, System: 0.073 s]
  Range (min … max):    1.701 s …  1.728 s    10 runs

snip

Summary                                                                        
  'seq 1 1 | parallel ./a.out' ran                                            
    1.00 ± 0.01 times faster than 'seq 1 2 | parallel ./a.out'
    1.01 ± 0.01 times faster than 'seq 1 3 | parallel ./a.out'                 
    1.01 ± 0.01 times faster than 'seq 1 4 | parallel ./a.out'                 
    1.02 ± 0.01 times faster than 'seq 1 6 | parallel ./a.out'                
    1.03 ± 0.02 times faster than 'seq 1 7 | parallel ./a.out'
    1.03 ± 0.02 times faster than 'seq 1 8 | parallel ./a.out'                 
    1.03 ± 0.03 times faster than 'seq 1 10 | parallel ./a.out'                
    1.04 ± 0.03 times faster than 'seq 1 5 | parallel ./a.out'                
    1.04 ± 0.02 times faster than 'seq 1 16 | parallel ./a.out'
    1.04 ± 0.02 times faster than 'seq 1 18 | parallel ./a.out'                
    1.04 ± 0.02 times faster than 'seq 1 12 | parallel ./a.out'                
    1.04 ± 0.02 times faster than 'seq 1 14 | parallel ./a.out'               
    1.04 ± 0.02 times faster than 'seq 1 13 | parallel ./a.out'
    1.05 ± 0.02 times faster than 'seq 1 15 | parallel ./a.out'                
    1.05 ± 0.02 times faster than 'seq 1 11 | parallel ./a.out'                
    1.05 ± 0.02 times faster than 'seq 1 9 | parallel ./a.out'                
    1.05 ± 0.03 times faster than 'seq 1 17 | parallel ./a.out'
    1.29 ± 0.01 times faster than 'seq 1 19 | parallel ./a.out'                
    1.29 ± 0.01 times faster than 'seq 1 20 | parallel ./a.out'                
    1.30 ± 0.01 times faster than 'seq 1 21 | parallel ./a.out'               
    1.31 ± 0.01 times faster than 'seq 1 22 | parallel ./a.out'
    1.31 ± 0.01 times faster than 'seq 1 23 | parallel ./a.out'                
    1.31 ± 0.01 times faster than 'seq 1 24 | parallel ./a.out'                
    1.31 ± 0.02 times faster than 'seq 1 25 | parallel ./a.out'                
    1.32 ± 0.01 times faster than 'seq 1 26 | parallel ./a.out'
    1.34 ± 0.02 times faster than 'seq 1 27 | parallel ./a.out'                
    1.38 ± 0.01 times faster than 'seq 1 28 | parallel ./a.out'                
    1.38 ± 0.01 times faster than 'seq 1 29 | parallel ./a.out'                
    1.39 ± 0.01 times faster than 'seq 1 30 | parallel ./a.out'
    1.39 ± 0.01 times faster than 'seq 1 32 | parallel ./a.out'                
    1.39 ± 0.01 times faster than 'seq 1 31 | parallel ./a.out'                
    1.39 ± 0.01 times faster than 'seq 1 34 | parallel ./a.out'                
    1.39 ± 0.01 times faster than 'seq 1 33 | parallel ./a.out'
    1.40 ± 0.01 times faster than 'seq 1 35 | parallel ./a.out'                
    1.40 ± 0.01 times faster than 'seq 1 36 | parallel ./a.out'

so parallelism up to the physical number of cores adds about 5% variation, but since some benchmarks run in < 100 ms, that's actually well within the margin of error and may have been affecting my results.
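(for a quick look at how the means scale with N, the exported JSON is enough:)

  # sort the runs by mean time; .mean and .command are standard hyperfine JSON fields
  jq -r '.results[] | "\(.mean)\t\(.command)"' parallel.json | sort -n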

wolfwood commented 5 years ago

for longer-running programs the best-case execution times varied by <1% up to the max number of cores, but the worst case fluctuated wildly. I think the Linux scheduler isn't good at preferring free cores over the SMT siblings of in-use cores. after I finish re-emerging world I'll reboot and turn off hyperthreading for some more benchmarking. I'll see if I can adjust the benchmark programs to all be closer to 1 s runtimes, unless you @InBetweenNames have ideas for a better benchmark collection?

wolfwood commented 5 years ago

looks like it's definitely better to benchmark with hyperthreading off.
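(as an aside, on kernels new enough to expose the sysfs knob, SMT can also be toggled without a reboot or a trip into the BIOS:)

  # requires root; 'echo on' restores it afterwards
  echo off > /sys/devices/system/cpu/smt/control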

started up another run of irace with the addition of -fplt, the -ffast-math subset from Clear Linux, working -fweb and -frename-registers, and some extended alignment options.

I'll post updated config files tonight.

wolfwood commented 5 years ago

seems that almabench.c is the one that leads to 0 sec runtimes. dropping it from the config. here's the latest: gcc-O2plus.txt runner.txt scenario3.txt instances.txt default-second.txt