Open wolfwood opened 5 years ago
additional question: if you have 10 runtime measurements, would it make more sense to use the shortest execution time (assuming the longer runs are due to system noise) rather than the average (which seems to be what hyperfine, and maybe IRace, are doing)?
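To illustrate the intuition behind preferring the minimum: if system noise can only ever add time, the shortest run is a less biased estimate of the noise-free runtime than the mean. Here's a tiny simulation of exactly that assumption, with made-up numbers (the "true" runtime and the noise distribution are both hypothetical):

```python
import random

random.seed(0)

TRUE_TIME = 1.0  # hypothetical noise-free runtime, in seconds

# Model system noise as a non-negative additive term -- the assumption
# above: interference only ever makes a run slower, never faster.
samples = [TRUE_TIME + random.expovariate(10) for _ in range(10)]

mean_est = sum(samples) / len(samples)
min_est = min(samples)

# The minimum never undershoots the true time and discards the noisy
# outliers; the mean is biased upward by the average noise magnitude.
assert TRUE_TIME <= min_est <= mean_est
print(f"min = {min_est:.4f}s  mean = {mean_est:.4f}s")
```

The caveat is that the additive-noise assumption isn't always true (e.g. a flag that makes runtimes genuinely more variable would be hidden by taking the minimum), so which statistic is "right" depends on what you're optimizing for.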
Wow, this is excellent!! Any chance you could publish the mean values of the runs you listed? I notice the top entry has -fsemantic-interposition enabled too, which is interesting. I'm very interested in trying this out on my own system to see what the results might be!
at the end of an iteration, IRace only reports a mean for the best configuration:
Best-so-far configuration: 808882 mean value: 0.2936677815
previously (reverse chronological):
Best-so-far configuration: 808882 mean value: 0.2931026786
Best-so-far configuration: 808882 mean value: 0.2933184358
Best-so-far configuration: 808882 mean value: 0.2929306488
Best-so-far configuration: 808882 mean value: 0.2929306488
Best-so-far configuration: 808882 mean value: 0.2932026876
Best-so-far configuration: 808882 mean value: 0.2934753363
Best-so-far configuration: 808882 mean value: 0.2937485971
Best-so-far configuration: 808882 mean value: 0.2932022472
but I'm honestly not sure what this mean covers: is it only for the instances (individual benchmark + random seed) that were evaluated in the iteration, or for all that have ever been explored for a configuration? and is it evenly weighted (so benchmarks with more random seeds evaluated would be overrepresented), or is it normalized somehow?
here is an updated version of the scripts I used: gcc-O2plus.txt scenario3.txt (start with something like 10-100k experiments, instead of 20 million) instances.txt runner.txt default-first.txt
you'll need to chmod +x runner.txt so it's executable by irace (I had to change the extension because GitHub won't let you upload *.sh, or files without an extension), and emerge -av app-benchmarks/acovea dev-lang/R sys-process/time
you can install IRace globally by running sudo R
and then install.packages("irace")
or, if that doesn't work, install.packages("irace_3.1.tar.gz")
then you can verify that some things (but sadly not everything) will work with irace -s scenario3.txt --check
and run with irace -s scenario3.txt
you may need to uncomment the configurationsFile directive in the scenario, especially if you want to switch to time-bounded execution (this is not wall-clock time but the sum of all the configuration executions, so if you use parallelism it will finish much faster than you'd expect). also, you should probably adjust the parallelism.
it seemed to me that my runtimes were not noticeably slower when running 16 programs in parallel on an 18-physical-core machine. if you push into hyperthreading territory, though, your results will be pretty useless, because irace doesn't always use all the parallelism available.
to explore this, you can compile some program and then run seq 1 $N | parallel time ./someprogram
and see if the output of time stays roughly the same as you increase N.
one more note: I saw some instance reporting a mean best of 0.0000000, so it's possible that, at least under some flags, gcc can replace one of the acovea benchmarks with a constant. if we can figure out which one, we should probably drop it.
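One way to hunt for the guilty benchmark would be to time each compiled binary and flag anything that finishes near-instantly. A rough sketch of that idea (the threshold and the benchmark names in the usage comment are placeholders, not measured values):

```python
import subprocess
import time

def min_runtime(cmd, runs=3):
    """Minimum wall-clock time over a few runs of cmd (the shortest
    run is the one least contaminated by system noise)."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - t0)
    return best

def suspiciously_fast(binaries, threshold=0.05):
    """Flag binaries that finish almost instantly -- a hint that the
    compiler folded the whole benchmark into a constant."""
    return [b for b in binaries if min_runtime([b]) < threshold]

# Hypothetical usage against the compiled acovea benchmarks:
# suspiciously_fast(["./almabench", "./treebench", "./huffbench"])
```

Checking the disassembly (objdump -d) of anything flagged would confirm whether the hot loop actually survived compilation.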
just occurred to me that evaluating the effect of parallelism is a good use for hyperfine :)
emerge -auv cargo parallel
and cargo install hyperfine
then run parallel once by hand and dismiss the citation nag, then hyperfine --parameter-scan N 1 36 --prepare 'gcc -lm -lrt -O2 -march=native /usr/share/libacovea/benchmarks/treebench.c' --export-json=parallel.json 'seq 1 {N} | parallel ./a.out'
my output is like the following:
Benchmark #1: seq 1 1 | parallel ./a.out
Time (mean ± σ): 1.322 s ± 0.012 s [User: 1.292 s, System: 0.028 s]
Range (min … max): 1.307 s … 1.342 s 10 runs
Benchmark #2: seq 1 2 | parallel ./a.out
Time (mean ± σ): 1.323 s ± 0.013 s [User: 2.510 s, System: 0.032 s]
Range (min … max): 1.309 s … 1.354 s 10 runs
Benchmark #3: seq 1 3 | parallel ./a.out
Time (mean ± σ): 1.330 s ± 0.013 s [User: 3.744 s, System: 0.033 s]
Range (min … max): 1.311 s … 1.353 s 10 runs
Benchmark #4: seq 1 4 | parallel ./a.out
Time (mean ± σ): 1.331 s ± 0.012 s [User: 4.959 s, System: 0.037 s]
Range (min … max): 1.315 s … 1.351 s 10 runs
Benchmark #5: seq 1 5 | parallel ./a.out
Time (mean ± σ): 1.369 s ± 0.039 s [User: 6.240 s, System: 0.041 s]
Range (min … max): 1.325 s … 1.448 s 10 runs
Benchmark #6: seq 1 6 | parallel ./a.out
Time (mean ± σ): 1.354 s ± 0.015 s [User: 7.466 s, System: 0.041 s]
Range (min … max): 1.325 s … 1.384 s 10 runs
Benchmark #7: seq 1 7 | parallel ./a.out
Time (mean ± σ): 1.358 s ± 0.030 s [User: 8.690 s, System: 0.045 s]
Range (min … max): 1.336 s … 1.437 s 10 runs
Benchmark #8: seq 1 8 | parallel ./a.out
Time (mean ± σ): 1.365 s ± 0.023 s [User: 9.929 s, System: 0.047 s]
Range (min … max): 1.330 s … 1.403 s 10 runs
Benchmark #9: seq 1 9 | parallel ./a.out
Time (mean ± σ): 1.386 s ± 0.029 s [User: 11.178 s, System: 0.050 s]
Range (min … max): 1.346 s … 1.445 s 10 runs
Benchmark #10: seq 1 10 | parallel ./a.out
Time (mean ± σ): 1.368 s ± 0.033 s [User: 12.386 s, System: 0.049 s]
Range (min … max): 1.337 s … 1.437 s 10 runs
Benchmark #11: seq 1 11 | parallel ./a.out
Time (mean ± σ): 1.385 s ± 0.024 s [User: 13.642 s, System: 0.054 s]
Range (min … max): 1.348 s … 1.428 s 10 runs
Benchmark #12: seq 1 12 | parallel ./a.out
Time (mean ± σ): 1.378 s ± 0.025 s [User: 14.855 s, System: 0.056 s]
Range (min … max): 1.344 s … 1.417 s 10 runs
Benchmark #13: seq 1 13 | parallel ./a.out
Time (mean ± σ): 1.381 s ± 0.020 s [User: 16.100 s, System: 0.056 s]
Range (min … max): 1.355 s … 1.420 s 10 runs
Benchmark #14: seq 1 14 | parallel ./a.out
Time (mean ± σ): 1.381 s ± 0.024 s [User: 17.288 s, System: 0.064 s]
Range (min … max): 1.355 s … 1.421 s 10 runs
Benchmark #15: seq 1 15 | parallel ./a.out
Time (mean ± σ): 1.384 s ± 0.028 s [User: 18.541 s, System: 0.064 s]
Range (min … max): 1.341 s … 1.424 s 10 runs
Benchmark #16: seq 1 16 | parallel ./a.out
Time (mean ± σ): 1.371 s ± 0.019 s [User: 19.725 s, System: 0.067 s]
Range (min … max): 1.351 s … 1.399 s 10 runs
Benchmark #17: seq 1 17 | parallel ./a.out
Time (mean ± σ): 1.389 s ± 0.034 s [User: 20.982 s, System: 0.065 s]
Range (min … max): 1.340 s … 1.449 s 10 runs
Benchmark #18: seq 1 18 | parallel ./a.out
Time (mean ± σ): 1.376 s ± 0.019 s [User: 22.178 s, System: 0.072 s]
Range (min … max): 1.350 s … 1.403 s 10 runs
Benchmark #19: seq 1 19 | parallel ./a.out
Time (mean ± σ): 1.709 s ± 0.009 s [User: 24.137 s, System: 0.073 s]
Range (min … max): 1.701 s … 1.728 s 10 runs
[snip]
Summary
'seq 1 1 | parallel ./a.out' ran
1.00 ± 0.01 times faster than 'seq 1 2 | parallel ./a.out'
1.01 ± 0.01 times faster than 'seq 1 3 | parallel ./a.out'
1.01 ± 0.01 times faster than 'seq 1 4 | parallel ./a.out'
1.02 ± 0.01 times faster than 'seq 1 6 | parallel ./a.out'
1.03 ± 0.02 times faster than 'seq 1 7 | parallel ./a.out'
1.03 ± 0.02 times faster than 'seq 1 8 | parallel ./a.out'
1.03 ± 0.03 times faster than 'seq 1 10 | parallel ./a.out'
1.04 ± 0.03 times faster than 'seq 1 5 | parallel ./a.out'
1.04 ± 0.02 times faster than 'seq 1 16 | parallel ./a.out'
1.04 ± 0.02 times faster than 'seq 1 18 | parallel ./a.out'
1.04 ± 0.02 times faster than 'seq 1 12 | parallel ./a.out'
1.04 ± 0.02 times faster than 'seq 1 14 | parallel ./a.out'
1.04 ± 0.02 times faster than 'seq 1 13 | parallel ./a.out'
1.05 ± 0.02 times faster than 'seq 1 15 | parallel ./a.out'
1.05 ± 0.02 times faster than 'seq 1 11 | parallel ./a.out'
1.05 ± 0.02 times faster than 'seq 1 9 | parallel ./a.out'
1.05 ± 0.03 times faster than 'seq 1 17 | parallel ./a.out'
1.29 ± 0.01 times faster than 'seq 1 19 | parallel ./a.out'
1.29 ± 0.01 times faster than 'seq 1 20 | parallel ./a.out'
1.30 ± 0.01 times faster than 'seq 1 21 | parallel ./a.out'
1.31 ± 0.01 times faster than 'seq 1 22 | parallel ./a.out'
1.31 ± 0.01 times faster than 'seq 1 23 | parallel ./a.out'
1.31 ± 0.01 times faster than 'seq 1 24 | parallel ./a.out'
1.31 ± 0.02 times faster than 'seq 1 25 | parallel ./a.out'
1.32 ± 0.01 times faster than 'seq 1 26 | parallel ./a.out'
1.34 ± 0.02 times faster than 'seq 1 27 | parallel ./a.out'
1.38 ± 0.01 times faster than 'seq 1 28 | parallel ./a.out'
1.38 ± 0.01 times faster than 'seq 1 29 | parallel ./a.out'
1.39 ± 0.01 times faster than 'seq 1 30 | parallel ./a.out'
1.39 ± 0.01 times faster than 'seq 1 32 | parallel ./a.out'
1.39 ± 0.01 times faster than 'seq 1 31 | parallel ./a.out'
1.39 ± 0.01 times faster than 'seq 1 34 | parallel ./a.out'
1.39 ± 0.01 times faster than 'seq 1 33 | parallel ./a.out'
1.40 ± 0.01 times faster than 'seq 1 35 | parallel ./a.out'
1.40 ± 0.01 times faster than 'seq 1 36 | parallel ./a.out'
so parallelism up to the physical number of cores adds about 5% variation, but since some benchmarks run in < 100ms, that's actually well within the margin of error and may have been affecting my results.
for longer-running programs, the best-case execution times varied by <1% up to the max number of cores, but the worst case fluctuated wildly. I think the linux scheduler isn't good at preferring free cores over the SMT siblings of in-use cores. after I finish re-emerging world, I'll reboot and turn off hyperthreading for some more benchmarking. I'll also see if I can adjust the benchmark programs to all be closer to 1-second runtimes, unless you @InBetweenNames have ideas for a better benchmark collection?
looks like it's definitely better to benchmark with hyperthreading off.
started up another run of irace with the addition of -fplt, the -ffast-math subset from Clear, working -fweb and -frename-registers, and some extended alignment options.
I'll post updated config files tonight.
seems that almabench.c is the one that leads to 0-second runtimes. dropping it from the config. here's the latest: gcc-O2plus.txt runner.txt scenario3.txt instances.txt default-second.txt
An implicit part of this project, especially as we look at borrowing tricks for Clear Linux, is identifying appropriate compiler flags for global use (also we have the power to tune flags per-package, but that doesn't seem to be the main goal).
There are a host of gcc flags available, and some may only benefit certain programs while harming others. The idea behind autotuning is to take a collection of small benchmark programs and run a search in flag space to minimize the collective runtime of the benchmarks, i.e. to find the flags that usually work and don't carry large penalties when they don't.
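At its crudest, that search is just: sample subsets of flags, score each by the summed benchmark runtimes, keep the best. A toy sketch of that loop — the flag names and the cost function are stand-ins (a real run would compile and time the benchmarks, which is what IRace's runner script does):

```python
import random

# Stand-in flag pool; a real run would use the -O3-and-beyond flag set.
FLAGS = ["-fflag-a", "-fflag-b", "-fflag-c", "-fflag-d"]

def total_runtime(flag_set):
    """Placeholder for: compile every benchmark with these flags and
    sum the measured runtimes. Made-up costs keep this self-contained."""
    cost = 10.0
    if "-fflag-a" in flag_set:
        cost -= 1.0                          # pretend this flag helps
    if "-fflag-b" in flag_set:
        cost += 0.5                          # pretend this one hurts
    if {"-fflag-a", "-fflag-c"} <= flag_set:
        cost -= 0.5                          # -c only helps alongside -a
    return cost

def random_search(trials=200, seed=0):
    rng = random.Random(seed)
    best, best_cost = frozenset(), total_runtime(frozenset())
    for _ in range(trials):
        cand = frozenset(f for f in FLAGS if rng.random() < 0.5)
        cost = total_runtime(cand)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

best, cost = random_search()
print(sorted(best), cost)
```

What makes IRace better than this blind sampling is the racing part: candidate configurations are evaluated against each other on shared instances and statistically-losing candidates are discarded early, which is what makes multi-million-execution budgets tractable.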
The first such tool I encountered, many years ago, was Acovea, which used a genetic algorithm to drive its search. I was never particularly satisfied with its results, but one nice idea it had was to identify flags that were particularly influential (good or bad), because the effect of many flags is barely perceptible. I recently got the bug to check it out again, but a few limitations were grating:
So, by chance, I found a talk on using IRace, a general-purpose optimizer, to tune GCC flags.
A few other packages with similar aims are mentioned in the talk, but the author claims that those approaches are non-optimal, slower, or tightly coupled to the compiler.
IRace seems to address most of my critiques of Acovea:
I used IRace for a few weeks to tune flags for a research program. I then used what I learned to set up a run tuning GCC 8.3 flags for the Acovea benchmark set. I set -O2 -march=native as the baseline, then added all the flags enabled by -O3, and then most of the flags not enabled by -O3 that aren't related to an alternate instruction scheduler, Fortran, or security features, aren't implied by -ffast-math, and aren't tunable parameters.
I had been running for about a week when gcc 9.1 came out, so I decided to kill the job and upgrade. This long-winded post is all to say: I thought y'all might be interested in the preliminary results after 10,185,729 executions:
I'm running on a Skylake-X with AVX-512, so a big takeaway is that -fvect-cost-model=unlimited is better than the defaults for -O2 or -O3. for these benchmarks, LTO seems like a winner, but Graphite isn't necessarily. there may be some -O3 flags that are pessimistic, though this may also be resolved by the improved inlining thresholds in 9.1. also, -falign-functions=24, which is a restricted version of 32 rather than actually "24", may be worth investigating more.
Of course, the big caveats are that these results are (micro)architecture- and benchmark-specific, I didn't get to complete the run, and I screwed up the specification of -fweb and -frename-registers so they weren't evaluated at all :)
At the same time as playing with IRace, I've been using Hyperfine, an execution-time benchmarking harness, to evaluate performance tuning of a research program. I'm thinking it might be possible to use it, or bench, which inspired it, to measure runtimes and then run IRace in deterministic mode, where multiple random seeds are not used for benchmark instances.
A few Questions for the Community: