faster-cpython / ideas


Investigate whether BOLT reduces benchmark variability #578

Open mdboom opened 1 year ago

mdboom commented 1 year ago

@itamaro suggested that using BOLT may reduce benchmarking variability. We should run the A/A tests in this mode to see if it helps.

mdboom commented 1 year ago

Locally (Debian bullseye, with the llvm packages from here), I seem to be segfaulting llvm-bolt:

 #0 0x000055c63a2ba081 (/usr/bin/llvm-bolt+0x1adc081)
 #1 0x000055c63a2b7f1c (/usr/bin/llvm-bolt+0x1ad9f1c)
 #2 0x000055c63a2ba596 (/usr/bin/llvm-bolt+0x1adc596)
 #3 0x00007ff0ed3e7f90 (/lib/x86_64-linux-gnu/libc.so.6+0x3bf90)
 #4 0x000055c63b001740 (/usr/bin/llvm-bolt+0x2823740)
 #5 0x000055c63a365be0 (/usr/bin/llvm-bolt+0x1b87be0)
 #6 0x000055c63a364413 (/usr/bin/llvm-bolt+0x1b86413)
 #7 0x000055c63a35fcfd (/usr/bin/llvm-bolt+0x1b81cfd)
 #8 0x000055c63a35f139 (/usr/bin/llvm-bolt+0x1b81139)
 #9 0x000055c63a30634b (/usr/bin/llvm-bolt+0x1b2834b)
#10 0x000055c63a2fe33b (/usr/bin/llvm-bolt+0x1b2033b)
#11 0x000055c639021ba2 (/usr/bin/llvm-bolt+0x843ba2)
#12 0x00007ff0ed3d318a __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3
#13 0x00007ff0ed3d3245 call_init ./csu/../csu/libc-start.c:128:20
#14 0x00007ff0ed3d3245 __libc_start_main ./csu/../csu/libc-start.c:368:5
#15 0x000055c63901fcd1 (/usr/bin/llvm-bolt+0x841cd1)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: /usr/bin/llvm-bolt ./python -o python.bolt -data=python.fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=none -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
Segmentation fault

Will try this on our benchmarking machines (with different versions of Ubuntu), to see if I have more luck.
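For anyone scripting this step, here is a minimal sketch that only assembles the argument list from the "Program arguments" line in the stack dump above; it does not run `llvm-bolt`. The flags and file names are copied verbatim from that line. The function name `bolt_command` is hypothetical, and the binary path is an assumption that will differ per machine.

```python
import shlex

# Flags copied verbatim from the "Program arguments" line in the stack dump.
BOLT_FLAGS = [
    "-update-debug-sections",
    "-reorder-blocks=ext-tsp",
    "-reorder-functions=hfsort+",
    "-split-functions",
    "-icf=1",
    "-inline-all",
    "-split-eh",
    "-reorder-functions-use-hot-size",
    "-peepholes=none",
    "-jump-tables=aggressive",
    "-inline-ap",
    "-indirect-call-promotion=all",
    "-dyno-stats",
    "-use-gnu-stack",
    "-frame-opt=hot",
]


def bolt_command(binary="./python", output="python.bolt",
                 profile="python.fdata", bolt="/usr/bin/llvm-bolt"):
    """Build the llvm-bolt argv list (hypothetical helper; nothing is executed)."""
    return [bolt, binary, "-o", output, f"-data={profile}", *BOLT_FLAGS]


# Print a shell-quoted form that can be pasted into a terminal or a bug report.
print(shlex.join(bolt_command()))
```

Keeping the flags in one list makes it easier to bisect which optimization triggers a crash by removing them one at a time.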

corona10 commented 1 year ago

> Locally (Debian bullseye, with the llvm packages from here), I seem to be segfaulting llvm-bolt:

Which LLVM version are you using?

mdboom commented 1 year ago

> > Locally (Debian bullseye, with the llvm packages from here), I seem to be segfaulting llvm-bolt:
>
> Which LLVM version are you using?

15.0.7. Should I try a more recent one?

corona10 commented 1 year ago

> 15.0.7. Should I try a more recent one?

Yes, please use 16.0.0 or later. Here are my experiments: https://docs.google.com/presentation/d/1YTZfgaS9yqUDoIg1wryJuEdtB0ZHaDBJ_j2CK7GK5aM (page 11 - Analysis Environment)

mdboom commented 1 year ago

Thanks for the pointer. LLVM 16.0.3 produces something that works.
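A build script could guard against the too-old-LLVM case before invoking BOLT. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the 16.0.0 minimum is the one suggested in this thread, and the input is assumed to be any string containing an `X.Y.Z` version number (how to obtain that string from the tool is left out).

```python
import re

# Minimum version suggested in this thread; treat it as an assumption.
MIN_BOLT_VERSION = (16, 0, 0)


def parse_version(text):
    """Return the first X.Y.Z version number found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_new_enough(text, minimum=MIN_BOLT_VERSION):
    """Compare versions numerically, so that 16.0.3 > 15.0.7 and 16.0.10 > 16.0.9."""
    return parse_version(text) >= minimum


print(is_new_enough("15.0.7"))  # False
print(is_new_enough("16.0.3"))  # True
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically.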

corona10 commented 1 year ago

Nice, and please take a look at https://github.com/faster-cpython/ideas/issues/551#issuecomment-1536410741 for your experimentation.

IIUC, to stabilize the benchmark results, we should train(?) the binary by running the pyperformance benchmarks (this is what the Pyston team did originally), and I expect that will reduce the noise by reducing the L1 cache miss ratio.

mdboom commented 1 year ago

Here are the results of an A/A test of a recent CPython commit (45a9e3):

Without BOLT: ![without-bolt](https://user-images.githubusercontent.com/38294/236859621-c80287ec-ceb1-486f-99a2-53280c5215ea.png)

With BOLT: ![with-bolt](https://user-images.githubusercontent.com/38294/236859701-385ca288-2f6c-4ad3-bba3-d32c45006b5f.png)

| build | min | 10%-ile | mean | 90%-ile | max |
|---|---|---|---|---|---|
| no BOLT | 0.87 | 0.97 | 1.00 | 1.03 | 1.18 |
| BOLT | 0.91 | 0.97 | 1.00 | 1.02 | 1.11 |

So there is in fact a little less variability in the "long tail" with BOLT than without it (min/max of 0.91/1.11 versus 0.87/1.18), which is a little surprising and counter-intuitive. However, the 10th and 90th percentiles are almost identical, so it's not an obvious, easy win.
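For readers who want to compute the same kind of summary from their own A/A runs, here is a minimal sketch. It assumes the input is a list of per-benchmark timing ratios between two builds of the same commit. The sample data is made up for illustration, the arithmetic mean is an assumption about what "mean" refers to, and the percentile method (Python's default "exclusive" interpolation) may differ from what the benchmarking tooling uses.

```python
import statistics


def summarize(ratios):
    """Summarize A/A timing ratios as min, 10th percentile, mean, 90th percentile, max."""
    # quantiles(n=10) returns the 9 cut points between deciles:
    # the first is the 10th percentile, the last is the 90th.
    deciles = statistics.quantiles(ratios, n=10)
    return {
        "min": min(ratios),
        "p10": deciles[0],
        "mean": statistics.mean(ratios),
        "p90": deciles[-1],
        "max": max(ratios),
    }


# Hypothetical ratios, NOT the data behind the plots above.
sample = [0.90, 0.96, 0.98, 0.99, 1.00, 1.00, 1.01, 1.02, 1.04, 1.10]
for name, value in summarize(sample).items():
    print(f"{name:>4}: {value:.3f}")
```

Reporting the percentiles alongside min/max separates the bulk of the distribution from a handful of outlier benchmarks, which is the distinction drawn in the comment above.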

We'd probably see less variability by re-using the BOLT profiling data between runs, but it's not clear how transferable that data would be in the general case between builds with important changes in the source code. (That's the same reason we don't do that for PGO either.)

mdboom commented 1 year ago

This is totally unrelated to the original purpose of this issue, but anticipating the question, here are the results of BOLT vs. non-BOLT on our benchmarking hardware: an approximately 2% speedup.

BOLT vs. non-BOLT ![compare](https://user-images.githubusercontent.com/38294/236862263-738694e3-d8b7-4ff6-bc2d-b27045dcccbe.png)