facebookarchive / BOLT

Binary Optimization and Layout Tool - a Linux command-line utility for optimizing the performance of binaries

[Question] Optimization passes during instrumentation processing #257

Closed yavtuk closed 2 years ago

yavtuk commented 2 years ago

Hi, is there a problem with performing optimization passes during instrumentation processing? The profile should map more precisely onto the original binary if we do not perform all optimization passes while instrumenting. Processing time would also be reduced by applying only the necessary and sufficient set of passes.

Thanks in advance, Alexey

rafaelauler commented 2 years ago

Do you mean running optimization passes when instrumenting a binary? aka running llvm-bolt -instrument -reorder-blocks ?

I typically don't do that. I will only run the instrumentation pass, alone. Typically you need the profile anyway for BOLT to do anything useful, and if you are instrumenting, you don't have the profile yet.

Now suppose you do have some version of a profile when instrumenting. Then, I assume the instrumented program will run faster. You could, in theory, even stop instrumenting some functions that you know are cold in order to reduce code size increase. But I never played with that.

yavtuk commented 2 years ago

Hi @rafaelauler, thanks for the answer, but my point is about the number of optimization passes performed during instrumentation. When I build an instrumented binary to collect a profile, for example ./llvm-bolt ./main -o ./main.instr -instrument [options], I see that the following passes are performed:

BOLT-INFO: Starting pass: instrumentation
BOLT-INFO: Starting pass: print-stats
BOLT-INFO: Starting pass: validate-internal-calls
BOLT-INFO: Starting pass: strip-rep-ret
BOLT-INFO: Starting pass: indirect-call-promotion
BOLT-INFO: Starting pass: inlining
BOLT-INFO: Starting pass: PLT call optimization
BOLT-INFO: Starting pass: reorder-blocks
BOLT-INFO: Starting pass: eliminate-unreachable
BOLT-INFO: Starting pass: split-functions
BOLT-INFO: Starting pass: loop-inversion-opt
BOLT-INFO: Starting pass: fix-branches
BOLT-INFO: Starting pass: reorder-functions
BOLT-INFO: Starting pass: simplify-conditional-tail-calls
BOLT-INFO: Starting pass: peepholes
BOLT-INFO: Starting pass: aligner
BOLT-INFO: Starting pass: reorder-data
BOLT-INFO: Starting pass: finalize-functions
BOLT-INFO: Starting pass: frame-optimizer
BOLT-INFO: Starting pass: alloc-combiner
BOLT-INFO: Starting pass: retpoline-insertion
BOLT-INFO: Starting pass: assign-sections
BOLT-INFO: Starting pass: patch-entries
BOLT-INFO: Starting pass: inst-lowering
BOLT-INFO: Starting pass: lower-annotations
BOLT-INFO: Finished pass: lower-annotations

Do we need all these passes, or can we use only a few of them?

for example,

BOLT-INFO: Starting pass: instrumentation
BOLT-INFO: Starting pass: fix-branches
BOLT-INFO: Starting pass: assign-sections
BOLT-INFO: Starting pass: patch-entries
BOLT-INFO: Starting pass: inst-lowering
BOLT-INFO: Starting pass: lower-annotations
BOLT-INFO: Finished pass: lower-annotations

rafaelauler commented 2 years ago

I don't think most of these passes are really running. For example, I know that frame-optimizer, reorder-blocks, etc., are for sure not running. They might be getting scheduled by the binary pass manager, but each pass is probably just checking whether it should run and then doing nothing.

yavtuk commented 2 years ago

There is the -conservative-instrumentation option, and it may be worthwhile to skip all unnecessary optimization passes when that option is used.

rafaelauler commented 2 years ago

I took a closer look at this to investigate why we are printing all these passes when running with no flags, and it is true, these passes are indeed being started.

However, most of the passes in this list are starting because of a pattern that we use when a pass has multiple operating modes. For example, in the frame optimizer we have -frame-opt=none (default, disabled), -frame-opt=hot and -frame-opt=all. The pass is then set to always run, but the first check it does, in FrameOptimizer.cpp:225, is to skip running if -frame-opt=none (the default). This check is very fast, and there is no real overhead in starting the pass.

This is also true for indirect call promotion, inlining, PLT call optimization, retpoline, reorder basic blocks, reorder functions and reorder data. That accounts for all the expensive passes in this list, which are not really running.
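The scheduling pattern described above can be sketched with a small simulation (illustrative Python, not BOLT's actual C++ code; all names here are invented). The pass manager always schedules and announces the pass, but a disabled mode bails out on its first check before doing any work:

```python
# Toy model of a multi-mode pass: always scheduled, but -frame-opt=none
# (the default) makes it return immediately after logging "Starting pass".

def run_frame_optimizer(mode, functions, log):
    """Stand-in for a pass entry point with modes none/hot/all."""
    log.append("BOLT-INFO: Starting pass: frame-optimizer")
    if mode == "none":
        return 0                # registered but effectively a no-op
    processed = 0
    for _ in functions:         # placeholder for the real optimization work
        processed += 1
    return processed

def demo():
    log = []
    funcs = list(range(100))
    skipped = run_frame_optimizer("none", funcs, log)
    ran = run_frame_optimizer("all", funcs, log)
    return skipped, ran, log
```

Both invocations print a "Starting pass" line, so the log alone cannot tell you whether the pass did real work.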

Passes that are indeed running:

Also bear in mind that all these passes are quite fast compared to instrumentation. I ran instrumentation with -time-opts on a test input that is moderately large and got:

===-------------------------------------------------------------------------===
                         Binary Function Pass Manager
===-------------------------------------------------------------------------===
Total Execution Time: 33.7051 seconds (13.6369 wall clock)

 ---User Time---  --System Time--  --User+System--  ---Wall Time---  --- Name ---
 6.8728 ( 21.1%)  1.0914 ( 99.5%)  7.9642 ( 23.6%)  7.9649 ( 58.4%)  instrumentation
 1.4365 (  4.4%)  0.0054 (  0.5%)  1.4419 (  4.3%)  1.4419 ( 10.6%)  fix-branches
 0.8722 (  2.7%)  0.0000 (  0.0%)  0.8722 (  2.6%)  0.8723 (  6.4%)  simplify-conditional-tail-calls
 0.8076 (  2.5%)  0.0000 (  0.0%)  0.8076 (  2.4%)  0.8076 (  5.9%)  validate-internal-calls
 0.7675 (  2.4%)  0.0000 (  0.0%)  0.7675 (  2.3%)  0.7675 (  5.6%)  lower-annotations
 0.7274 (  2.2%)  0.0000 (  0.0%)  0.7274 (  2.2%)  0.7275 (  5.3%)  eliminate-unreachable
 0.5230 (  1.6%)  0.0000 (  0.0%)  0.5230 (  1.6%)  0.5230 (  3.8%)  inst-lowering
 0.1887 (  0.6%)  0.0000 (  0.0%)  0.1887 (  0.6%)  0.1887 (  1.4%)  strip-rep-ret
13.1029 ( 40.2%)  0.0000 (  0.0%) 13.1029 ( 38.9%)  0.1736 (  1.3%)  aligner
 7.2614 ( 22.3%)  0.0000 (  0.0%)  7.2614 ( 21.5%)  0.1216 (  0.9%)  finalize-functions
 0.0313 (  0.1%)  0.0000 (  0.0%)  0.0313 (  0.1%)  0.0313 (  0.2%)  assign-sections
 0.0091 (  0.0%)  0.0001 (  0.0%)  0.0092 (  0.0%)  0.0092 (  0.1%)  print-stats
 0.0078 (  0.0%)  0.0000 (  0.0%)  0.0078 (  0.0%)  0.0078 (  0.1%)  patch-entries
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  reorder-data
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  frame-optimizer
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  retpoline-insertion
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  alloc-combiner
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  indirect-call-promotion
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  inlining
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  reorder-functions
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  PLT call optimization
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  reorder-blocks
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  peepholes
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  loop-inversion-opt
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  split-functions
32.6082 (100.0%)  1.0969 (100.0%) 33.7051 (100.0%) 13.6369 (100.0%)  Total

yavtuk commented 2 years ago

Hello @rafaelauler, thanks for the details.

You are right, it doesn't make sense to exclude the passes. For example, in our case the statistics are as follows (1.3GB binary file, AArch64, Golang):

(some metrics were dropped for passes which are not really running)

===-------------------------------------------------------------------------===
                         Binary Function Pass Manager
===-------------------------------------------------------------------------===
Total Execution Time: 6202.6147 seconds (1374.9821 wall clock)

  ---User Time---    --System Time--   --User+System--    ---Wall Time---   --- Name ---
  665.3315 ( 11.0%)   10.1796 (  6.3%)  675.5111 ( 10.9%)  676.3207 ( 49.2%)  long-jmp
  493.7899 (  8.2%)   10.8264 (  6.7%)  504.6163 (  8.1%)  505.2126 ( 36.7%)  golang
   53.8213 (  0.9%)    6.5805 (  4.1%)   60.4019 (  1.0%)   60.4882 (  4.4%)  instrumentation
   27.8701 (  0.5%)    1.5198 (  0.9%)   29.3899 (  0.5%)   29.4247 (  2.1%)  eliminate-unreachable
 2454.7673 ( 40.6%)    0.3722 (  0.2%) 2455.1395 ( 39.6%)   26.1258 (  1.9%)  FillBBsOffsets
 2254.8762 ( 37.3%)   85.1812 ( 52.5%) 2340.0574 ( 37.7%)   24.8342 (  1.8%)  aligner
   21.0565 (  0.3%)    0.4152 (  0.3%)   21.4717 (  0.3%)   21.4957 (  1.6%)  veneer-elimination
   13.6939 (  0.2%)    0.5314 (  0.3%)   14.2253 (  0.2%)   14.2434 (  1.0%)  lower-annotations
    7.9121 (  0.1%)    0.6593 (  0.4%)    8.5714 (  0.1%)    8.5813 (  0.6%)  fix-branches
    2.8752 (  0.0%)    0.9675 (  0.6%)    3.8427 (  0.1%)    3.8478 (  0.3%)  adr-relaxation
    1.4538 (  0.0%)    0.0517 (  0.0%)    1.5054 (  0.0%)    1.5071 (  0.1%)  reorder-functions
    1.0318 (  0.0%)    0.2320 (  0.1%)    1.2638 (  0.0%)    1.2653 (  0.1%)  strip-rep-ret
    5.5986 (  0.1%)   44.4835 ( 27.4%)   50.0821 (  0.8%)    0.6640 (  0.0%)  RemoveNops
   36.0017 (  0.6%)    0.2066 (  0.1%)   36.2082 (  0.6%)    0.6430 (  0.0%)  finalize-functions
 6040.4076 (100.0%)  162.2071 (100.0%) 6202.6147 (100.0%) 1374.9821 (100.0%)  Total

rafaelauler commented 2 years ago

Thanks for sharing, Alexey. Looks like long-jmp is a big problem for these larger aarch64 binaries. How large is the text size?

yavtuk commented 2 years ago

The text size is ~128MB.

rafaelauler commented 2 years ago

We can implement other algorithms for long-jmp which generate worse code layout, but which will run faster. Currently long-jmp will try to automatically generate code with the least amount of stubs and that can be quite expensive (and re-iterate a few times). If you are interested in reducing that time, I would start by measuring how many times the pass is re-starting to recompute stuff (how many iterations as reported by https://github.com/facebookincubator/BOLT/blob/main/bolt/lib/Passes/LongJmp.cpp#L614 )
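As a toy model of why the pass may re-iterate (illustrative Python, not BOLT's implementation; the sizes and names here are invented): each inline stub grows the code, which can push other branches out of range, so the pass loops until it reaches a fixed point.

```python
# Toy fixed-point model of inline stub insertion. Each branch is represented
# only by its span in bytes; inserting a stub grows the code, which stretches
# every other span, possibly forcing another round of stub insertion.

def relax(branches, branch_range=1000, stub_size=16):
    """branches: list of branch spans (bytes between source and target).
    Returns (total_stubs_inserted, iterations_until_fixed_point)."""
    spans = list(branches)
    stubbed = [False] * len(spans)
    iterations = 0
    while True:
        iterations += 1
        new_stubs = 0
        for i in range(len(spans)):
            if stubbed[i] or spans[i] <= branch_range:
                continue
            stubbed[i] = True            # insert an inline stub for branch i
            new_stubs += 1
            for j in range(len(spans)):  # code growth stretches other spans
                if j != i:
                    spans[j] += stub_size
        if new_stubs == 0:               # fixed point: no new stubs needed
            return sum(stubbed), iterations
```

In this model, relax([990, 1005]) takes three iterations: the second branch needs a stub, the resulting growth pushes the first branch out of range, and a final round confirms nothing else changed.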

In the past I saw that linkers use a different and faster approach. Instead of inserting stubs inside the functions, as we do (which grows the code and may in turn require more stubs), they reserve a specific, fixed-size area for stubs at an address close to a set of functions (within branch range), work under the assumption that this area never grows, and never re-iterate. If the fixed area runs out of space for stubs, they just crash; that's what GNU ld would do, at least, and you had to manually increase the stub area size if your binary grew too big. It's an odd burden to put on the user (guessing the stub area size, and getting a cryptic error when the binary outgrows it), but at least it runs much faster.
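The fixed-area scheme could be sketched as follows (a minimal, hypothetical Python model, not how any particular linker implements it): out-of-range calls share stubs in a reserved region of fixed capacity, and we fail loudly when it fills up instead of re-iterating.

```python
# Hypothetical model of a reserved stub area. Far calls are routed through
# stubs in a fixed region near the functions; because the region never grows,
# there is no fixed-point iteration, only a hard capacity limit.

BRANCH_RANGE = 128 * 1024 * 1024   # e.g. AArch64 direct branches reach +/-128MB

def place_calls(calls, stub_area_base, stub_area_slots):
    """calls: list of (source_addr, target_addr) pairs.
    Returns (direct_branches, stubs_used)."""
    stubs = {}                      # target_addr -> stub slot (shared by callers)
    direct = 0
    for src, dst in calls:
        if abs(dst - src) <= BRANCH_RANGE:
            direct += 1             # in range: plain direct branch, no stub
            continue
        if dst not in stubs:        # out of range: route through a stub
            if len(stubs) == stub_area_slots:
                raise RuntimeError("stub area exhausted; increase its size")
            stubs[dst] = len(stubs)
        # every caller must itself be able to reach the stub area
        assert abs(stub_area_base - src) <= BRANCH_RANGE
    return direct, len(stubs)
```

Note how two far calls to the same target share one stub, which is the same sharing that the BOLT log below reports ("Shared 3188 times").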

yavtuk commented 2 years ago

Yes, it's worth adding one more way to insert the stubs. Here are the statistics for the pass; the number of iterations does not seem very large.

BOLT-INFO: Inserted 785940 stubs in the hot area and 0 stubs in the cold area. Shared 3188 times, iterated 4 times.

It's strange that we don't have any stubs in the cold area; I need time to figure out why that happened.