facebookarchive / BOLT

Binary Optimization and Layout Tool - A linux command-line utility used for optimizing performance of binaries

LLVM ERROR: Undefined temporary symbol #33

Open J-cztery opened 6 years ago

J-cztery commented 6 years ago

The binary was compiled with the Intel Compiler using -ffreestanding to get rid of the __intel memcpy replacement.

build/bin/llvm-bolt ./prog -o prog.bolt -data=./perf.fdata -report-stale -reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3 -split-all-cold -split-eh -dyno-stats

BOLT-INFO: Target architecture: x86_64
BOLT-INFO: first alloc address is 0x400000
BOLT-INFO: creating new program header table at address 0x800000, offset 0x400000
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZL28read_encoded_value_with_basehmPKhPm/eh_personality.o/1(*2)
BOLT-INFO: Functions with stale profile:
(...)
BOLT-INFO: 41 functions out of 2700 simple functions (1.5%) have non-empty execution profile.
BOLT-INFO: 15 non-simple function(s) have profile.
BOLT-INFO: 7 (17.1% of all profiled) functions have invalid (possibly stale) profile. Use -report-stale to see the list.
BOLT-INFO: profile for 1 objects was ignored
BOLT-WARNING: 72 functions will trap on entry (use -v=1 to see the list).
BOLT-INFO: the input contains 148 (dynamic count : 393) missed opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: removed 77 'repz' prefixes with estimated execution count of 0 times.
BOLT-INFO: basic block reordering modified layout of 16 (0.55%) functions
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: running hfsort+ for 41 functions
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

            51231450 : executed forward branches
            20845197 : taken forward branches
           321857469 : executed backward branches
           285240617 : taken backward branches
            16320904 : executed unconditional branches
               68038 : all function calls
               17180 : indirect calls
               13934 : PLT calls
          9582799018 : executed instructions
          3083100318 : executed load instructions
          1919618209 : executed store instructions
                 148 : taken jump table branches
           389409823 : total branches
           322406718 : taken branches
            67003105 : non-taken conditional branches
           306085814 : taken conditional branches
           373088919 : all conditional branches

            37180730 : executed forward branches (-27.4%)
                   0 : taken forward branches (-100.0%)
           335908189 : executed backward branches (+4.4%)
           299291394 : taken backward branches (+4.9%)
                 194 : executed unconditional branches (-100.0%)
               68038 : all function calls (=)
               17180 : indirect calls (=)
               13934 : PLT calls (=)
          9566476249 : executed instructions (-0.2%)
          3083100318 : executed load instructions (=)
          1919618209 : executed store instructions (=)
                 148 : taken jump table branches (=)
           373089113 : total branches (-4.2%)
           299291588 : taken branches (-7.2%)
            73797525 : non-taken conditional branches (+10.1%)
           299291394 : taken conditional branches (-2.2%)
           373088919 : all conditional branches (=)

BOLT-INFO: SCTC: patched 37 tail calls (37 forward) tail calls (0 backward) from a total of 37 while removing 0 double jumps and removing 35 basic blocks totalling 175 bytes of code. CTCs total execution count is 114 and the number of times CTCs are taken is 98.
LLVM ERROR: Undefined temporary symbol
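
The percentages in the second dyno-stats table can be sanity-checked against the first. The sketch below assumes each percentage is the relative change from the first table's count to the second's, rounded to one decimal; that reading is inferred from the numbers in the log, not taken from BOLT's source. The helper name `percent_change` is made up for illustration.

```python
# Rows copied from the two dyno-stats tables in the log above:
# name -> (count in first table, count in second table)
ROWS = {
    "executed forward branches": (51231450, 37180730),
    "taken forward branches": (20845197, 0),
    "executed backward branches": (321857469, 335908189),
    "executed unconditional branches": (16320904, 194),
    "all function calls": (68038, 68038),
    "executed instructions": (9582799018, 9566476249),
    "total branches": (389409823, 373089113),
    "taken branches": (322406718, 299291588),
}


def percent_change(before: int, after: int) -> str:
    """Format the relative change in the style used by the log (assumed)."""
    if before == after:
        return "(=)"
    if before == 0:
        return "(n/a)"  # avoid dividing by zero; does not occur in these rows
    delta = 100.0 * (after - before) / before
    return f"({delta:+.1f}%)"


for name, (before, after) in ROWS.items():
    print(f"{after:>20} : {name} {percent_change(before, after)}")
```

For these rows the recomputed values match the log. The drop in total branches (389409823 - 373089113 = 16320710) is also exactly the drop in executed unconditional branches (16320904 - 194).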

maksfb commented 6 years ago

Could you try adding -jump-tables=none option to BOLT?

J-cztery commented 6 years ago

With -jump-tables=none I get:

BOLT is unable to proceed because it couldn't properly understand this function.
If you are running the most recent version of BOLT, you may want to report this and paste this dump.
Please check that there is no sensitive contents being shared in this dump.

I'm not sure how to check that the dump contains no sensitive content...

maksfb commented 6 years ago

I may have a new version for you to try soon. Meanwhile, could you add -relocs=0 and remove -reorder-functions=hfsort+ and see if it helps?
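
A minimal sketch of the two command-line variants suggested in this thread, built from the invocation in the original report. It only assembles and prints argument lists and does not run BOLT. The helper `with_flags` is hypothetical, and applying each suggestion to the original command independently (rather than stacking them) is an assumption, since the thread does not say.

```python
import shlex

# Command line exactly as posted in the original report.
ORIGINAL = (
    "build/bin/llvm-bolt ./prog -o prog.bolt -data=./perf.fdata -report-stale "
    "-reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3 "
    "-split-all-cold -split-eh -dyno-stats"
)


def with_flags(command, add=(), remove=()):
    """Return the argument list with `remove` dropped and `add` appended."""
    args = [arg for arg in shlex.split(command) if arg not in remove]
    return args + [arg for arg in add if arg not in args]


# First suggestion: disable jump-table handling.
first_try = with_flags(ORIGINAL, add=["-jump-tables=none"])

# Second suggestion: turn off relocation mode and drop function reordering.
second_try = with_flags(
    ORIGINAL,
    add=["-relocs=0"],
    remove=["-reorder-functions=hfsort+"],
)

print(shlex.join(first_try))
print(shlex.join(second_try))
```

The printed lines can be pasted into a shell as-is. Every flag comes from the commands and suggestions quoted in this thread.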

J-cztery commented 6 years ago

Great! I was able to bolt this binary with -relocs=0 before, but I saw no improvement, and I understood it might give some speedup with relocs enabled. So I wanted to give it a go, even though I know this piece of code is memory-bandwidth bound. But it is the only thing that I can build as non-PIE/PIC.

maksfb commented 6 years ago

Unless the application is bound by the CPU front end (I$, iTLB), we typically don't expect noticeable gains. Sometimes you can get lucky with a code layout that affects the BTB hardware. Macro-fusion alignment might help if the original code was badly aligned.

Once we add full PIC/PIE support you can try BOLT on the rest of the code.

J-cztery commented 6 years ago

My code is not front-end bound; however, I do see a significant number of stalls caused by ICache misses and instruction page walks. Let me know when you have something that I could try. Thanks.