facebookarchive / BOLT

Binary Optimization and Layout Tool - A linux command-line utility used for optimizing performance of binaries

Issues with optimizing a shared object #175

Open kmod opened 3 years ago

kmod commented 3 years ago

We've been using bolt successfully on our binary, but when we compile our program with -fPIC and link as a shared object and apply bolt to it, the result doesn't work correctly. I'm not exactly sure what's going on but the two things I've noticed are:

I assume these are related and imply that we didn't get good output from bolt, but I can't be sure.

Is there anything different we should be doing for optimizing a shared object / PIC code?

Here's how we produced the files:

LD_PRELOAD=libpython3.8-pyston2.2d.so.1.0.prebolt perf record -e cycles:u -j any,u -o libpython3.8-pyston2.2d.so.1.0.perf -- ./python3 run_profile_task.py
perf2bolt -p libpython3.8-pyston2.2d.so.1.0.perf -o libpython3.8-pyston2.2d.so.1.0.fdata libpython3.8-pyston2.2d.so.1.0.prebolt
llvm-bolt libpython3.8-pyston2.2d.so.1.0.prebolt -o libpython3.8-pyston2.2d.so.1.0 -data=pyston/build/cpython_dbgshared_install/usr/lib/libpython3.8-pyston2.2d.so.1.0.fdata -update-debug-sections -reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -frame-opt=hot -use-gnu-stack
./python3 -c 'print("success")'

# This works:
LD_PRELOAD=libpython3.8-pyston2.2d.so.1.0.prebolt ./python3 -c 'print("success")'

Here are the files, let me know if there's any other info that I could provide that would be helpful.

maksfb commented 3 years ago

Thanks for reporting the issue.

The only thing known not to work with .so's is -split-eh, but that should be turned off automatically, and you will see a warning. -inline-all can mess up debug info, but likely only to a limited extent. I would start by disabling all optimizations except code ordering and checking whether the binary works. I would check it myself, but I need to set up a virtual machine first.
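
As a rough sketch of that suggestion, the helper below prints (without running) an llvm-bolt command that keeps only the two ordering flags from the original invocation, -reorder-blocks=cache+ and -reorder-functions=hfsort+. The function name bolt_order_only_cmd is hypothetical, and treating exactly these two flags as "code ordering" is an assumption.

```shell
# Hypothetical dry-run helper: prints an llvm-bolt command line that keeps only
# the block- and function-reordering flags from the original command.
# Arguments: $1 = input binary, $2 = output binary, $3 = .fdata profile.
bolt_order_only_cmd() {
  printf 'llvm-bolt %s -o %s -data=%s -reorder-blocks=cache+ -reorder-functions=hfsort+\n' \
    "$1" "$2" "$3"
}

# Print the reduced command for the files named in the report above.
bolt_order_only_cmd \
  libpython3.8-pyston2.2d.so.1.0.prebolt \
  libpython3.8-pyston2.2d.so.1.0 \
  libpython3.8-pyston2.2d.so.1.0.fdata
```

If the reduced command yields a working library, the dropped flags can be added back one at a time to find the one that breaks it.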

kmod commented 3 years ago

Oh good idea, I removed all the command line flags:

$ llvm-bolt libpython3.8-pyston2.2d.so.1.0.prebolt -o libpython3.8-pyston2.2d.so.1.0
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 0c14e20238604a4c05e174e71676857d45c60a0f
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x600000, offset 0x600000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-INFO: enabling lite mode
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _PyEval_EvalFrameDefault
BOLT-INFO: 0 out of 7274 functions in the binary (0.0%) have non-empty execution profile
BOLT-INFO: the input contains 831 (dynamic count : 0) opportunities for macro-fusion optimization that are going to be fixed
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: SCTC: patched 2 tail calls (2 forward) tail calls (0 backward) from a total of 2 while removing 0 double jumps and removing 2 basic blocks totalling 10 bytes of code. CTCs total execution count is 0 and the number of times CTCs are taken is 0.
BOLT-INFO: patched build-id (flipped last bit)

And the result still crashes:

$ ./python3 -c '1'
python3: ../../../Objects/dictobject.c:883: lookdict_unicode_nodummy: Assertion `ix != DKIX_DUMMY' failed.

It also crashes if I still pass the profile file.

maksfb commented 3 years ago

Thanks for trying that. I will take a look.

kmod commented 3 years ago

When I pass the --update-debug-sections flag and no other flags, the source locations are correct now, but there are still a couple bad frames in the gdb backtrace. I believe that one of the two functions in question is _PyEval_EvalFrameDefault, which was mentioned during the bolt run as being notable for having a PIC jump table, in case that's helpful.

maksfb commented 3 years ago

There is an issue with what looks like a computed goto in _PyEval_EvalFrameDefault. I suspect the effect is limited to just this function (interpreter loop?), so you can try to disable its optimization with -skip-funcs=_PyEval_EvalFrameDefault while I think of a proper solution.

kmod commented 3 years ago

That didn't quite do it, but after skipping every function named in a "BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function XXX" message, I got things working.
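
Collecting those function names by hand can be scripted. This is a minimal sketch, assuming the BOLT output was saved to a file (for example by redirecting both stdout and stderr to bolt.log) and that -skip-funcs accepts a comma-separated list of names. The helper name bolt_pic_jt_funcs and the bolt.log path are hypothetical.

```shell
# Hypothetical helper: read BOLT log text on stdin and print the names of all
# functions reported with a PIC jump table as one comma-separated list.
bolt_pic_jt_funcs() {
  sed -n 's/^BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function \(.*\)$/\1/p' |
    sort -u |
    paste -sd, -
}

# Demo on two sample log lines (both copied from the log earlier in the thread).
printf '%s\n' \
  'BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _PyEval_EvalFrameDefault' \
  'BOLT-INFO: enabling lite mode' |
  bolt_pic_jt_funcs
```

The result could then be passed to the next run as -skip-funcs="$(bolt_pic_jt_funcs < bolt.log)".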

Just in case it's relevant, we compile _PyEval_EvalFrameDefault with -Os.

maksfb commented 3 years ago

That's good to know, although it's quite unexpected. You can also disable processing of functions with jump tables using the -jump-tables=none option.