facebookarchive / BOLT

Binary Optimization and Layout Tool - A linux command-line utility used for optimizing performance of binaries
2.51k stars 176 forks source link

[BOLT][Instrumentation] optimization load/store X86::eflags #226

Closed yavtuk closed 2 years ago

yavtuk commented 2 years ago

Summary: This commit uses reviews.llvm.org/D6629 as reference to optimize load/store X86::EFLAGS for instrumentation snippet by using lahf/sahf instructions instead of pushf/popf.

facebook-github-bot commented 2 years ago

Hi @yavtuk!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

facebook-github-bot commented 2 years ago

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

facebook-github-bot commented 2 years ago

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

yavtuk commented 2 years ago

Thanks for this patch, that's very exciting! Have you measured any performance improvements?

Yes, I used the redis and initialization time at startup reduced by 2-3 times.

ayermolo commented 2 years ago

Thanks for this patch, that's very exciting! Have you measured any performance improvements?

Yes, I used the redis and initialization time at startup reduced by 2-3 times.

I am curious, what architecture is this on?

yavtuk commented 2 years ago

Thanks for this patch, that's very exciting! Have you measured any performance improvements?

Yes, I used the redis and initialization time at startup reduced by 2-3 times.

I am curious, what architecture is this on?

Hi Alexander, checked on the Intel(R) Xeon(R) Gold 6230N (Cascade lake) and it will be great if you can share with us the impact on the another ​arch families if you have the one of course. Thanks in advance

rafaelauler commented 2 years ago

Hi Alexey, thanks for updating with my suggestion! I noticed that in this new version of the patch, we are required to drop the pre-allocation (with .resize) of the whole vector containing the instructions in order to make it fit the MCPlusBuilder model for create* methods where the create method will do the allocation for you (instead of the caller, which is the Instrumentation class in this case). That is actually pretty expensive because we are invoking malloc several times to resize the vector as we go. I measured the time required to run the instrumentation pass in one of our benchmarks, and the instrumentation pass time increases from 33s to 56s.

So let's go back to your original patch, but do some effort to make it clear that these methods are either X86 only or write a more generic, processor-independent description by changing createLahf to be named createX86Lahf, createSahf to be createX86Sahf, createClearRegisterByMovInstr to be createClearRegWithNoFlagsUpdate.

yavtuk commented 2 years ago

Hi Alexey, thanks for updating with my suggestion! I noticed that in this new version of the patch, we are required to drop the pre-allocation (with .resize) of the whole vector containing the instructions in order to make it fit the MCPlusBuilder model for create* methods where the create method will do the allocation for you (instead of the caller, which is the Instrumentation class in this case). That is actually pretty expensive because we are invoking malloc several times to resize the vector as we go. I measured the time required to run the instrumentation pass in one of our benchmarks, and the instrumentation pass time increases from 33s to 56s.

So let's go back to your original patch, but do some effort to make it clear that these methods are either X86 only or write a more generic, processor-independent description by changing createLahf to be named createX86Lahf, createSahf to be createX86Sahf, createClearRegisterByMovInstr to be createClearRegWithNoFlagsUpdate.

Ok, no problem, I thought that it would be little bit more flexible but performance has higher priority here 🙂

yavtuk commented 2 years ago

hi @rafaelauler, the new fixes is uploaded

yavtuk commented 2 years ago

@rafaelauler give me some time to prepare another one implementation, we think it will be more readable to create two functions and move vector resizing inside the one Thanks

yavtuk commented 2 years ago

hi @rafaelauler, new implementation is uploaded, could you look into it.

yavtuk commented 2 years ago

These changes look great, Alexey, thanks for working on them! I've left a few small comments, let me know what do you think.

hi @rafaelauler fully agree, fixed and uploaded, thanks.

rafaelauler commented 2 years ago

Thanks @yavtuk ! I'll be landing this with a minor modification, fixing a typo in the comments and removing the call to "getAliasSized", replacing it with the direct reference to X86::AL and X86::RAX, since we are already in the X86 target library implementation.

I'll also add the note: "Despite some efforts to mitigate the overhead, this is known to increase the instrumentation pass runtime from 33s to about 49s in HHVM. That is because the instrumentation pass is doing more work: creating more instructions that need to be inserted in a given instrumented location. But because the generated instrumented code is faster, it pays off."

maksfb commented 2 years ago

@yavtuk I see a great (almost 2x) speedup while running an instrumented Clang-7 on Broadwell. However, the resulting .fdata is missing a tiny number of samples. I don't think it will have any effect on the performance of the optimized binary. Nevertheless, do you know why the samples could be missing and do you notice the same?

yavtuk commented 2 years ago

@maksfb So far, there have been no problems with it. Actually the main contribution to performance is related to abandon from popf instruction and I saw minor difference in profiles but since it didn’t affect the end result, I didn’t address the cause. I'll try to look at it soon

rafaelauler commented 2 years ago

I don't think clang is running multithreaded code, right? If it is, differences in fdata among two different runs are expected because indirect calls attempt to acquire a lock and fail after some time if the lock is busy.

maksfb commented 2 years ago

Right, I’m not aware of any multithreading happening in Clang. I’ll take a closer look at the mismatched branches.

maksfb commented 2 years ago

The mismatched branches are direct conditional ones.

yavtuk commented 2 years ago

Hello @maksfb, I found the few places that are making changes to a profile, but I can’t explain yet, just need a little bit more time. For the first time I left only the use of pushf/popf instructions as wrapper of incrementation and replaced TryLock back with Lock. There appear to be two obvious points where there is the dependency from the options:ConservativeInstrumentation in Instrumentation.cpp: 1) // Modified BinaryFunction:dfs() to build a spanning tree 2) // Instrument spanning tree leaves. If we comment on this functionality, the profiles will still be different, perhaps we can exclude it from consideration leave it for a while in a commented state. But there is another one place where the options::ConservativeInstrumentation impact on the profile, it's createCallDescription function.

In particular, if we remove the dependency from ConservativeInstrumentation for ForceInstrumentation, i.e. bool ForceInstrumentation = IsInvoke;
then profiles will be the same, except for the values of the counters.

Do you have any idea how that could be or we should leave it as is?

rafaelauler commented 2 years ago

I was under the impression that what was being evaluated was parent commit vs. this commit, under the same flags, that's why I didn't say anything.

But if you test this with -conservative on/off, the profile will be different and that's expected. The spanning tree-method (skipping some counters to increase speed) does lead to different profile. Ideally, it shouldn't. But it does, because we don't always build a perfect CFG. Same for instrumenting direct calls (forceinstrumentation thing).

yavtuk commented 2 years ago

@rafaelauler thanks a lot

rafaelauler commented 2 years ago

On this subject, one of the reasons the "conservative" flag was created was actually to offer a mode when profile accuracy is increased at a great cost to runtime (super slow), just to check that we are not missing anything by generating an imperfect profile.

In this diff, we don't need to put the new code sequence under "no conservative" mode because it is not causing the profile to miss anything. So if things look good, I'll commit this without the old code sequence, if that's OK.