facebookarchive / BOLT

Binary Optimization and Layout Tool - A Linux command-line utility used for optimizing performance of binaries
Detecting and replacing calling conventions with vectorcall #95

Open LifeIsStrange opened 4 years ago

LifeIsStrange commented 4 years ago

Here's my understanding: functions in binaries are called using a calling convention, such as stdcall. For many years now, the fastcall convention has been becoming more widespread because it passes function arguments in registers, which is an order of magnitude faster than passing them on the stack. It shows significant performance gains in benchmarks (~25%) over older calling conventions. The trade-off is that it uses more registers, but it shouldn't need too many, and modern CPUs have bigger and bigger SRAM caches each year.
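To make the register-versus-stack difference concrete, here is a minimal sketch in Python that models where arguments land under 32-bit stdcall versus Microsoft's __fastcall. The function name `place_args` and the list-of-sizes input are hypothetical, chosen only for illustration. The one rule modeled is that __fastcall passes the first two arguments of 4 bytes or less in ECX and EDX and everything else on the stack, while stdcall passes everything on the stack.

```python
# Simplified model of 32-bit x86 argument placement (Microsoft conventions).
# `place_args` is a hypothetical helper, not part of any real tool.

def place_args(sizes, convention):
    """Return one location per argument: "ECX", "EDX" or "stack".

    sizes      -- argument sizes in bytes, in left-to-right order
    convention -- "stdcall" or "fastcall"
    """
    if convention not in ("stdcall", "fastcall"):
        raise ValueError(f"unknown convention: {convention}")

    # stdcall has no argument registers; fastcall has two.
    free_regs = ["ECX", "EDX"] if convention == "fastcall" else []
    locations = []
    for size in sizes:
        # Only arguments of 4 bytes or less are eligible for a register.
        if free_regs and size <= 4:
            locations.append(free_regs.pop(0))
        else:
            locations.append("stack")
    return locations


if __name__ == "__main__":
    args = [4, 8, 4, 4]  # e.g. int, double, int, int
    print("stdcall :", place_args(args, "stdcall"))
    print("fastcall:", place_args(args, "fastcall"))
```

In this model the 8-byte argument is skipped over and the next small argument still gets EDX, so only two of the four arguments avoid the stack.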

So the "easy" yet significant performance win would be to 1) detect the pattern of the calling convention used in the binary and 2) if it's not fastcall, replace it with fastcall.

That being said, there is a newer calling convention from Microsoft, available since 2013, named vectorcall. It is exactly the same as fastcall but also uses registers for more advanced types such as floats, doubles, SIMD vectors, and to some extent structs/composite types. Therefore vectorcall brings the performance advantage of fastcall to even more functions. BTW, Python is switching to it: https://www.python.org/dev/peps/pep-0590/

So detecting the use of fastcall and other calling conventions in a binary and replacing them with vectorcall would be even better, especially since almost no language uses vectorcall (which is a shame, probably because its existence is not yet widely known).

https://docs.microsoft.com/en-us/cpp/cpp/vectorcall?view=vs-2019

What do you think?

maksfb commented 4 years ago

The x86-64/AMD64 calling convention is quite efficient, as it's good at utilizing registers for argument passing and for return values. However, there's always room for improvement, and knowing application specifics can lead to more efficient custom calling convention(s). To make the most of it, we'll need to add a register allocator/re-allocator to BOLT, and it has to be no worse than the one in the compiler.
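As a rough illustration of that register usage, the sketch below models argument placement under the System V AMD64 ABI, the convention used on Linux x86-64. It is deliberately simplified: it only handles scalar arguments, classified as "int" (integers and pointers) or "float" (float and double), and ignores structs, vectors, and varargs. The name `assign_sysv` and the string-based classification are assumptions made for this example.

```python
# Simplified model of System V AMD64 argument passing for scalar arguments.
# `assign_sysv` is a hypothetical helper; aggregates and varargs are ignored.

INT_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]   # integers and pointers
SSE_REGS = [f"xmm{i}" for i in range(8)]              # float and double


def assign_sysv(arg_kinds):
    """Map each argument kind ("int" or "float") to a register or "stack"."""
    next_int = 0
    next_sse = 0
    locations = []
    for kind in arg_kinds:
        if kind not in ("int", "float"):
            raise ValueError(f"unsupported argument kind: {kind}")
        if kind == "int" and next_int < len(INT_REGS):
            locations.append(INT_REGS[next_int])
            next_int += 1
        elif kind == "float" and next_sse < len(SSE_REGS):
            locations.append(SSE_REGS[next_sse])
            next_sse += 1
        else:
            # The matching register class is exhausted.
            locations.append("stack")
    return locations


if __name__ == "__main__":
    # e.g. f(int a, double b, void *c, float d)
    print(assign_sysv(["int", "float", "int", "float"]))
    # Seven integer arguments: the seventh no longer fits in a register.
    print(assign_sysv(["int"] * 7))
```

The two register classes are counted independently, so mixing integer and floating-point arguments does not use up slots in the other class.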

LifeIsStrange commented 4 years ago

"x86-64/AMD64 calling convention is quite efficient as it's good at utilizing registers for argument passing and for return values."

I doubt that it uses registers for floats/doubles/vectors/composite types.

"To make the most use out of it, we'll need to add register allocator/re-allocator to BOLT and it has to be not worse than the one in the compiler."

Sadly, I don't have the skill to implement such a thing. It's just a possible future optimization that you might consider, and yes, it might be non-trivial.