TsudaKageyu / minhook

The Minimalistic x86/x64 API Hooking Library for Windows
http://www.codeproject.com/KB/winsdk/LibMinHook.aspx
Other
4.43k stars 897 forks source link

Minor performance increase, cache misalignment with MEMORY_BLOCK #80

Closed BigJim closed 3 years ago

BigJim commented 5 years ago

Not a breaking problem, but I noticed the MEMORY_BLOCK struct is not even half cache (16 byte) aligned. If you at minimal align 16 it (make the header size % 16 = 0), then you will guarantee each allocation block (used by trampolines) to be at minimal align 16 as well.

Notice that Window compile/linker tools like MSVC align functions to at least align 16 by default (padding with 0xCC or 0x90 as you know). You can see for your self in a debugger or using a dissembler like IDA Pro.

Maybe less of an performance gain on current generation CPUs but the last time I did profiling (admittedly a long time ago) the difference was just a handful of cycles for aligned vs over a hundred cycles for misaligned branch targets!

Also when it comes to math calculating a block slot address, it's a simple power of 2 for aligned vs the instructions need to multiply or divide the odd sizes of 12 or 18 bytes (for 32bit and 64bit mode respectively).

You can either just add adding compile switch padding (4 bytes for 32bit mode to round to 16, or 14 to round up to 32bytes for 64bit mode), or use a compiler specific construct that will automatically pad the structure length for you.

Note also in either case of fixing the alignment you don't even loose a slot. I.E. 4096-16/32 = 127 slots regardless if size(MEMORY_BLOCK) is 12 or 16 bytes (for 32bit mode example).

m417z commented 3 years ago

Looks like what you're suggesting is already the way MinHook works. The relevant code that prepares a newly allocated memory block can be seen here: https://github.com/TsudaKageyu/minhook/blob/248aad5d9b719e7db264d920837b96e34f9d553b/src/buffer.c#L228-L237

The code iterates over the allocated block in steps of sizeof(MEMORY_SLOT) bytes (32 or 64 depending on the architecture).

It can also be seen in the resulting memory, here's a partially used memory block: image

The block begins with a MEMORY_BLOCK structure, followed by unused MEMORY_SLOT items, all aligned to 64 bits, and all part of the pFree linked list.

Further in the memory block, you can find used MEMORY_SLOT items, which contain the trampoline and the relay function.

BigJim commented 3 years ago

It's been a while, but I'm sure it was what I saw while debugging. Look at the top of the referenced C file and see that MEMORY_BLOCK is 3 x 32bit values thus 12bytes for 32bit builds, and for 64bit builds it's 8+8+4 for 20bytes (not sure why I said 18 a year ago).

It's going to depend on what the current compiler structure pack setting is if they end up at a 16 byte aligned size or not. And note the compiler struct pack defaults are different for .C vs .CPP files.

Offhand (without testing) seems like editing the define, adding a "declspec" should force it and tells the compiler you intend to use it with 16 byte alignment regardless of the current struct pack setting: "typedef struct declspec(align(16)) _MEMORY_BLOCK"

Looks like MiniHook is MSVC dependant, otherwise might be different "__declspec" or similar for GCC et al.

m417z commented 3 years ago

That's true that the size of MEMORY_BLOCK is not a multiple of 16. But MEMORY_BLOCK is merely the header of the memory block, followed by MEMORY_SLOT items, each allocated at the next multiple of MEMORY_SLOT_SIZE. The arrangement is manual, and sizeof(MEMORY_SLOT_SIZE) is not used for calculating the offset for the data that follows. That's what I see from the code and from the actual memory layout.

Looks like MiniHook is MSVC dependant

As far as I know, MinHook can be compiled by all three major compilers.