Closed Ayuto closed 3 years ago
Looks good to me after minimal testing!
My only concern, although I understand the reasoning behind this implementation, is the removal of the caching. I can certainly see that resulting in a significant increase in computing time for hooks that have many callbacks, and even more so for functions that have many parameters, because the conventions calculate their offsets on the fly by traversing the whole structure type by type, aligning them, etc. That means not only do the arguments have to be wrapped by Python every time, but quite a lot of looping and calculating is also done every time an argument is retrieved or mutated. Perhaps calculating those offsets only once on the convention's construction would be worth considering.
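To illustrate the concern, here is a minimal Python sketch (hypothetical type sizes and names, not the actual implementation) of what computing an argument's offset on the fly looks like: every access walks all preceding types, so the cost grows with the argument's index.

```python
# Hypothetical sketch of an "on the fly" offset calculation: finding
# argument `index` requires walking every preceding type, aligning each
# slot as it goes -- O(index) work on every single access.

TYPE_SIZES = {"int": 4, "pointer": 4, "double": 8}  # illustrative 32-bit sizes

def arg_offset(arg_types, index):
    """Compute the stack offset of argument `index` by traversing the
    whole argument list up to it, aligning each slot to its type size."""
    offset = 0
    for i, type_name in enumerate(arg_types):
        size = TYPE_SIZES[type_name]
        offset = (offset + size - 1) // size * size  # align to the type's size
        if i == index:
            return offset
        offset += size
    raise IndexError(index)
```

For example, with these assumed sizes, `arg_offset(["int", "double", "int"], 2)` walks past an aligned `int` and `double` before returning 16.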
I will do more testing and profiling when I get the chance.
Can't we just add another cache?
Yeah, that would probably work very well. But first, I would also like to do some measuring, because I don't think it has a high impact on performance.
I've tested the following code just now with and without cache:
```python
from memory import *
from time import *
from random import *
from sys import *

@Callback(Convention.CDECL, (DataType.POINTER,) * 64, DataType.VOID)
def callback(stack_data):
    t = time()
    for i in range(1000000):
        repr(stack_data)
    print(time() - t)

callback(*(Pointer(randint(0, maxsize)) for i in range(64)))
```
And here are the results I got:
```
Without cache: 94.56286525726318
With cache:    67.78554844856262
```
Basically, the larger the index, the more work is done in GetArgumentPtr, because every time the whole structure is recalculated in order to grab the argument's offset, etc. I can only imagine the result with a custom convention that forwards every call back to Python. But yeah, the cache was negating that effect to an extent. Regardless of whether we add a second cache as @CookStar suggested, I think this is definitely something we should look into optimizing. The types are known on construction; there is no reason to repeat that work on the fly, imo.
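A rough sketch of that idea (hypothetical names; not the real `Convention` class): compute all offsets once in the constructor, so each later lookup is a plain list index instead of a full traversal.

```python
# Sketch: the argument types are fixed at construction time, so the
# aligned offsets can be computed once up front. Names and type sizes
# are hypothetical, for illustration only.

TYPE_SIZES = {"int": 4, "pointer": 4, "double": 8}  # illustrative 32-bit sizes

class PrecomputedConvention:
    def __init__(self, arg_types):
        # One pass at construction: align each argument and record its offset.
        self.offsets = []
        offset = 0
        for type_name in arg_types:
            size = TYPE_SIZES[type_name]
            offset = (offset + size - 1) // size * size  # align to type size
            self.offsets.append(offset)
            offset += size

    def get_argument_offset(self, index):
        return self.offsets[index]  # O(1), no per-call traversal
```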
To be honest, I didn't expect such a big difference in performance. But yeah, the Python stuff steals a lot of CPU time.
The offset calculation also takes a little bit of time, but the cache negates the performance impact almost completely: on the first call to `repr`, every argument is cached and there are no more calls to `GetArgumentPtr`. I improved the `CDECL` convention a little bit by pre-calculating every offset and saving the results in a dynamically allocated array. `GetArgumentPtr` now looks like this:
```cpp
void* x86MsCdecl::GetArgumentPtr(int iIndex, CRegisters* pRegisters)
{
    return (void *) (pRegisters->m_esp->GetValue<unsigned long>() + m_pOffsets[iIndex]);
}
```
And these are my results (just using 100000 calls):
```
# Many calls to stack_data
Without cache:                          10.5
With cache:                             7.7
Without cache and improved convention:  9.3
With cache and improved convention:     7.6
```
Not sure if it's worth improving further. I also changed the test code to only call `repr` once, but the hooked function repeatedly:
```
# Many calls to the hooked function
Without cache:                          12.5
With cache:                             14.2
Without cache and improved convention:  11.3
With cache and improved convention:     14.0
```
So, if the hook is only called once, the cache obviously slows it down, because it has to be evaluated, but doesn't contain cached data.
> With cache: 7.7
> With cache and improved convention: 7.6
Well, that's unexpected. I would have expected all the looping, switching, calculating, etc. at run-time to be much more costly than simply retrieving an int from `base + (index * sizeof)`. Either there is something else in play here, or the compiler does a very good job at optimizing it and deserves a raise.
> So, if the hook is only called once, the cache obviously slows it down, because it has to be evaluated, but doesn't contain cached data.
Yeah, that's unfortunately the downside of caching; it's a significant improvement only if it is actually used, otherwise it's pretty much a wasted lookup. Though, perhaps it's a good trade-off if we consider that most plugins hook the same stuff and that, even if there is only one callback, scripters are not particularly known to re-use their data (eh?), so it optimizes those scenarios as well. For the latter, one could argue that this is their responsibility; and they would be entirely right.
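The trade-off can be sketched like this (hypothetical names): a lazy cache only pays off when the same argument is accessed more than once; a hook whose arguments are each read once still pays the lookup and population cost.

```python
# Sketch of the caching trade-off: the expensive fetch (think
# GetArgumentPtr plus Python wrapping) runs only on the first access,
# but every access still pays the dictionary lookup.

class CachedArgs:
    def __init__(self, fetch):
        self._fetch = fetch   # the expensive per-argument fetch
        self._cache = {}

    def get(self, index):
        if index not in self._cache:           # first access: miss + fetch
            self._cache[index] = self._fetch(index)
        return self._cache[index]              # later accesses: dict lookup only

calls = []
args = CachedArgs(lambda i: calls.append(i) or i * 4)
args.get(3)
args.get(3)
args.get(3)
# The expensive fetch ran exactly once; a single-use hook would still
# have paid the cache machinery for no benefit.
```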
However, I will completely backtrack and say that we should keep the cache removed. The reason is quite simple: while the improvement is major at 1M iterations, it is much smaller at 100k, as illustrated by your metrics, and would be next to nil at ~10 (and even assuming a server has 10 callbacks on the same hook is most likely an exaggeration). Meanwhile, SP without plugins has many single-callback hooks by default, and all of them would be slower just to anticipate the possibility that many plugins might manually hook the same functions. So yeah, I think the best call after all is: if we judge that a function is being hooked very often, let's just provide listeners that re-use the objects, like we did for the buttons, for example.
As for the PR, I've done quite a bit of testing and haven't noticed any issues; everything seems to be working fine on my side. The only issue I stumbled upon is the one referenced above, which isn't directly related to this PR, so I will just go ahead and approve it as is.
Ran this code again and https://github.com/Ayuto/DynamicHooks/commit/ed07f4e2c8c41ad18972694759c64fb4825d1c8f had quite a good impact after all:
81.87904047966003
Thanks for implementing it! :)
This PR adds the possibility to access registers in a post-hook that were saved before the pre-hook was called. It fixes the issue that is explained here: https://forums.sourcepython.com/viewtopic.php?p=7542#p7542
It's up to the programmer to decide whether he wants to access the pre-hook state or the post-hook state, because there are certainly scenarios where you need one or the other.
See also: https://github.com/Ayuto/DynamicHooks/tree/pre_post_registers
The only problem I can think of (in this implementation) is that it won't work very well for recursive functions. To fix that as well, we would need a stack (similar to what we did with the return address). Oh, and if the stack is modified, it also won't fix the problem.
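A minimal sketch of that stack idea (hypothetical names, Python for brevity): push a snapshot of the registers in the pre-hook and pop it in the post-hook, so recursive invocations each get the pre-hook state that matches their own call, last-in first-out like the call stack itself.

```python
# Sketch (hypothetical names): a stack of register snapshots so that
# nested/recursive invocations of a hooked function each pop their own
# matching pre-hook state.

class SavedRegisters:
    def __init__(self):
        self._stack = []

    def on_pre_hook(self, registers):
        # Snapshot the register values before the original function runs.
        self._stack.append(dict(registers))

    def on_post_hook(self, registers):
        # Pop the snapshot belonging to this invocation; return both
        # states so the callback can pick whichever it needs.
        pre_state = self._stack.pop()
        return pre_state, registers
```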
@jordanbriere What do you think?
Before we merge it, we definitely need to do some more tests and recompile DynamicHooks for Linux (Windows is already included in one of the commits).