The invocation listener of Frida Gum has several performance issues:
g_array_set_size
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 51.28      0.20     0.20                             _frida_g_array_set_size
 15.38      0.26     0.06                             _gum_function_context_begin_invocation
  7.69      0.29     0.03                             main
  5.13      0.31     0.02                             _gum_function_context_end_invocation
  5.13      0.33     0.02                             _init
  5.13      0.35     0.02                             gum_invocation_stack_push
  5.13      0.37     0.02                             plus
In my test, calls to g_array_set_size account for about half of the total run time (51.28% in the gprof flat profile).
Further investigation shows that the calls come from gum_invocation_stack_push and gum_invocation_stack_pop. Frida Gum uses a GArray to maintain a call stack, pushing an element in _gum_function_context_begin_invocation and popping it in _gum_function_context_end_invocation. As a result, g_array_set_size is called at least twice for every call to the hooked function.
Judging from its implementation, a memset may be used to clear the extra elements, so I think this is one of the causes of the performance issue.
The output of perf top also supports this.
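To illustrate why resizing on every push and pop can be costly, here is a minimal sketch in plain C without GLib. All names (`Entry`, `ClearingStack`, `FixedStack`, and their functions) are hypothetical and are not Gum's or GLib's actual code. The sketch assumes the array zeroes newly exposed elements when it grows, and contrasts that with a preallocated stack that only moves an index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stack entry; the size is an assumption for illustration. */
typedef struct {
  void *caller_ret_addr;
  unsigned char listener_data[1024];
} Entry;

/* Variant A: growable array that zeroes newly exposed elements on resize. */
typedef struct {
  Entry *data;
  size_t len;
  size_t capacity;
} ClearingStack;

static int clearing_stack_set_size(ClearingStack *s, size_t new_len) {
  if (new_len > s->capacity) {
    size_t new_cap = s->capacity ? s->capacity : 8;
    while (new_cap < new_len) new_cap *= 2;
    Entry *p = realloc(s->data, new_cap * sizeof(Entry));
    if (p == NULL) return -1;
    s->data = p;
    s->capacity = new_cap;
  }
  if (new_len > s->len) {
    /* Runs on every push and touches sizeof(Entry) bytes per new element. */
    memset(&s->data[s->len], 0, (new_len - s->len) * sizeof(Entry));
  }
  s->len = new_len;
  return 0;
}

static Entry *clearing_stack_push(ClearingStack *s) {
  if (clearing_stack_set_size(s, s->len + 1) != 0) return NULL;
  return &s->data[s->len - 1];
}

static void clearing_stack_pop(ClearingStack *s) {
  assert(s->len > 0);
  clearing_stack_set_size(s, s->len - 1);
}

/* Variant B: fixed-capacity stack; push and pop only move an index, and
 * the caller initializes just the fields it actually uses. */
#define FIXED_STACK_CAPACITY 64

typedef struct {
  Entry data[FIXED_STACK_CAPACITY];
  size_t len;
} FixedStack;

static Entry *fixed_stack_push(FixedStack *s) {
  if (s->len == FIXED_STACK_CAPACITY) return NULL;
  return &s->data[s->len++];
}

static void fixed_stack_pop(FixedStack *s) {
  assert(s->len > 0);
  s->len--;
}
```

In Variant A the cost of a push grows with the entry size because of the memset, while in Variant B it stays constant. Whether Gum could adopt something like Variant B depends on whether its code relies on entries being zeroed, which this sketch does not address.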
pthread_setspecific
When testing with multiple threads, pthread_setspecific costs 14.09% of total instruction reads, which is even more than _gum_function_context_begin_invocation itself.
pthread_setspecific and pthread_getspecific are the POSIX APIs for writing and reading thread-local variables. Frida Gum uses them (and g_private, which also calls the POSIX APIs) to maintain per-thread state, and accesses that state frequently when an invocation begins and ends.
The output of perf top supports this.
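As a rough illustration of the two access styles, the sketch below implements the same per-thread guard twice: once with POSIX thread-specific data and once with a C11 `_Thread_local` variable. The guard itself and all function names are hypothetical and only stand in for whatever per-thread state Gum keeps. Only the pthread calls are real APIs.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Variant A: POSIX thread-specific data. Every read and write is a
 * function call into the threading library. */
static pthread_key_t guard_key;
static pthread_once_t guard_key_once = PTHREAD_ONCE_INIT;

static void create_guard_key(void) {
  int rc = pthread_key_create(&guard_key, NULL);
  assert(rc == 0);
  (void) rc;
}

/* Returns 1 if the guard was acquired, 0 if this thread already holds it. */
static int posix_guard_enter(void *owner) {
  pthread_once(&guard_key_once, create_guard_key);
  if (pthread_getspecific(guard_key) != NULL) return 0;
  return pthread_setspecific(guard_key, owner) == 0;
}

static void posix_guard_leave(void) {
  pthread_setspecific(guard_key, NULL);
}

/* Variant B: C11 thread-local variable. The compiler emits the TLS
 * access inline where the platform's TLS model allows it. */
static _Thread_local void *tls_guard = NULL;

static int tls_guard_enter(void *owner) {
  if (tls_guard != NULL) return 0;
  tls_guard = owner;
  return 1;
}

static void tls_guard_leave(void) {
  tls_guard = NULL;
}
```

Variant B is not a drop-in replacement everywhere: compiler-level thread-local storage is not available or equally cheap on every platform and in every kind of binary, so the actual gain would need to be measured on each target.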
Atomic instructions
In the implementation of _gum_function_context_begin_invocation, a single atomic instruction, lock incl (%rax), accounts for about 25% of the time spent in the function.
With multiple threads (20), atomic instructions cause an even larger slowdown.
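For reference, the kind of C code that typically compiles to a lock-prefixed increment on x86-64 looks like the sketch below. It uses C11 atomics; `FunctionContext`, `usage_counter`, and the three functions are hypothetical names for illustration, and the sketch does not claim to reproduce the counter Gum actually increments.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-function context shared by every thread that calls
 * the hooked function. */
typedef struct {
  atomic_int usage_counter;
} FunctionContext;

/* An atomic read-modify-write on x86-64 needs the lock prefix regardless
 * of the requested memory order; an increment whose result is unused is
 * commonly emitted as a lock-prefixed inc or add. */
static void context_enter(FunctionContext *ctx) {
  atomic_fetch_add(&ctx->usage_counter, 1);
}

static void context_leave(FunctionContext *ctx) {
  int previous = atomic_fetch_sub(&ctx->usage_counter, 1);
  assert(previous > 0);
  (void) previous;
}

static int context_in_use(FunctionContext *ctx) {
  return atomic_load(&ctx->usage_counter) > 0;
}
```

Because every thread updates the same memory location, the cache line holding the counter has to move between cores on each call, which would explain why the cost rises with the thread count.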