[FEATURE] Performance issue of frida gum

The invocation listener of frida gum has some performance issue:

`g_array_set_size`

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 51.28      0.20     0.20                             _frida_g_array_set_size
 15.38      0.26     0.06                             _gum_function_context_begin_invocation
  7.69      0.29     0.03                             main
  5.13      0.31     0.02                             _gum_function_context_end_invocation
  5.13      0.33     0.02                             _init
  5.13      0.35     0.02                             gum_invocation_stack_push
  5.13      0.37     0.02                             plus

According to my test, call to g_array_set_size takes half time of the whole run.

With more investigation, the call to this function happens at gum_invocation_stack_push and gum_invocation_stack_pop, where frida gum uses g_array to maintain a call stack, and will push or pop elements from the stack when entering _gum_function_context_begin_invocation or _gum_function_context_begin_invocation. So g_array_set_size will be called at least twice each time the hooked function was called.

According to the implementation listed:

A memset might be used to clean the extra elements, so I think this is one of the cause of the performance issues.

Output of perf top also proves this

`pthread_setspecific`

Test with multiple threads, pthread_setspecific costs 14.09% of total instruction reads, which is even more than _gum_function_context_begin_invocation itself.

--------------------------------------------------------------------------------
Ir                      
--------------------------------------------------------------------------------
19,585,761,005 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir                      file:function
--------------------------------------------------------------------------------
2,760,002,940 (14.09%)  ./nptl/./nptl/pthread_setspecific.c:pthread_setspecific@@GLIBC_2.34 [/usr/lib/x86_64-linux-gnu/libc.so.6]
2,660,000,000 (13.58%)  frida/build/tmp-linux-x86_64/frida-gum/../../../frida-gum/gum/guminterceptor.c:_gum_function_context_begin_invocation [/root/frida-gum-test/a.out]
2,180,070,385 (11.13%)  ./gmon/./gmon/mcount.c:__mcount_internal [/usr/lib/x86_64-linux-gnu/libc.so.6]
2,000,054,740 (10.21%)  ./gmon/../sysdeps/x86_64/_mcount.S:mcount [/usr/lib/x86_64-linux-gnu/libc.so.6]
1,640,000,000 ( 8.37%)  frida/build/tmp-linux-x86_64/frida-gum/../../../frida-gum/gum/guminterceptor.c:_gum_function_context_end_invocation [/root/frida-gum-test/a.out]
1,580,020,848 ( 8.07%)  ???:main::{lambda()#1}::operator()() const'2 [/root/frida-gum-test/a.out]
1,120,000,000 ( 5.72%)  ???:0x0000000005325208 [???]

pthread_setspecific and pthread_getspecific and APIs to read and write thread local variables, frida gum uses them (and g_private, which also calls the posix apis) to maintain a per-thread state, and access them frequently in invocation begin and end.

Output of perf top proves this

Atomic instructions

In the implementation of _gum_function_context_begin_invocation, an atomic instruction lock incl (%rax) takes about 25% time usage of the function

When using multiple threads (20), atomic instructions cause larger performance decrease:

eunomia-bpf / bpftime