Closed GammaPi closed 2 years ago
Scheme 1 tests the minimal overhead (overhead comes only from the extra `jmp` and instruction cache misses). Compared to the previous implementation, this version eliminates the check for whether the address has been resolved, and bypasses register saving by imitating the GOT mechanism.
I previously tried to mix C and assembly code to achieve the same effect, but it is hard to cooperate with compiler-generated code in that case.
Allocating one callIdSaver entry per symbol is unavoidable.
```asm
; Per-symbol trampoline
jmp callIdSaver[funcAId]

; Resolved path: jmp to the GOT entry
mov GOT -> r11   ; We cannot use a direct jump because we jump from another library into the Scaler library, not within the original address space
jmp *r11

; Unresolved path: invoke the ld resolver
push $funcID
mov FIRST_PLT -> r11   ; We cannot use a direct jump because we jump from the Scaler library to another library
jmp *r11
```
For a resolved symbol, the callIdSaver entry holds the real function address, so it is left as is. For an unresolved symbol, the entry holds the address of the `push $funcID` stub that invokes ld.
Scheme 1 overhead is ~2% on average on PARSEC.
On top of scheme 1, scheme 2 adds counting. Basically, we can use an atomic add instruction to record counts in a non-thread-local array. This bypasses the complexity of accessing a thread-local array, which is very difficult to make work in both C and assembly.
Put the ld-saver address directly into unresolved entries. As before: for a resolved symbol, leave the function address as is; for an unresolved symbol, put the address of the `push $funcID` stub there.
```asm
; Per-symbol trampoline
jmp callIdSaver[funcAId]

; Counting
save used registers      ; save only the registers the counting code touches; use only r11 to avoid saving anything?
atomic add to count array
restore used registers

; Resolved path: jmp to the GOT entry
mov GOT -> r11   ; We cannot use a direct jump because the code is in Scaler's library rather than the original address space
jmp *r11

; Unresolved path: invoke the ld resolver
push $funcID
mov FIRST_PLT -> r11   ; We cannot use a direct jump because the code is in Scaler's library rather than the original address space
jmp *r11
```
Scheme 2 overhead was ~190× on average on PARSEC. The problem was caused by a single add instruction: whether atomic or non-atomic, it slowed the whole program down dramatically, probably a cache problem. It was fixed in recent commits; the overhead is now around ~2%.
Lesson: an atomic add on a shared location slows things down a lot, so it should not be used on the hot path.
On top of schemes 1 and 2, scheme 3 makes counting work together with timing. The timing part needs an after-hook, so it is better to write it in C; in pure assembly this part would be extremely complex.
Same as scheme 2, plus:
- Fill in the TLS offset
- Fill in the variable offset

Detailed code will be available after the implementation finishes.
In the current implementation, we measure how much time an API invocation actually takes (including all child API invocations), rather than excluding children to compute "self-time". This makes it possible to switch between counting and timing mode, as long as we can accurately exclude the counting overhead itself. If we needed exact self-time, switching some invocations to counting-only would make self-time unreliable, because in counting mode we cannot exclude all child invocations from their parent.
We define the gap variable as "the number of counting invocations between two timing rounds". With it, counting mode only needs to fetch and store a single memory location. In each timing round, we add this number back to the total counts in C.
The gap variable also naturally records how many counting rounds we should deduct from the parent.
In counting mode, we do not need to store timing-related information on the stack. The implementation is simple enough to be written in pure assembly, and we have control over everything, which is why this method is fast.
Scheme 4:
On top of scheme 3, we use a direct jmp to the shared common instructions to save space. This is also good for the instruction cache.
Use a micro-benchmark at the end to measure the overhead and make the timing results accurate.
The new idea was implemented in v0.2.0.
#60 related. It looks like the overhead is caused by cache misses and unnecessary instructions. The following schemes might help improve efficiency.