Closed GammaPi closed 2 years ago
Scheme 1 tests the minimal overhead (overhead comes only from the extra `jmp` and instruction cache misses). Compared to the previous implementation, this version eliminates the check for whether the address has been resolved, and bypasses register saving by imitating the GOT mechanism.
I previously tried to mix C and assembly code to achieve the same effect, but it is hard to cooperate with compiler-generated code in that case.
Allocating one callIdSaver entry per symbol is unavoidable.
```asm
; Per-symbol trampoline
jmp callIdSaver[funcAId]

; Resolved path: jmp to the GOT entry
mov GOT -> r11   ; We cannot use a direct jump because we jump from another library into the Scaler library, not within the original address space
jmp *r11

; Unresolved path: invoke the ld resolver
push $funcID
mov FIRST_PLT -> r11   ; We cannot use a direct jump because we jump from the Scaler library to another library
jmp *r11
```
For a resolved symbol, the callIdSaver entry holds the real function address, so it is left as is. For an unresolved symbol, the entry holds the address of the `push $funcID` stub that invokes ld.
Scheme 1 overhead is ~2% on average on PARSEC.
On top of scheme 1, scheme 2 adds counting. Basically, we can use an atomic add instruction to record counts in a non-thread-local array. This bypasses the complexity of accessing a thread-local array, which is very difficult to make work in both C and assembly.
Put the ld-saver address directly into unresolved entries. As before: for a resolved symbol, leave the function address as is; for an unresolved symbol, put the address of the `push $funcID` stub there.
```asm
; Per-symbol trampoline
jmp callIdSaver[funcAId]

; Counting
save used registers      ; save only the registers the counting code touches; use only r11 to avoid saving anything?
atomic add to count array
restore used registers

; Resolved path: jmp to the GOT entry
mov GOT -> r11   ; We cannot use a direct jump because the code is in Scaler's library rather than the original address space
jmp *r11

; Unresolved path: invoke the ld resolver
push $funcID
mov FIRST_PLT -> r11   ; We cannot use a direct jump because the code is in Scaler's library rather than the original address space
jmp *r11
```
Scheme 2 overhead was ~190× on average on PARSEC. The problem was caused by a single add instruction: whether atomic or non-atomic, it slowed the whole program down dramatically, probably a cache problem. It was fixed in recent commits; the overhead is now around ~2%.
Lesson: an atomic add on a shared location slows things down a lot, so it should not be used on the hot path.
On top of schemes 1 and 2, scheme 3 makes counting work together with timing. The timing part needs an after-hook, so it is better to write it in C; in pure assembly this part would be extremely complex.
Same as scheme 2, plus:
- Fill in the TLS offset
- Fill in the variable offset

Detailed code will be available after the implementation finishes.
In the current implementation, we measure how much time an API invocation actually takes (including all child API invocations), rather than excluding children to compute "self-time". This makes it possible to switch between counting and timing mode, as long as we can accurately exclude the counting overhead itself. If we needed exact self-time, switching some invocations to counting-only would make self-time unreliable, because in counting mode we cannot exclude all child invocations from their parent.
We define the gap variable as "the number of counting invocations between two timing rounds". With it, counting mode only needs to fetch and store a single memory location. In each timing round, we add this number back to the total counts in C.
The gap variable also naturally records how many counting rounds we should deduct from the parent.
In counting mode, we do not need to store timing-related information on the stack. The implementation is simple enough to be written in pure assembly, and we have control over everything, which is why this method is fast.
Scheme 4:
On top of scheme 3, we use a direct jmp to the shared common instructions to save space. This is also good for the instruction cache.
Use a micro-benchmark at the end to measure the overhead and make the timing results accurate.
The new idea was implemented in v0.2.0.
#60 related. It looks like the overhead is caused by cache misses and unnecessary instructions. The following schemes might help improve efficiency.