GammaPi commented 1 year ago

Approach 1: Centralized counter

GammaPi commented 1 year ago

We should maintain a thread attribution clock.

GammaPi commented 1 year ago

The thread attribution clock cannot be updated on every API invocation, because doing so would incur an expensive atomic operation each time. We must tolerate some inaccuracy.

GammaPi commented 1 year ago

Another source of inaccuracy comes from thread sleep.

GammaPi commented 1 year ago

This approach attributes API runtime by tracking an active-thread counter, with attribution performed when each API call ends. However, we cannot do the same for the application, which can only be attributed at thread creation/termination time. This difference causes inaccurate results, as shown in the previous image.
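The mismatch can be illustrated with a small simulation. This is only a sketch under assumptions, not Scaler's code: it assumes "attributed at API ending time" means dividing the whole call duration by the thread count seen when the call ends, and it compares that with splitting the call at every thread creation/termination event. The function names and the timeline are hypothetical.

```python
from bisect import bisect_right

def thread_count_at(events, t):
    """Return the live thread count at time t.

    events is a time-sorted list of (time, thread_count_after_event).
    """
    times = [time for time, _ in events]
    index = bisect_right(times, t) - 1
    return events[index][1]

def attribute_at_end(events, start, end):
    """Assumed API strategy: divide the whole duration by the count seen at `end`."""
    return (end - start) / thread_count_at(events, end)

def attribute_per_segment(events, start, end):
    """Assumed application strategy: split [start, end) at every thread
    creation/termination and divide each piece by the count in effect."""
    boundaries = [start] + [t for t, _ in events if start < t < end] + [end]
    total = 0.0
    for a, b in zip(boundaries, boundaries[1:]):
        total += (b - a) / thread_count_at(events, a)
    return total

# Hypothetical timeline: 1 thread from t=0, 4 threads from t=2, 1 thread from t=8.
events = [(0.0, 1), (2.0, 4), (8.0, 1)]

print(attribute_at_end(events, 1.0, 9.0))       # 8.0
print(attribute_per_segment(events, 1.0, 9.0))  # 3.5
```

For the same interval the two strategies disagree (8.0 versus 3.5 here), which is the kind of inconsistency described above when the API and the application are attributed at different points.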

To make the two attributions consistent, we need to apply the same strategy to APIs as to the application, as shown in the previous image.

However, doing so requires more synchronization. If the synchronization overhead is large, we cannot sell this work. There are several possible ways to solve this: https://stackoverflow.com/questions/61237650/a-readers-writer-lock-without-having-a-lock-for-the-readers

We have paused the implementation for now due to time constraints. Another reason is that Scaler is inherently inaccurate (because of thread_sleep?!), so there is no need to implement an accurate attribution approach.

GammaPi commented 1 year ago

Approach 2: Rough attribution (Current)

This approach simply classifies execution into parallel and serial phases and scales each parallel phase by a phase-dependent number calculated from the thread count.

Detailed implementation is described as follows:

1. We keep track of the maximum thread count in a period. Specifically, we keep a counter that tracks thread creation and termination and record the maximum thread count observed. When the thread count drops to 1, we reset the counter to 1.
2. Each API invocation time is divided by the counter defined in 1.
3. Note that we also need to apply a similar attribution strategy to thread execution time in order to calculate "self-time" at the end. For thread runtime, we keep a per-thread counter that records the start and stop times. Scaler's interception points are shown in the previous graph. From this graph we can calculate the time for each "App" segment, and we simply divide each "App" segment by the counter defined in 1.
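The steps above can be sketched roughly as follows. This is a minimal illustration and not Scaler's actual implementation: the class and method names are hypothetical, a plain lock stands in for whatever synchronization Scaler uses, and it assumes that "the counter defined in 1" is the recorded maximum thread count, which is reset to 1 once only one thread remains.

```python
import threading

class RoughAttributor:
    """Hypothetical sketch of the rough attribution counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self._live = 1      # the main thread is assumed to be running
        self._max_live = 1  # maximum thread count seen in the current period

    def on_thread_create(self):
        with self._lock:
            self._live += 1
            self._max_live = max(self._max_live, self._live)

    def on_thread_exit(self):
        with self._lock:
            self._live -= 1
            if self._live == 1:
                # Back to a serial phase: start a new period.
                self._max_live = 1

    def scale(self, duration):
        """Divide an API invocation time or an "App" segment by the counter."""
        with self._lock:
            return duration / self._max_live

attributor = RoughAttributor()
for _ in range(3):
    attributor.on_thread_create()
print(attributor.scale(8.0))  # 2.0, since the maximum thread count is 4

attributor.on_thread_exit()
print(attributor.scale(8.0))  # still 2.0: 3 threads are live, but the maximum stays 4
```

The second call shows where the inaccuracy comes from: after one thread exits, time is still divided by the period's maximum rather than by the number of threads actually running.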

The main problem of this approach is inaccuracy.

GammaPi commented 1 year ago

Outlier removal is currently removed because it's a very heuristic approach. Thread attribution approach 2 has just been implemented.

GammaPi commented 1 year ago

Approach 1 has been implemented successfully. The verification of approach 1.

UTSASRG / Scaler

Implement and sell outlier removal and thread attribution. #86
