*Closed by GammaPi 1 year ago*
Accuracy verification:

Unhooked code may also show high standard deviations, caused by other factors.
|  | 1.csv | 2.csv | 3.csv | 4.csv | 5.csv | 6.csv | 7.csv | 8.csv | STD of 8 CSVs |
|---|---|---|---|---|---|---|---|---|---|
| Default | 3.504390116 | 3.491584798 | 3.508532119 | 272.301503 | 3.510088908 | 3.433923921 | 3.45358492 | 3.509445611 | 95.0401545 |
| FGDS | 24.24936239 | 18.72431027 | 25.02330769 | 13.29199969 | 1.887439924 | 24.71424347 | 22.7798317 | 24.57500176 | 8.159453065 |
| TIM | 3.475804846 | 326.4143447 | 15.25768487 | 184.2302547 | 3.472662643 | 3.501520295 | 2.112087954 | 3.458982936 | 121.9265711 |
| Prehook | 29.06014807 | 37.31480754 | 24.24568058 | 36.46476775 | 273.3137416 | 214.7557835 | 31.33008829 | 36.16574135 | 99.29242186 |
| CNT | 16.95709874 | 24.48905879 | 6.60064619 | 21.71133186 | 24.06005098 | 2.277280063 | 23.87319181 | 21.56350447 | 8.596299763 |
|  | 1.csv | 2.csv | 3.csv | 4.csv | 5.csv | 6.csv | 7.csv | 8.csv | STD of 8 CSVs |
|---|---|---|---|---|---|---|---|---|---|
| Default | 28.30886817 | 26.40012384 | 25.89150854 | 65.23855114 | 26.62982603 | 342.3199058 | 152.9831208 | 25.81481144 | 112.2842605 |
| FGDS | 100.2068642 | 101.418994 | 100.6093796 | 91.83777034 | 18.17835717 | 102.5862586 | 102.6397518 | 322.9655958 | 87.85183654 |
| TIM | 178.5798157 | 172.0972293 | 178.3680494 | 166.8679869 | 366.1101612 | 168.0310557 | 113.0993312 | 282.1324722 | 80.74227553 |
| Prehook | 61.01001342 | 71.07481709 | 55.01362426 | 70.94077344 | 70.31728197 | 436.5227702 | 393.6104832 | 71.29957273 | 161.8176697 |
| CNT | 53.45845618 | 65.12362135 | 38.66457998 | 61.3045054 | 64.6429689 | 16.65176681 | 63.49609709 | 61.23075594 | 17.13928915 |
Outliers seem to play an important role in the calculation. The data looks strange: the inner timing contains no API calls at all.

Using `times()` here is not correct, as it includes CPU time from all threads.
The kernel implementation of the `times` syscall shows why: `do_sys_times()` reports thread-group totals via `thread_group_cputime_adjusted()`:

```c
static void do_sys_times(struct tms *tms)
{
	u64 tgutime, tgstime, cutime, cstime;

	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
	cutime = current->signal->cutime;
	cstime = current->signal->cstime;
	tms->tms_utime = nsec_to_clock_t(tgutime);
	tms->tms_stime = nsec_to_clock_t(tgstime);
	tms->tms_cutime = nsec_to_clock_t(cutime);
	tms->tms_cstime = nsec_to_clock_t(cstime);
}

SYSCALL_DEFINE1(times, struct tms __user *, tbuf)
{
	if (tbuf) {
		struct tms tmp;

		do_sys_times(&tmp);
		if (copy_to_user(tbuf, &tmp, sizeof(struct tms)))
			return -EFAULT;
	}
	force_successful_syscall_return();
	return (long) jiffies_64_to_clock_t(get_jiffies_64());
}

#ifdef CONFIG_COMPAT
static compat_clock_t clock_t_to_compat_clock_t(clock_t x)
{
	return compat_jiffies_to_clock_t(clock_t_to_jiffies(x));
}

COMPAT_SYSCALL_DEFINE1(times, struct compat_tms __user *, tbuf)
{
	if (tbuf) {
		struct tms tms;
		struct compat_tms tmp;

		do_sys_times(&tms);
		/* Convert our struct tms to the compat version. */
		tmp.tms_utime = clock_t_to_compat_clock_t(tms.tms_utime);
		tmp.tms_stime = clock_t_to_compat_clock_t(tms.tms_stime);
		tmp.tms_cutime = clock_t_to_compat_clock_t(tms.tms_cutime);
		tmp.tms_cstime = clock_t_to_compat_clock_t(tms.tms_cstime);
		if (copy_to_user(tbuf, &tmp, sizeof(tmp)))
			return -EFAULT;
	}
	force_successful_syscall_return();
	return compat_jiffies_to_clock_t(jiffies);
}
#endif
```
And `thread_group_cputime()` iterates over every thread in the group and sums their times:

```c
void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
{
	struct signal_struct *sig = tsk->signal;
	u64 utime, stime;
	struct task_struct *t;
	unsigned int seq, nextseq;
	unsigned long flags;

	/*
	 * Update current task runtime to account pending time since last
	 * scheduler action or thread_group_cputime() call. This thread group
	 * might have other running tasks on different CPUs, but updating
	 * their runtime can affect syscall performance, so we skip account
	 * those pending times and rely only on values updated on tick or
	 * other scheduler action.
	 */
	if (same_thread_group(current, tsk))
		(void) task_sched_runtime(current);

	rcu_read_lock();
	/* Attempt a lockless read on the first round. */
	nextseq = 0;
	do {
		seq = nextseq;
		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
		times->utime = sig->utime;
		times->stime = sig->stime;
		times->sum_exec_runtime = sig->sum_sched_runtime;

		for_each_thread(tsk, t) {
			task_cputime(t, &utime, &stime);
			times->utime += utime;
			times->stime += stime;
			times->sum_exec_runtime += read_sum_exec_runtime(t);
		}
		/* If lockless access failed, take the lock. */
		nextseq = 1;
	} while (need_seqretry(&sig->stats_lock, seq));
	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
	rcu_read_unlock();
}
```
Besides, we should not use `times()` to measure time, as it includes all threads' CPU time.

Per-thread user and system time are stored in `task_struct`:
```c
static inline bool task_cputime(struct task_struct *t,
				u64 *utime, u64 *stime)
{
	*utime = t->utime;
	*stime = t->stime;
	return false;
}
```
We should use a fence with RDTSC. The Intel SDM states:

> If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads are globally visible, it can execute LFENCE immediately before RDTSC. If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC. If software requires RDTSC to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute the sequence LFENCE immediately after RDTSC.
RDTSC alone can incur 5% overhead in full-timing mode.

Each configuration was tested 4 times: one build with RDTSC, one without RDTSC.
| benchmark | pthread | libscalerhook-WithRDTSC | libscalerhook-WithOutRDTSC | Overhead |
|---|---|---|---|---|
| blackscholes | 21.48 | 26.055 | 23.415 | 1.112748238 |
| bodytrack | 13.54 | 16.6 | 15.19 | 1.092824226 |
| canneal | 31.46 | 43.025 | 37.215 | 1.156119844 |
| dedup | 18.35 | 11.71 | 13.77 | 0.8503994191 |
| facesim | 265.5 | 281.945 | 276.595 | 1.01934236 |
| ferret | 58.24 | 59.4 | 59.2 | 1.003378378 |
| fluidanimate | 19.8 | 27.81 | 22.97 | 1.210709621 |
| freqmine | 34.63 | 58.535 | 50.62 | 1.156361122 |
| raytrace | 39.23 | 48.995 | 46.35 | 1.057065804 |
| streamcluster | 36.55 | 38.355 | 37.935 | 1.01107157 |
| swaptions | 12.42 | 18.57 | 16.37 | 1.134392181 |
| vips | 23.96 | 18.64 | 20.625 | 0.9037575758 |
| x264 | 16.27 | 15.96 | 16 | 0.9975000001 |
| AVERAGE |  |  |  | 1.054282334 |
These kernel structs can be accessed using the mechanism above, but we need to process two fields: seconds and nanoseconds.

Experiments proved that the data differences are not caused by self-overhead but by inaccurate estimates.

Besides, currently we only use RDTSC.
Why does FGDS output different results compared to running without FGDS?

Will the overhead of the hook handler itself take so much time that the output is no longer correct?