UTSASRG / Scaler


Investigate the accuracy of Scaler #75

Closed. GammaPi closed this issue 1 year ago.

GammaPi commented 2 years ago

Why does FGDS output different results compared to no FGDS?

Does the overhead of the hook handler itself take so much time that the output is no longer correct?

GammaPi commented 1 year ago

Accuracy verification:

Even unhooked code may show high standard deviations (STDs) caused by other factors.

|         | 1.csv | 2.csv | 3.csv | 4.csv | 5.csv | 6.csv | 7.csv | 8.csv | STD of 8 CSVs |
|---------|-------|-------|-------|-------|-------|-------|-------|-------|---------------|
| Default | 3.504390116 | 3.491584798 | 3.508532119 | 272.301503 | 3.510088908 | 3.433923921 | 3.45358492 | 3.509445611 | 95.0401545 |
| FGDS    | 24.24936239 | 18.72431027 | 25.02330769 | 13.29199969 | 1.887439924 | 24.71424347 | 22.7798317 | 24.57500176 | 8.159453065 |
| TIM     | 3.475804846 | 326.4143447 | 15.25768487 | 184.2302547 | 3.472662643 | 3.501520295 | 2.112087954 | 3.458982936 | 121.9265711 |
| Prehook | 29.06014807 | 37.31480754 | 24.24568058 | 36.46476775 | 273.3137416 | 214.7557835 | 31.33008829 | 36.16574135 | 99.29242186 |
| CNT     | 16.95709874 | 24.48905879 | 6.60064619 | 21.71133186 | 24.06005098 | 2.277280063 | 23.87319181 | 21.56350447 | 8.596299763 |

*(Screenshot from 2022-10-23 15-36-01)*

|         | 1.csv | 2.csv | 3.csv | 4.csv | 5.csv | 6.csv | 7.csv | 8.csv | STD of 8 CSVs |
|---------|-------|-------|-------|-------|-------|-------|-------|-------|---------------|
| Default | 28.30886817 | 26.40012384 | 25.89150854 | 65.23855114 | 26.62982603 | 342.3199058 | 152.9831208 | 25.81481144 | 112.2842605 |
| FGDS    | 100.2068642 | 101.418994 | 100.6093796 | 91.83777034 | 18.17835717 | 102.5862586 | 102.6397518 | 322.9655958 | 87.85183654 |
| TIM     | 178.5798157 | 172.0972293 | 178.3680494 | 166.8679869 | 366.1101612 | 168.0310557 | 113.0993312 | 282.1324722 | 80.74227553 |
| Prehook | 61.01001342 | 71.07481709 | 55.01362426 | 70.94077344 | 70.31728197 | 436.5227702 | 393.6104832 | 71.29957273 | 161.8176697 |
| CNT     | 53.45845618 | 65.12362135 | 38.66457998 | 61.3045054 | 64.6429689 | 16.65176681 | 63.49609709 | 61.23075594 | 17.13928915 |

*(Screenshot from 2022-10-23 16-00-27)*

GammaPi commented 1 year ago

Outliers seem to play an important role in the calculation: in the first table's Default row, for example, a single 272.3 among values near 3.5 drives the STD up to 95. The data also looks strange, because the inner timing shows no API calls at all.


GammaPi commented 1 year ago

Using `times()` is not correct, as it includes the CPU time of all threads in the process:


```c
static void do_sys_times(struct tms *tms)
{
    u64 tgutime, tgstime, cutime, cstime;

    thread_group_cputime_adjusted(current, &tgutime, &tgstime);
    cutime = current->signal->cutime;
    cstime = current->signal->cstime;
    tms->tms_utime = nsec_to_clock_t(tgutime);
    tms->tms_stime = nsec_to_clock_t(tgstime);
    tms->tms_cutime = nsec_to_clock_t(cutime);
    tms->tms_cstime = nsec_to_clock_t(cstime);
}

SYSCALL_DEFINE1(times, struct tms __user *, tbuf)
{
    if (tbuf) {
        struct tms tmp;

        do_sys_times(&tmp);
        if (copy_to_user(tbuf, &tmp, sizeof(struct tms)))
            return -EFAULT;
    }
    force_successful_syscall_return();
    return (long) jiffies_64_to_clock_t(get_jiffies_64());
}

#ifdef CONFIG_COMPAT
static compat_clock_t clock_t_to_compat_clock_t(clock_t x)
{
    return compat_jiffies_to_clock_t(clock_t_to_jiffies(x));
}

COMPAT_SYSCALL_DEFINE1(times, struct compat_tms __user *, tbuf)
{
    if (tbuf) {
        struct tms tms;
        struct compat_tms tmp;

        do_sys_times(&tms);
        /* Convert our struct tms to the compat version. */
        tmp.tms_utime = clock_t_to_compat_clock_t(tms.tms_utime);
        tmp.tms_stime = clock_t_to_compat_clock_t(tms.tms_stime);
        tmp.tms_cutime = clock_t_to_compat_clock_t(tms.tms_cutime);
        tmp.tms_cstime = clock_t_to_compat_clock_t(tms.tms_cstime);
        if (copy_to_user(tbuf, &tmp, sizeof(tmp)))
            return -EFAULT;
    }
    force_successful_syscall_return();
    return compat_jiffies_to_clock_t(jiffies);
}
#endif

void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
{
    struct signal_struct *sig = tsk->signal;
    u64 utime, stime;
    struct task_struct *t;
    unsigned int seq, nextseq;
    unsigned long flags;

    /*
     * Update current task runtime to account pending time since last
     * scheduler action or thread_group_cputime() call. This thread group
     * might have other running tasks on different CPUs, but updating
     * their runtime can affect syscall performance, so we skip account
     * those pending times and rely only on values updated on tick or
     * other scheduler action.
     */
    if (same_thread_group(current, tsk))
        (void) task_sched_runtime(current);

    rcu_read_lock();
    /* Attempt a lockless read on the first round. */
    nextseq = 0;
    do {
        seq = nextseq;
        flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
        times->utime = sig->utime;
        times->stime = sig->stime;
        times->sum_exec_runtime = sig->sum_sched_runtime;

        for_each_thread(tsk, t) {
            task_cputime(t, &utime, &stime);
            times->utime += utime;
            times->stime += stime;
            times->sum_exec_runtime += read_sum_exec_runtime(t);
        }
        /* If lockless access failed, take the lock. */
        nextseq = 1;
    } while (need_seqretry(&sig->stats_lock, seq));
    done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
    rcu_read_unlock();
}
```
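
A quick userspace check makes the aggregation visible (a minimal sketch; build with `-pthread`): the main thread sleeps while a worker burns CPU, yet `times()` still reports the user time.

```c
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/times.h>

/* Worker thread: burn CPU so the process accumulates user time. */
static void *burn(void *arg)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 1000000000UL; i++)
        x += i;
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, burn, NULL);
    sleep(1);                 /* the calling thread does no CPU work */

    struct tms buf;
    times(&buf);
    /* tms_utime is nonzero even though this thread slept:
     * times() charges the worker's CPU time to the whole process. */
    printf("utime ticks: %ld\n", (long)buf.tms_utime);

    pthread_join(t, NULL);
    return 0;
}
```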

This is why `times()` should not be used for our per-thread measurements: it aggregates CPU time across every thread in the process.

User and system time are stored per thread in `task_struct`:

```c
static inline bool task_cputime(struct task_struct *t,
                u64 *utime, u64 *stime)
{
    *utime = t->utime;
    *stime = t->stime;
    return false;
}
```
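
If per-thread CPU time is what we actually want, one alternative (a sketch, not necessarily what Scaler adopted) is `clock_gettime` with `CLOCK_THREAD_CPUTIME_ID`, which queries the kernel's per-thread accounting rather than the process-wide sums:

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* CPU time consumed by the calling thread only, in nanoseconds. */
static inline uint64_t thread_cpu_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    uint64_t start = thread_cpu_ns();
    /* ... code to be measured ... */
    printf("thread CPU time: %llu ns\n",
           (unsigned long long)(thread_cpu_ns() - start));
    return 0;
}
```
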
GammaPi commented 1 year ago

We should use a fence instruction with RDTSC. From the Intel SDM:

> If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads are globally visible, it can execute LFENCE immediately before RDTSC. If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC. If software requires RDTSC to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute the sequence LFENCE immediately after RDTSC.
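
Following that guidance, a minimal sketch of a fenced TSC read using compiler intrinsics (the function name is ours):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence */

/* LFENCE before RDTSC: prior instructions complete and prior loads are
 * globally visible before the TSC is read. LFENCE after: later
 * instructions cannot begin before the read completes. */
static inline uint64_t rdtsc_fenced(void)
{
    _mm_lfence();
    uint64_t tsc = __rdtsc();
    _mm_lfence();
    return tsc;
}
```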

GammaPi commented 1 year ago

RDTSC alone can incur about 5% overhead in full-timing mode.

Each benchmark was tested 4 times, once with RDTSC and once without. Overhead is WithRDTSC / WithOutRDTSC (e.g., blackscholes: 26.055 / 23.415 ≈ 1.113).

| benchmark | pthread | libscalerhook-WithRDTSC | libscalerhook-WithOutRDTSC | Overhead |
|-----------|---------|-------------------------|----------------------------|----------|
| blackscholes | 21.48 | 26.055 | 23.415 | 1.112748238 |
| bodytrack | 13.54 | 16.6 | 15.19 | 1.092824226 |
| canneal | 31.46 | 43.025 | 37.215 | 1.156119844 |
| dedup | 18.35 | 11.71 | 13.77 | 0.8503994191 |
| facesim | 265.5 | 281.945 | 276.595 | 1.01934236 |
| ferret | 58.24 | 59.4 | 59.2 | 1.003378378 |
| fluidanimate | 19.8 | 27.81 | 22.97 | 1.210709621 |
| freqmine | 34.63 | 58.535 | 50.62 | 1.156361122 |
| raytrace | 39.23 | 48.995 | 46.35 | 1.057065804 |
| streamcluster | 36.55 | 38.355 | 37.935 | 1.01107157 |
| swaptions | 12.42 | 18.57 | 16.37 | 1.134392181 |
| vips | 23.96 | 18.64 | 20.625 | 0.9037575758 |
| x264 | 16.27 | 15.96 | 16 | 0.9975000001 |
| AVERAGE | | | | 1.054282334 |
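
To sanity-check the per-read cost behind these numbers, a rough micro-benchmark sketch (ours, not part of Scaler) that times back-to-back fenced reads:

```c
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void)
{
    enum { N = 10000000 };
    uint64_t start = __rdtsc();
    for (int i = 0; i < N; i++) {
        _mm_lfence();          /* fenced read, per the SDM guidance above */
        (void)__rdtsc();
    }
    uint64_t total = __rdtsc() - start;
    printf("~%.1f cycles per fenced RDTSC\n", (double)total / N);
    return 0;
}
```
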
GammaPi commented 1 year ago

These kernel structs can be accessed using the following mechanism, but we then need to combine two metrics: seconds and nanoseconds.

*(screenshots of the kernel struct access mechanism)*
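
Combining the two metrics is straightforward; a sketch of the conversion (a 64-bit nanosecond count only overflows after roughly 584 years):

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Fold a (seconds, nanoseconds) pair into a single nanosecond counter,
 * so two samples can be subtracted directly. */
static inline uint64_t to_ns(uint64_t sec, uint64_t nsec)
{
    return sec * NSEC_PER_SEC + nsec;
}
```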

GammaPi commented 1 year ago

Experiments proved that the data difference is not caused by self-overhead, but by inaccurate estimates.

Also, we currently only use RDTSC.