*Closed by GammaPi 1 year ago*
Accuracy verification:

Unhooked code may also show high standard deviations, caused by other factors.
|  | 1.csv | 2.csv | 3.csv | 4.csv | 5.csv | 6.csv | 7.csv | 8.csv | STD of 8 CSVs |
|---|---|---|---|---|---|---|---|---|---|
| Default | 3.504390116 | 3.491584798 | 3.508532119 | 272.301503 | 3.510088908 | 3.433923921 | 3.45358492 | 3.509445611 | 95.0401545 |
| FGDS | 24.24936239 | 18.72431027 | 25.02330769 | 13.29199969 | 1.887439924 | 24.71424347 | 22.7798317 | 24.57500176 | 8.159453065 |
| TIM | 3.475804846 | 326.4143447 | 15.25768487 | 184.2302547 | 3.472662643 | 3.501520295 | 2.112087954 | 3.458982936 | 121.9265711 |
| Prehook | 29.06014807 | 37.31480754 | 24.24568058 | 36.46476775 | 273.3137416 | 214.7557835 | 31.33008829 | 36.16574135 | 99.29242186 |
| CNT | 16.95709874 | 24.48905879 | 6.60064619 | 21.71133186 | 24.06005098 | 2.277280063 | 23.87319181 | 21.56350447 | 8.596299763 |
|  | 1.csv | 2.csv | 3.csv | 4.csv | 5.csv | 6.csv | 7.csv | 8.csv | STD of 8 CSVs |
|---|---|---|---|---|---|---|---|---|---|
| Default | 28.30886817 | 26.40012384 | 25.89150854 | 65.23855114 | 26.62982603 | 342.3199058 | 152.9831208 | 25.81481144 | 112.2842605 |
| FGDS | 100.2068642 | 101.418994 | 100.6093796 | 91.83777034 | 18.17835717 | 102.5862586 | 102.6397518 | 322.9655958 | 87.85183654 |
| TIM | 178.5798157 | 172.0972293 | 178.3680494 | 166.8679869 | 366.1101612 | 168.0310557 | 113.0993312 | 282.1324722 | 80.74227553 |
| Prehook | 61.01001342 | 71.07481709 | 55.01362426 | 70.94077344 | 70.31728197 | 436.5227702 | 393.6104832 | 71.29957273 | 161.8176697 |
| CNT | 53.45845618 | 65.12362135 | 38.66457998 | 61.3045054 | 64.6429689 | 16.65176681 | 63.49609709 | 61.23075594 | 17.13928915 |
Outliers seem to play an important role in the calculation. The data looks strange: the inner timing contains no API calls at all.

Using `times()` here is not correct, as it includes CPU time from all threads.
The kernel implementation of the `times` syscall shows why: `do_sys_times()` reports thread-group totals via `thread_group_cputime_adjusted()`:

```c
static void do_sys_times(struct tms *tms)
{
	u64 tgutime, tgstime, cutime, cstime;

	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
	cutime = current->signal->cutime;
	cstime = current->signal->cstime;
	tms->tms_utime = nsec_to_clock_t(tgutime);
	tms->tms_stime = nsec_to_clock_t(tgstime);
	tms->tms_cutime = nsec_to_clock_t(cutime);
	tms->tms_cstime = nsec_to_clock_t(cstime);
}

SYSCALL_DEFINE1(times, struct tms __user *, tbuf)
{
	if (tbuf) {
		struct tms tmp;

		do_sys_times(&tmp);
		if (copy_to_user(tbuf, &tmp, sizeof(struct tms)))
			return -EFAULT;
	}
	force_successful_syscall_return();
	return (long) jiffies_64_to_clock_t(get_jiffies_64());
}

#ifdef CONFIG_COMPAT
static compat_clock_t clock_t_to_compat_clock_t(clock_t x)
{
	return compat_jiffies_to_clock_t(clock_t_to_jiffies(x));
}

COMPAT_SYSCALL_DEFINE1(times, struct compat_tms __user *, tbuf)
{
	if (tbuf) {
		struct tms tms;
		struct compat_tms tmp;

		do_sys_times(&tms);
		/* Convert our struct tms to the compat version. */
		tmp.tms_utime = clock_t_to_compat_clock_t(tms.tms_utime);
		tmp.tms_stime = clock_t_to_compat_clock_t(tms.tms_stime);
		tmp.tms_cutime = clock_t_to_compat_clock_t(tms.tms_cutime);
		tmp.tms_cstime = clock_t_to_compat_clock_t(tms.tms_cstime);
		if (copy_to_user(tbuf, &tmp, sizeof(tmp)))
			return -EFAULT;
	}
	force_successful_syscall_return();
	return compat_jiffies_to_clock_t(jiffies);
}
#endif
```
And `thread_group_cputime()` iterates over every thread in the group and sums their times:

```c
void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
{
	struct signal_struct *sig = tsk->signal;
	u64 utime, stime;
	struct task_struct *t;
	unsigned int seq, nextseq;
	unsigned long flags;

	/*
	 * Update current task runtime to account pending time since last
	 * scheduler action or thread_group_cputime() call. This thread group
	 * might have other running tasks on different CPUs, but updating
	 * their runtime can affect syscall performance, so we skip account
	 * those pending times and rely only on values updated on tick or
	 * other scheduler action.
	 */
	if (same_thread_group(current, tsk))
		(void) task_sched_runtime(current);

	rcu_read_lock();
	/* Attempt a lockless read on the first round. */
	nextseq = 0;
	do {
		seq = nextseq;
		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
		times->utime = sig->utime;
		times->stime = sig->stime;
		times->sum_exec_runtime = sig->sum_sched_runtime;

		for_each_thread(tsk, t) {
			task_cputime(t, &utime, &stime);
			times->utime += utime;
			times->stime += stime;
			times->sum_exec_runtime += read_sum_exec_runtime(t);
		}
		/* If lockless access failed, take the lock. */
		nextseq = 1;
	} while (need_seqretry(&sig->stats_lock, seq));
	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
	rcu_read_unlock();
}
```
Besides, we should not use `times()` to measure time, as it includes all threads' CPU time.

Per-thread user and system time are stored in `task_struct`:
```c
static inline bool task_cputime(struct task_struct *t,
				u64 *utime, u64 *stime)
{
	*utime = t->utime;
	*stime = t->stime;
	return false;
}
```
We should use a fence with RDTSC. The Intel SDM states:

> If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads are globally visible, it can execute LFENCE immediately before RDTSC. If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC. If software requires RDTSC to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute the sequence LFENCE immediately after RDTSC.
RDTSC alone can incur 5% overhead in full-timing mode.

Each configuration was tested 4 times: one build with RDTSC, one without RDTSC.
| benchmark | pthread | libscalerhook-WithRDTSC | libscalerhook-WithOutRDTSC | Overhead |
|---|---|---|---|---|
| blackscholes | 21.48 | 26.055 | 23.415 | 1.112748238 |
| bodytrack | 13.54 | 16.6 | 15.19 | 1.092824226 |
| canneal | 31.46 | 43.025 | 37.215 | 1.156119844 |
| dedup | 18.35 | 11.71 | 13.77 | 0.8503994191 |
| facesim | 265.5 | 281.945 | 276.595 | 1.01934236 |
| ferret | 58.24 | 59.4 | 59.2 | 1.003378378 |
| fluidanimate | 19.8 | 27.81 | 22.97 | 1.210709621 |
| freqmine | 34.63 | 58.535 | 50.62 | 1.156361122 |
| raytrace | 39.23 | 48.995 | 46.35 | 1.057065804 |
| streamcluster | 36.55 | 38.355 | 37.935 | 1.01107157 |
| swaptions | 12.42 | 18.57 | 16.37 | 1.134392181 |
| vips | 23.96 | 18.64 | 20.625 | 0.9037575758 |
| x264 | 16.27 | 15.96 | 16 | 0.9975000001 |
| AVERAGE |  |  |  | 1.054282334 |
These kernel structs can be accessed using the mechanism above, but we need to process two fields: seconds and nanoseconds.

Experiments proved that the data differences are not caused by self-overhead but by inaccurate estimates.

Besides, currently we only use RDTSC.
Why does FGDS output different results compared to running without FGDS?

Will the overhead of the hook handler itself take so much time that the output is no longer correct?