clearlinux / distribution

Placeholder repository to allow filing of general bugs/issues/etc. against the Clear Linux OS for Intel Architecture Linux distribution

KUtrace: performance analysis #3168

Open kathe opened 1 month ago

kathe commented 1 month ago

Please consider including KUtrace (https://github.com/dicksites/KUtrace) by Richard "Dick" Sites.

fenrus75 commented 1 month ago

Unfortunately, that project seems not to have made a release yet -- making it very hard for us to package...

kathe commented 1 month ago

I believe there has been a release at the 6.6 mark: https://github.com/dicksites/KUtrace/tree/master/linux/linux-6.6.36. Would @dicksites (the author) confirm? Also, would @dicksites please consider supporting the latest kernel? I believe it is 6.10! CC: @ephut

dicksites commented 1 month ago

Yes, I posted a 6.6.36 diff/patch file. You can probably rebase it for 6.10 with very little effort. I am retired and unlikely to keep up with multiple kernel releases per year. I have instead focused on occasional LTS (long term support) releases.

I am curious to hear more about your KUtrace use. /dick

kathe commented 1 month ago

@dicksites, thanks for your confirmation, and I agree with your point about supporting only LTS releases.

My use of KUtrace: It can be a very useful tool, especially when used in tandem with BPF under Linux.

CC: @ephut @brendangregg

fenrus75 commented 2 weeks ago

In general, we do not add patches to our kernels for such things -- so this is really not going to work for us.

kathe commented 2 weeks ago

Are you saying that software developers should not have access to tools that afford performance improvement of applications under Clear Linux?

CC: @ephut

brendangregg commented 2 weeks ago

When I got involved in Linux tracing in 2014, there were already at least 10 different Linux tracers (ftrace, perf, LTT, LTTng, SystemTap, ktap, dtrace4Linux, Oracle DTrace, Intel Pin, sysdig), and eBPF had potential for yet another one, from which we now have bcc, bpftrace, and ply. Some of these have had enormous success and adoption: perf and Ftrace have widespread usage, and Ubuntu 24.04 recently made bpftrace and bcc a default install. I'd recommend Clear Linux support bcc and bpftrace if it doesn't already.

I have yet to see a strong argument for KUtrace, one that addresses the recent efficiency work we've been doing in eBPF, Ftrace, kprobes, raw tracepoints, fentry, fast uprobes, etc.

kathe commented 2 weeks ago

@brendangregg, I'll write more on KUtrace usage via a private email to you soon. I have been informed by Red Hat users (who did private patches on the kernel) about the extremely low overhead of KUtrace. I think @dicksites can confirm that, as well as elaborate more authoritatively.

CC: @ephut

brendangregg commented 2 weeks ago

Red Hat has also been working on lowering eBPF and its probe-attach overhead over the last year. The minimum cost for eBPF kernel events, measured a few years ago, was 69 nanoseconds. I imagine it's lower than that now. What is it for KUtrace?

kathe commented 2 weeks ago

I urge @dicksites to respond to the query above by @brendangregg. Only @dicksites can comment authoritatively.

brendangregg commented 2 weeks ago

Note that 69ns is the (now old) minimum kprobe cost from eBPF. The Linux kernel also has Ftrace built in, which serves a lot of serious performance needs. If I were comparing KUtrace with anything, I'd be most interested in whether it can beat the Ftrace minimum overhead, which I would guess is much lower.

dicksites commented 2 weeks ago

I will take a look this weekend and try to do some careful comparison measurements. I'm running Linux 6.6.36 kernel on an Intel i3-7100. /dick

dicksites commented 1 week ago

I have no idea if I am measuring ftrace correctly, but here goes...

The ftrace numbers are obviously too high. I hope that Brendan or someone can instruct me exactly how to set up a better measurement.

The test program does 250K calls to getpid (the fastest syscall, I think) to warm up the CPU clock(s) to high speed, then times 50K more calls to getpid via gettimeofday() and prints the elapsed time and per-call time. For KUtrace there are TWO traced events per syscall, enter and exit, so I also print percall/2 to show the time per event. Calls are done via syscall(__NR_getpid, dummy1, dummy2); (I included the dummy parameters to compare against my internal calls with two parameters.) Loops are unrolled 4 times, so each loop iteration does four syscalls and the loop runs 50000/4 times. Results of the syscalls are all added to a running sum, which is kept live outside the loop via a conditional printf whose condition is always false, but the compiler doesn't know that.
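A minimal sketch of such a timing loop, illustrating the structure just described (warm-up, gettimeofday() bracketing, 4x unrolling, and the always-false printf that keeps the sum live); this is not the actual test program:

```c
/* Sketch of the getpid timing loop described above; not the original
 * test program. Warm up the CPU clock, then time 50K syscalls with a
 * 4x-unrolled loop and keep the sum live via an always-false printf. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

int main(void) {
    long dummy1 = 1, dummy2 = 2;
    long sum = 0;

    /* Warm-up: 250K syscalls to bring the CPU clock(s) up to full speed. */
    for (int i = 0; i < 250000; ++i)
        sum += syscall(__NR_getpid, dummy1, dummy2);

    /* Timed region: 50K syscalls, unrolled 4x, so 50000/4 iterations. */
    struct timeval start, stop;
    gettimeofday(&start, NULL);
    for (int i = 0; i < 50000 / 4; ++i) {
        sum += syscall(__NR_getpid, dummy1, dummy2);
        sum += syscall(__NR_getpid, dummy1, dummy2);
        sum += syscall(__NR_getpid, dummy1, dummy2);
        sum += syscall(__NR_getpid, dummy1, dummy2);
    }
    gettimeofday(&stop, NULL);

    double usec = (stop.tv_sec - start.tv_sec) * 1e6 +
                  (stop.tv_usec - start.tv_usec);
    /* Two traced events (enter, exit) per syscall, hence percall/2. */
    printf("elapsed %.0f usec, %.2f nsec/call, %.2f nsec/event\n",
           usec, usec * 1000.0 / 50000.0, usec * 1000.0 / 100000.0);

    /* Always-false condition keeps sum live; the compiler can't prove it. */
    if (stop.tv_sec == 0) printf("sum = %ld\n", sum);
    return 0;
}
```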

Running kernel 6.6.36 on an Intel i3-7100 (3.6 GHz nominal, 4 CPUs; actually two physical cores hyperthreaded).

Here are the KUtrace results:

- Base loop: 250-252 nsec per syscall
- With KUtrace go: 298-299 nsec per call, or an extra 24 nsec per KU-recorded event
- With KUtrace goipc: 324-327 nsec per call, or an extra 38 nsec per KU-recorded event

goipc also records the instructions per cycle for every small, in this case ~100 nsec, traced time interval. This involves reading the instructions_retired performance counter for about 36 CPU cycles and then doing a 32-bit division for about 20 cycles. Thus, the large extra time.
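As a cross-check on those figures: the extra per-event cost of goipc over go is 38 - 24 = 14 nsec, while the counter read plus divide is about 36 + 20 = 56 cycles, or roughly 15.6 nsec at 3.6 GHz, so the two estimates are in rough agreement.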

In /sys/kernel/tracing, I set up these parameters:

- current_tracer = nop, function, or function_graph
- tracing_on = 0 or 1
- buffer_size_kb = 40000
- max_graph_depth = 1
- trace_clock = x86-tsc
- trace_options = raw, bin, and overwrite
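For reproducibility, a small sketch that applies those settings programmatically; it assumes tracefs is mounted at /sys/kernel/tracing and root privileges, and the tracefs_write helper is just an illustrative name (each call is equivalent to echoing the value into the corresponding file):

```c
/* Sketch: apply the ftrace settings listed above by writing to tracefs.
 * Assumes tracefs at /sys/kernel/tracing and root privileges. */
#include <stdio.h>
#include <stdlib.h>

/* Write one value into a control file under /sys/kernel/tracing. */
static void tracefs_write(const char *file, const char *value) {
    char path[256];
    snprintf(path, sizeof(path), "/sys/kernel/tracing/%s", file);
    FILE *f = fopen(path, "w");
    if (f == NULL) { perror(path); exit(1); }
    fprintf(f, "%s\n", value);
    fclose(f);
}

int main(void) {
    tracefs_write("tracing_on", "0");            /* configure with tracing off */
    tracefs_write("current_tracer", "function"); /* or "nop" / "function_graph" */
    tracefs_write("buffer_size_kb", "40000");
    tracefs_write("max_graph_depth", "1");       /* used by function_graph */
    tracefs_write("trace_clock", "x86-tsc");
    tracefs_write("trace_options", "raw");       /* enable each option by name */
    tracefs_write("trace_options", "bin");
    tracefs_write("trace_options", "overwrite");
    tracefs_write("tracing_on", "1");            /* start tracing */
    return 0;
}
```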

Here are the ftrace results:

- tracer=nop, tracing off: 250-251 nsec per call (same as the KUtrace-off base case)
- tracer=function, tracing off: 611-613 nsec per call, or an extra ~362 nsec per call (with no tracing)
- tracer=function, tracing on: 1413-1479 nsec per call, or an extra ~1200 nsec per call (1 event per call)

function just traces a single event, syscall_enter.

- tracer=function_graph, tracing off: 692-694 nsec per call, or an extra ~440 nsec per call (with no tracing)
- tracer=function_graph, tracing on: 1679-1680 nsec per call, or an extra ~1430 nsec per call (2 events per call)

function_graph traces two events, syscall_enter and syscall_exit.
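Per traced event, that works out to roughly 1430/2, or about 715 nsec, for function_graph with tracing on, compared with 24 nsec per event for KUtrace go above.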

Let me know where to go from here. Thanks, /dick

P.S. I will send along some of the KUtrace output shortly.

dicksites commented 1 week ago

Tail end of getpid loop with KUtrace. Full traces available on request. /dick

40 microseconds across the screen. Each user+system iteration takes 850 nsec on the left. Each syscall takes 100 nsec in the middle. Page fault upon loop exit takes 5.69 usec on the right.

[image: Screenshot 2024-09-02 at 16.53.32.png]

40 microseconds across the screen. Same binary executing with ftrace running function tracing as described in the previous mail. Each user+system iteration takes 2710 nsec on the left (time 826.50 - 823.79). Each syscall takes 1160 nsec in the middle. Page fault upon loop exit takes 12.5 usec on the right.

[image: Screenshot 2024-09-02 at 16.55.52.png]
