anakryiko / retsnoop

Investigate kernel error call stacks
BSD 2-Clause "Simplified" License
186 stars 32 forks

arm64/5.10: Failed to calibrate kretprobe func IP extraction. #32

Closed by martinetd 1 year ago

martinetd commented 1 year ago

So it turns out the board at work I wanted to use retsnoop on is running a 5.10 kernel (with no plans to upgrade, welcome to embedded), which doesn't have bpf_get_func_ip() (added in 5.15) yet.

From a quick look at the code, there is a fallback via the calibration step that should derive kret_ip_off, but it stays stuck at 0 and retsnoop refuses to run:

Feature detection:
    BPF ringbuf map supported: yes
    bpf_get_func_ip() supported: no
    bpf_get_branch_snapshot() supported: no
    BPF cookie supported: no
    multi-attach kprobe supported: no
Feature calibration:
    kretprobe IP offset: 0
    fexit sleep fix: yes
    fentry re-entry protection: no
Failed to calibrate kretprobe func IP extraction.
Failed to perform feature calibration: -14
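
(That -14 is -EFAULT.)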

After adding some logs to calib_exit, I noticed entry_ip was off by one, so on aarch64 I got retsnoop to work with this:

diff --git a/src/calib_feat.bpf.c b/src/calib_feat.bpf.c
index d558824d0151..bbbb1d33b7b2 100644
--- a/src/calib_feat.bpf.c
+++ b/src/calib_feat.bpf.c
@@ -33,7 +33,7 @@ int calib_entry(struct pt_regs *ctx)
    /* Used for kretprobe function entry IP discovery, before
     * bpf_get_func_ip() helper was added.
     */
-   entry_ip = PT_REGS_IP(ctx) - 1;
+   entry_ip = PT_REGS_IP(ctx);

    /* Detect if bpf_get_func_ip() helper is supported by the kernel.
     * Added in: 9b99edcae5c8 ("bpf: Add bpf_get_func_ip helper for tracing programs")
diff --git a/src/mass_attach.bpf.c b/src/mass_attach.bpf.c
index 9976b4d58540..e4b6ec5001ea 100644
--- a/src/mass_attach.bpf.c
+++ b/src/mass_attach.bpf.c
@@ -94,7 +94,7 @@ int kentry(struct pt_regs *ctx)
    if (has_bpf_get_func_ip)
        ip = bpf_get_func_ip(ctx);
    else
-       ip = PT_REGS_IP(ctx) - 1;
+       ip = PT_REGS_IP(ctx);

    if (has_bpf_cookie) {
        id = bpf_get_attach_cookie(ctx);

Is that something that changes depending on the arch? Looking at the git log, that -1 has more or less always been there, so I assume it is needed on x86_64? If you know how that works, we could probably make a list of archs that do or don't need it to fix this; there are still plenty of machines with pre-5.15 kernels out there.

With that change, retsnoop works okay-ish for me; it's just quite slow with a large number of probes. For example, running the simfail bpf-simple-obj case from your blog with full traces, I get the following times:

# ./retsnoop -e 'bpf_prog_load' -a ':kernel/bpf/*.c' -d 'bpf_lsm*' -d 'push_insn' -d '__mark_reg_unknown' -S -T -l -vv
...
Total 733 kernel functions attached successfully!
Successfully attached in 10040 ms.
...
^C
Detaching... DONE in 43517 ms.

10s to attach and 43s to detach is a tad slow, but I guess those are just hardware limitations. With fewer probes it's somewhat reasonable (attach/detach times: 10/104 ms for 1 probe, 1462/4371 ms for 70... it's actually almost linear). I assume the multi-attach support introduced in 5.18 would speed that up quite a bit?
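
For what it's worth, that's roughly 10040/733 ≈ 14 ms per probe to attach and 43517/733 ≈ 59 ms per probe to detach, not far from the ≈21/≈62 ms per probe implied by the 70-probe run, so the per-probe cost does look roughly constant.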

anakryiko commented 1 year ago

@martinetd hey, sorry for the delay, just got around to retsnoop stuff.

re: -1, I honestly don't remember the details by now. But I also suspect that I just saw that PT_REGS_IP(ctx) returns real IP + 1 and just compensated. Why it is +1 -- no idea. It seems that compensation is not necessary on arm64, so I think it's reasonable to special-case x86_64 here. Would you like to send a patch? I don't have access to arm64 at the moment (I'll see if I can get some), so I'd appreciate any testing you can do.

As for the slow attachment to lots of functions: this gets much (and I mean MUCH) faster starting from the 5.18 kernel, when multi-attach kprobes were introduced. But it's not a hardware limitation, it's just lots of waiting inside the kernel between each individual BPF program attachment -- pure waste that multi-attach kprobes finally solved.
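
For reference, the multi-attach path from the libbpf side looks roughly like this (just a sketch with a placeholder glob, not retsnoop's actual attach code):

```c
/* Sketch: attaching one BPF program to many kernel functions through a
 * single kprobe.multi link (kernel >= 5.18, libbpf >= 0.8). The "bpf_*"
 * pattern is only a placeholder, not what retsnoop actually uses. */
#include <stdbool.h>
#include <bpf/libbpf.h>

static struct bpf_link *attach_all(struct bpf_program *prog, bool retprobe)
{
	LIBBPF_OPTS(bpf_kprobe_multi_opts, opts,
		.retprobe = retprobe,
	);

	/* One attach call covers every function matching the pattern,
	 * instead of creating one kprobe + perf event per function. */
	return bpf_program__attach_kprobe_multi_opts(prog, "bpf_*", &opts);
}
```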

But unfortunately you are stuck on 5.10, so this attachment (and it is only attachment) slowness will stay.

martinetd commented 1 year ago

> But I also suspect that I just saw that PT_REGS_IP(ctx) returns real IP + 1 and just compensated

That's fair!

I don't like blindly making the +1 architecture-dependent when I don't know which arch does what, so I looked at it a bit more, and it seems to be CPU-dependent rather than arch-dependent? In the Linux code, if you grep for "precise_ip" there are a bunch of matches, and depending on how the code is run the off-by-one might not always be there:

arch/x86/events/intel/ds.c:
                /*
                 * Haswell and later processors have an 'eventing IP'
                 * (real IP) which fixes the off-by-1 skid in hardware.
                 * Use it when precise_ip >= 2 :
                 */     

I guess we don't need a perfect fix given there's a better method available on >= 5.15 kernels, but perhaps just assuming the return IP must be rounded down would be better? I don't think any return IP can have anything other than 0 in its low (& 0x7) bits, as we often see those bits used for tagged pointers...

> This gets much (and I mean MUCH) faster starting from the 5.18 kernel, when multi-attach kprobes were introduced.

Great! Not that happy to read it's a pure waste of time, but it'll go away eventually. I might wait a few years for multi-attach before showing retsnoop off to coworkers, but it's not like there's a hurry, and I'll just put up with it until then ;)

anakryiko commented 1 year ago

> > But I also suspect that I just saw that PT_REGS_IP(ctx) returns real IP + 1 and just compensated
>
> That's fair!
>
> I don't like blindly making the +1 architecture-dependent when I don't know which arch does what, so I looked at it a bit more, and it seems to be CPU-dependent rather than arch-dependent? In the Linux code, if you grep for "precise_ip" there are a bunch of matches, and depending on how the code is run the off-by-one might not always be there:
>
> arch/x86/events/intel/ds.c:
>                 /*
>                  * Haswell and later processors have an 'eventing IP'
>                  * (real IP) which fixes the off-by-1 skid in hardware.
>                  * Use it when precise_ip >= 2 :
>                  */
>
> I guess we don't need a perfect fix given there's a better method available on >= 5.15 kernels, but perhaps just assuming the return IP must be rounded down would be better? I don't think any return IP can have anything other than 0 in its low (& 0x7) bits, as we often see those bits used for tagged pointers...

well, on my system I see in /proc/kallsyms:

ffffffff839c14e4 T classes_init
ffffffff839c1510 T platform_bus_init
ffffffff839c155d T cpu_dev_init
ffffffff839c15a9 T firmware_init
ffffffff839c15cd T driver_init

Which suggests the & ~0x7 trick won't work, unfortunately?
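
(For example, cpu_dev_init at ...155d has 0x5 in its low three bits, so masking with ~0x7 would point 5 bytes before the real entry.)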

> > This gets much (and I mean MUCH) faster starting from the 5.18 kernel, when multi-attach kprobes were introduced.
>
> Great! Not that happy to read it's a pure waste of time, but it'll go away eventually. I might wait a few years for multi-attach before showing retsnoop off to coworkers, but it's not like there's a hurry, and I'll just put up with it until then ;)

It's quite usable when attaching to up to a hundred functions or so, but I understand :) Still, there are situations where I'd rather wait 5 minutes for retsnoop to start just to be able to understand some kernel error, so...

martinetd commented 1 year ago

> Which suggests the & ~0x7 trick won't work, unfortunately?

Ugh, I guess I could have checked a bit more thoroughly, sorry.

Well, offsetting by 1 conditionally on arch won't be worse than what we currently have anyway, so let's go with that. I'll try to whip up a patch next week-ish.
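
Something roughly along these lines, I imagine (just a sketch with a made-up helper name; bpf_target_x86 comes from libbpf's bpf_tracing.h, and the actual patch may well end up looking different):

```c
/* Sketch of an arch-conditional entry IP read. On x86 the kprobe
 * apparently reports the address one byte past the function entry
 * (hence the -1); on arm64 it reports the exact address.
 * entry_ip_of() is a hypothetical helper, just to keep this
 * self-contained. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

static __always_inline u64 entry_ip_of(struct pt_regs *ctx)
{
#ifdef bpf_target_x86
	return PT_REGS_IP(ctx) - 1;
#else
	return PT_REGS_IP(ctx);
#endif
}
```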