Closed · martinetd closed this issue 1 year ago
@martinetd hey, sorry for delay, just got around to retsnoop stuff
re: -1, I honestly don't remember details by now. But I also suspect that I just saw that PT_REGS_IP(ctx) returns real IP + 1 and just compensated. Why it is + 1 -- no idea. Seems like that compensation is not necessary on arm64, so I think it's reasonable to special-case x86_64 here. Would you like to send a patch? I don't have access to arm64 at the moment (I'll see if I can get one), so would appreciate testing you can do.
As for slow attachment to lots of functions: this gets much (and I mean MUCH) faster starting from the 5.18 kernel, when multi-attach kprobes were introduced. It's not a hardware limitation, it's just lots of waiting inside the kernel between each individual BPF program attachment. Pure waste, which was finally solved by multi-attach kprobes.
But unfortunately you are stuck on 5.10, so this attachment (and it is only attachment) slowness will stay.
> But I also suspect that I just saw that PT_REGS_IP(ctx) returns real IP + 1 and just compensated
That's fair!
I don't like blindly making the +1 or not architecture dependent when I don't know which arch does what, so I looked at it a bit more and it seems to be CPU dependent rather than arch dependent? In the linux code if you grep for "precise_ip" there's a bunch of matches, and depending on how the code is run it might not always be there:
arch/x86/events/intel/ds.c:

```c
/*
 * Haswell and later processors have an 'eventing IP'
 * (real IP) which fixes the off-by-1 skid in hardware.
 * Use it when precise_ip >= 2 :
 */
```
I guess we don't need a perfect fix given there's a better method available with >=5.15 kernels, but perhaps just assuming the return ip must be rounded down would be better? I don't think any return ip can have anything other than 0 in its low bits (ip & 0x7), as we often see those bits used for tagged pointers...
> This gets much (and I mean MUCH) faster starting from 5.18 kernel, when multi-attach kprobes were introduced.
Great! Not that happy to read it's a pure waste of time, but it'll go away eventually. I might wait a few years for multi attach before showing retsnoop off to coworkers but it's not like there's a hurry, and I'll just put up with it until then ;)
> > But I also suspect that I just saw that PT_REGS_IP(ctx) returns real IP + 1 and just compensated
>
> That's fair!
>
> I don't like blindly making the +1 or not architecture dependent when I don't know which arch does what, so I looked at it a bit more and it seems to be CPU dependent rather than arch dependent? In the linux code if you grep for "precise_ip" there's a bunch of matches, and depending on how the code is run it might not always be there:
>
> arch/x86/events/intel/ds.c: /* * Haswell and later processors have an 'eventing IP' * (real IP) which fixes the off-by-1 skid in hardware. * Use it when precise_ip >= 2 : */
>
> I guess we don't need a perfect fix given there's a better method available with >=5.15 kernels, but perhaps just assuming the return ip must be rounded down would be better? I don't think any return ip can be anything other than 0 & 0x7 as we often see this used for tagged pointers...
well, on my system I see in /proc/kallsyms:

```
ffffffff839c14e4 T classes_init
ffffffff839c1510 T platform_bus_init
ffffffff839c155d T cpu_dev_init
ffffffff839c15a9 T firmware_init
ffffffff839c15cd T driver_init
```

Which suggests the `& ~0x7` trick won't work, unfortunately?
> > This gets much (and I mean MUCH) faster starting from 5.18 kernel, when multi-attach kprobes were introduced.
>
> Great! Not that happy to read it's a pure waste of time, but it'll go away eventually. I might wait a few years for multi attach before showing retsnoop off to coworkers but it's not like there's a hurry, and I'll just put up with it until then ;)
It's quite usable when attaching up to a hundred functions or so, but I understand :) There are situations where I'd rather wait for 5 minutes for retsnoop to start just to be able to understand some kernel error, so...
> Which suggests & ~0x7 trick won't work, unfortunately?
Ugh, I guess I could have checked a bit more thoroughly sorry.
Well, offsetting by 1 conditionally on arch won't be worse than what we currently have anyway, so let's go with that. I'll try to whip up a patch next week-ish.
So it turns out the board at work I wanted to use retsnoop on is running a 5.10 kernel (with no plans to upgrade, welcome to embedded), which doesn't have `get_func_ip` (added in 5.15) yet. From a quick look at the code, there is a fallback through the calibration step that should get `kret_ip_off`, but it's stuck at 0, and retsnoop refuses to run.

After adding some logs to `calib_exit` I noticed entry_ip was off by one, so for aarch64 I got retsnoop to work with this:

Is that something that changes depending on the arch? Looking at the git log, that -1 has more or less always been here, so I assume that on x86_64 it is needed? If you know how that works we could probably make a list of archs that do or don't need it to fix this; there are still plenty of machines with pre-5.15 kernels out here.
With that change done, retsnoop works okay-ish for me: it's just quite slow with a large number of probes, e.g. from the example on your blog with simfail bpf-simple-obj doing full traces I get the following times:
10s to attach and 43s to detach is a tad slow; but I guess that's just hardware limitations. With fewer probes it's somewhat reasonable (attach/detach times respectively: 10/104ms for 1 probe, 1462/4371ms for 70... it's actually almost linear). I assume multi-attach introduced in 5.18 would speed that up quite a bit?