File Events: Implement EBPF_EVENT_FILE_DELETE

fntlnz commented 2 years ago

Stories

As a user of the EventsTrace program I want to be able to see printed on screen when a file is deleted.
As a user of the libebpf library I want to be able to receive file deletion events when calling ebpf_event__next.

Data needed

Field	Type	Description
File Name	char[256]	Name of the file being deleted
Full File Path	char[4096]	Full source path of the file being deleted
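
As a rough illustration of the two fields above, here is a minimal userspace sketch. The struct name `file_delete_event` and the helper `file_delete_event_fill` are hypothetical and not part of libebpf; only the field sizes come from the table.

```c
#include <stddef.h>
#include <string.h>

/* Sizes taken from the "Data needed" table. */
#define FILE_NAME_MAX 256
#define FILE_PATH_MAX 4096

/* Hypothetical layout; the real libebpf event struct may differ. */
struct file_delete_event {
    char file_name[FILE_NAME_MAX];
    char file_path[FILE_PATH_MAX];
};

/* Fill an event from a full path. The file name is taken to be the last
 * path component. Returns 0 on success, -1 if a field would be truncated. */
static int file_delete_event_fill(struct file_delete_event *ev,
                                  const char *full_path)
{
    size_t path_len = strlen(full_path);
    if (path_len >= FILE_PATH_MAX)
        return -1;

    const char *slash = strrchr(full_path, '/');
    const char *name = slash ? slash + 1 : full_path;
    size_t name_len = strlen(name);
    if (name_len >= FILE_NAME_MAX)
        return -1;

    /* Copy including the terminating NUL byte. */
    memcpy(ev->file_path, full_path, path_len + 1);
    memcpy(ev->file_name, name, name_len + 1);
    return 0;
}
```

Rejecting over-long input instead of silently truncating is a choice made for this sketch; a probe running in the kernel would more likely truncate and flag the event.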

Probe

As a starting point, some possible hooking points for this event could be:

~fexit on int security_path_unlink(const struct path *dir, struct dentry *dentry); - Needs a kernel compiled with CONFIG_SECURITY_PATH~ - We can't do this, see this comment https://github.com/elastic/ebpf/issues/46#issuecomment-1000455585
fexit on int vfs_unlink(struct user_namespace *mnt_userns, struct inode *dir, struct dentry *dentry, struct inode **delegated_inode)

Action items

[ ] Write one or more probes that can catch when a file is deleted
[ ] Write BPF_PROG_TEST_RUN tests for the probes (if supported)
[ ] Write integration tests for the probes
[ ] Hook the event to be printed by the EventsTrace program

fntlnz commented 2 years ago

An initial version of this program was added in https://github.com/elastic/ebpf/pull/40

We have a few issues to fix related to dentries (the following will likely impact this as well as #44 and #45)

[ ] Check if the memory impact of ebpf_event_file_path__from_dentry when building path arrays is low enough for the frequency of usage or needs to be optimized (initially suggested in review by @rhysre)
[ ] Check if the path depth we are using is enough and what happens when a path is deeper than that - This scenario needs to be investigated and we need to understand what the limitations are, if any.
[ ] When some conditions are met the root directory of the path is not found (@nicholasberlin was suggesting this might be because of the different mountpoints) - This needs to be investigated and fixed

fntlnz commented 2 years ago

The approach we are using right now with security_path_unlink needs to be changed in favor (probably) of vfs_unlink because AL2 does not have a kernel with CONFIG_SECURITY_PATH even though it ships very recent kernels. I suspect this might be common in other distros.

AL2 aarch64

cat /boot/config-5.10.82-83.359.amzn2.aarch64 | grep -i CONFIG_SECURITY_PATH
# CONFIG_SECURITY_PATH is not set

AL2 x86_64

cat /boot/config-5.10.75-79.358.amzn2.x86_64 | grep -i CONFIG_SECURITY_PATH
# CONFIG_SECURITY_PATH is not set

rhysre commented 2 years ago

The approach we are using right now with security_path_unlink needs to be changed in favor (probably) of vfs_unlink because AL2 does not have a kernel with CONFIG_SECURITY_PATH even though it ships very recent kernels. I suspect this might be common in other distros.

AL2 aarch64
cat /boot/config-5.10.82-83.359.amzn2.aarch64 | grep -i CONFIG_SECURITY_PATH
# CONFIG_SECURITY_PATH is not set
AL2 x86_64
cat /boot/config-5.10.75-79.358.amzn2.x86_64 | grep -i CONFIG_SECURITY_PATH
# CONFIG_SECURITY_PATH is not set

Indeed, this was an issue for us at Cmd as well. Lots and lots of distros we needed to support had CONFIG_SECURITY_PATH disabled. I went back and dug up a summary I put together a while ago about exactly that (note this is only x86, and only the kernels we at Cmd supported, things may have changed in really new kernels we don't yet support):

Status of CONFIG_SECURITY_PATH:

Fedora:         disabled for all
Amazon Linux 2: disabled for all
Oracle UEK:     Mish-mash, disabled for most, but enabled for a handful of 3.10 kernels
CentOS:         enabled for all 3.10 kernels (CentOS 7), disabled for all 4.18 kernels (CentOS 8)
Ubuntu:         enabled for all
Debian:         enabled for all
Kali:           enabled for all
Google COS:     enabled for all

The general theme seems to be that distros that use selinux (which doesn't rely on pathnames) have CONFIG_SECURITY_PATH disabled, while distros that use apparmor/TOMOYO or another LSM that's based on pathnames, have it enabled.
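
To make the `grep` check on `/boot/config-*` shown earlier reusable from a loader, something like the following could work. This is only a sketch: `kconfig_enabled` is a hypothetical helper, and it assumes the config text has already been read into memory (from `/boot/config-$(uname -r)` or similar).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `option` is set to y or m in the given kernel config text.
 * Lines such as "# CONFIG_FOO is not set" start with '#' and never match. */
static bool kconfig_enabled(const char *config_text, const char *option)
{
    size_t opt_len = strlen(option);
    const char *line = config_text;

    while (line && *line) {
        /* Require '=' right after the name so that CONFIG_FOO does not
         * match CONFIG_FOO_BAR. */
        if (strncmp(line, option, opt_len) == 0 && line[opt_len] == '=') {
            char value = line[opt_len + 1];
            return value == 'y' || value == 'm';
        }
        const char *nl = strchr(line, '\n');
        line = nl ? nl + 1 : NULL;
    }
    return false;
}
```

A loader could call `kconfig_enabled(text, "CONFIG_SECURITY_PATH")` and only attach to security_path_unlink when it returns true, falling back to another hook otherwise.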

@mattnite and I implemented file events probes for Cmd that were kprobe-based (we gather information on open/unlink/link/rename). We have those probes already written, which (if we decide to go that route) will likely save the person implementing them over here a bunch of time. Naturally, a kprobes-based solution is a bit flaky due to the unstable function args, but we found it to be the best possible solution that works on all kernels we need to support.

fntlnz commented 2 years ago

@rhysre thanks for the analysis 🧐

It is trivial to implement the program itself; I’m more interested in finding a solution that is easy to maintain and evolve. As you mentioned kprobes are a bit flaky. I still think we want to go the fentry route for this stuff. I’ve been thinking about attaching to the vfs directly because it is the most fundamental unit that gets called all the time in these events. I’m excluding tracepoints on syscalls because I want to make sure we are 100% covered.

I also think that (in the future) we should cover the same events with multiple program types and choose which ones to use based on the capabilities. I just want to start with the broader one now for these events.

Does that make sense to you? If you have any other ideas please lmk ☺️

rhysre commented 2 years ago

As you mentioned kprobes are a bit flaky. I still think we want to go the fentry route for this stuff.

Correct me if I'm wrong, but fundamentally isn't this more-or-less the same as using kprobes? My understanding is that with fentry probes we get better performance (because there's no CPU interrupt like in kprobes) and we have access to typed arguments via BTF. So some niceties, but nothing fundamentally changes between attaching a kprobe to vfs_open vs an fentry probe to vfs_open.

It is trivial to implement the program itself

You've got more experience in this area than I do, so please do correct me if I'm missing something (I'd absolutely love for that to be the case 😛 ), but the file-telemetry probes we implemented at Cmd were nontrivial and took a while (a few weeks) to write correctly. Based on all the knowledge I've gleaned researching the vfs subsystem, it's not an easy task.

We attached to the vfs_<operation> functions (like you suggested), however all of them except vfs_open don't provide access to the struct path being operated on.

Take vfs_unlink for example; its prototype is:

int vfs_unlink(struct user_namespace *mnt_userns, struct inode *dir,
           struct dentry *dentry, struct inode **delegated_inode)

We can't construct the full path being unlinked with just the struct dentry; that will only allow us to get the pathname up to the first mountpoint. To get the full, absolute path, we need the struct path. We got the struct path by attaching to mnt_want_write, which is called before vfs_unlink and is passed the struct vfsmount. We cached the struct vfsmount in a map and later used it to construct the struct path (by combining it with the struct dentry passed to vfs_unlink).
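
To illustrate why the dentry alone stops at the mountpoint, here is a userspace simulation. The types `mock_dentry` and `mock_mount` and the function `resolve_path` are simplified stand-ins invented for this sketch; they are not the kernel structures and this is not the Cmd implementation.

```c
#include <stddef.h>
#include <string.h>

struct mock_dentry {
    const char *name;
    struct mock_dentry *parent;     /* a filesystem root points to itself */
};

struct mock_mount {
    struct mock_dentry *root;       /* root dentry of this filesystem */
    struct mock_dentry *mountpoint; /* dentry in the parent mount it covers */
    struct mock_mount *parent;      /* the topmost mount points to itself */
};

/* Builds the path right-to-left in buf and returns a pointer to its start,
 * or NULL if buf is too small. With mnt == NULL only the dentry chain is
 * walked, so the result ends at the root of the file's own filesystem. */
static const char *resolve_path(struct mock_dentry *d, struct mock_mount *mnt,
                                char *buf, size_t size)
{
    char *pos = buf + size - 1;
    *pos = '\0';

    for (;;) {
        int at_root = (d->parent == d) || (mnt && d == mnt->root);
        if (at_root) {
            if (!mnt || mnt->parent == mnt)
                break;              /* nothing above us */
            d = mnt->mountpoint;    /* cross into the parent mount */
            mnt = mnt->parent;
            continue;
        }
        size_t len = strlen(d->name);
        if ((size_t)(pos - buf) < len + 1)
            return NULL;
        pos -= len;
        memcpy(pos, d->name, len);
        *--pos = '/';
        d = d->parent;
    }
    if (*pos == '\0') {             /* the target itself was the root */
        if (pos == buf)
            return NULL;
        *--pos = '/';
    }
    return pos;
}
```

For a file `logs/app.log` on a filesystem mounted at `/mnt`, calling this with `mnt == NULL` gives `/logs/app.log`, while passing the mount gives `/mnt/logs/app.log`. That difference is the reason the vfsmount has to be obtained from somewhere other than vfs_unlink's arguments.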

These probes relied on the subject-to-change implementation detail that mnt_want_write was called before vfs_[unlink,link,rename], had to maintain state between various kernel functions being called via a map, and were super finicky. I often called it a Rube Goldberg machine, but it worked, and we didn't see any other way to gather file data for all the kernels we wanted to support.

On top of all this complexity, overlayfs added another whole layer of complexity to everything. If we want to support telemetry on file operations on overlayfs filesystems (i.e. in basically all containers), we've got to deal with its quirks. Namely, overlayfs code (which is called into by the VFS layer) itself calls back into the VFS layer, so you get something like this:

open syscall entry -> ... -> vfs_open -> ovl_open -> ovl_open_realfile -> vfs_open (2nd time) -> <fs_specific stuff for file on actual filesystem>.

The 2nd time vfs_open is hit, it's hit with a dummy struct path, which can't be resolved correctly with dentry walking (see open_with_fake_path). Rename/link/unlink all do similar things on overlayfs. We thus have to avoid this 2nd hit of vfs_open by caching whether or not it's been hit before in a map. Naturally, we're relying on even more subject-to-change implementation details by doing this.
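
One way to picture the "has vfs_open already been hit" cache is a per-thread nesting counter. The sketch below simulates that in userspace with a fixed-size array standing in for a BPF hash map; all names are hypothetical, and a depth counter is a variation on the boolean flag described above rather than what Cmd actually shipped.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64

/* Stand-in for a BPF hash map keyed by thread id. */
struct depth_slot {
    uint32_t tid;
    uint32_t depth;
    bool used;
};
static struct depth_slot depth_map[MAX_TRACKED];

static struct depth_slot *depth_lookup(uint32_t tid, bool create)
{
    struct depth_slot *free_slot = NULL;
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (depth_map[i].used && depth_map[i].tid == tid)
            return &depth_map[i];
        if (!depth_map[i].used && !free_slot)
            free_slot = &depth_map[i];
    }
    if (create && free_slot) {
        free_slot->used = true;
        free_slot->tid = tid;
        free_slot->depth = 0;
        return free_slot;
    }
    return NULL;
}

/* Call on vfs_open entry. Returns true only for the outermost call on this
 * thread, i.e. the one that carries the real struct path. */
static bool on_vfs_open_enter(uint32_t tid)
{
    struct depth_slot *s = depth_lookup(tid, true);
    if (!s)
        return false;   /* map full: drop rather than emit a bogus event */
    return s->depth++ == 0;
}

/* Call on vfs_open exit. Frees the slot once the outermost call returns. */
static void on_vfs_open_exit(uint32_t tid)
{
    struct depth_slot *s = depth_lookup(tid, false);
    if (s && --s->depth == 0)
        s->used = false;
}
```

With this, the nested vfs_open reached through ovl_open_realfile sees a non-zero depth and is skipped, while an unrelated thread opening a file at the same time is unaffected.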

So tl;dr if we want to avoid security_path_<operation> to broaden the list of kernels we can support, attaching to functions in the VFS layer seems like the next best thing, but based on the research I've done into the subject, it's really finicky.

Again, I could be missing something here, please do let me know if so! If not though, I think whoever ends up implementing this will appreciate access to our prewritten probes. 😛

fntlnz commented 2 years ago

We attached to the vfs_<operation> functions (like you suggested), however all of them except vfs_open don't provide access to the struct path being operated on.

Totally agree on this. Attaching a single probe doesn’t solve the problem; we need to build a chain of probes like you suggested and iterate on that.

On top of all this complexity, overlayfs added another whole layer of complexity to everything. If we want to support telemetry on file operations on overlayfs filesystems (i.e. in basically all containers), we've got to deal with its quirks. Namely, overlayfs code (which is called into by the VFS layer) itself calls back into the VFS layer, so you get something like this:

I suspected this would be an issue because we don’t instrument the syscall. When I wrote my original reply I had no idea you had already solved this problem, good job.

As I said, I’ll happily take a look at what you did at Cmd because I’m sure I can learn a lot from your work, but I fundamentally think that to simplify things we should think about going for fentry/fexit over kprobes.

On the other hand, kprobes would be easier to test with BPF_PROG_TEST_RUN, so when you have some time let’s take a look at the Cmd code together so we have a better basis for making a decision ☺️

leodido commented 2 years ago

As you mentioned kprobes are a bit flaky. I still think we want to go the fentry route for this stuff.

Correct me if I'm wrong, but fundamentally isn't this more-or-less the same as using kprobes? My understanding is that with fentry probes we get better performance (because there's no CPU interrupt like in kprobes) and we have access to typed arguments via BTF. So some niceties, but nothing fundamentally changes between attaching a kprobe to vfs_open vs an fentry probe to vfs_open.

Another nicety arises when comparing fexits to kretprobes w.r.t. function parameters.

In my experience, getting the parameters from kretprobes is not reliable. It becomes a matter of luck. That's because of their nature: they get executed just before the function is exiting and at that point, the registers containing the input parameters could be gone... You would have to pair it with a corresponding kprobe to reliably fetch the input parameters.

Instead, the fexit programs guarantee the input parameters of the function you're tracing.

So 1 program (the fexit) rather than 2 (kprobe + kretprobe) to maintain in cases where we need to look up the exiting of a function.. :)
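
For comparison, the kprobe + kretprobe pairing usually amounts to stashing the arguments at entry and picking them up at return. Here is a userspace simulation of that bookkeeping, with an array standing in for a BPF map keyed by thread id; `on_entry`, `on_return` and `saved_args` are hypothetical names for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 64

/* One in-flight call per thread; stands in for a BPF hash map. */
struct saved_args {
    uint32_t tid;
    const void *dentry;
    bool used;
};
static struct saved_args inflight[MAX_INFLIGHT];

/* kprobe side: remember the argument while it is still in registers. */
static bool on_entry(uint32_t tid, const void *dentry)
{
    struct saved_args *slot = NULL;
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (inflight[i].used && inflight[i].tid == tid) {
            slot = &inflight[i];    /* reuse a stale entry for this thread */
            break;
        }
        if (!inflight[i].used && !slot)
            slot = &inflight[i];
    }
    if (!slot)
        return false;
    slot->tid = tid;
    slot->dentry = dentry;
    slot->used = true;
    return true;
}

/* kretprobe side: fetch and delete the saved argument. Returns true and sets
 * *dentry_out only if the traced call succeeded (retval == 0). */
static bool on_return(uint32_t tid, int retval, const void **dentry_out)
{
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (inflight[i].used && inflight[i].tid == tid) {
            inflight[i].used = false;
            if (retval != 0)
                return false;
            *dentry_out = inflight[i].dentry;
            return true;
        }
    }
    return false;                   /* entry was missed: nothing to report */
}
```

With fexit none of this state is needed, since the program receives both the original arguments and the return value in one place.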

My 2 cents, ~leo

stanek-michal commented 2 years ago

About the fentry/fexit vs. kprobe/kretprobe - wouldn't we limit the range of supported kernel versions significantly if we went the fentry route? From what I understand, fentry/fexit programs are available since around 5.5.
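
If the choice between the two were made at load time, a crude version check could look like the sketch below. The 5.5 threshold is taken from the comment above and is only approximate; `release_at_least` and `pick_probe_kind` are hypothetical helpers, and a real loader would more likely test for the feature itself (for example BTF availability) than trust the version string, since distros backport features.

```c
#include <stdbool.h>
#include <stdio.h>

/* Compare the "major.minor" prefix of a kernel release string such as
 * "5.10.82-83.359.amzn2.aarch64" against a minimum version. */
static bool release_at_least(const char *release, int want_major, int want_minor)
{
    int major = 0, minor = 0;
    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return false;               /* unparsable: assume the older kernel */
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}

enum probe_kind { PROBE_KPROBE, PROBE_FENTRY };

/* Assumption for this sketch: fentry/fexit is usable from 5.5 onwards. */
static enum probe_kind pick_probe_kind(const char *release)
{
    return release_at_least(release, 5, 5) ? PROBE_FENTRY : PROBE_KPROBE;
}
```

The release string would typically come from `uname(2)` (the `release` field of `struct utsname`).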
