draios / sysdig

Linux system exploration and troubleshooting tool with first class support for containers
http://www.sysdig.com/

Include support for inode monitoring #289

Open Lakshmipathi opened 9 years ago

Lakshmipathi commented 9 years ago

At the moment sysdig supports filtering by filename (sysdig -A -c echo_fds "fd.filename=passwd") but does not support filtering by inode number, for example a hypothetical fd.inode="54321", assuming 54321 is the inode of passwd. Filtering by inode would capture I/O to that specific file even when it is reached via a hard link or a symlink.

jarun commented 9 years ago

Can someone confirm if this is required? I can try this out.

Hefeweizen commented 9 years ago

confirmed.

$ ln -s /etc/passwd foo
$ cat foo
$ sysdig -zr inode.20150221.1324.scap.z 'evt.type=open' -p "%fd.name"
/var/run/utmp
/etc/ld.so.cache
/lib/x86_64-linux-gnu/libc.so.6
/usr/lib/locale/locale-archive
/home/vagrant/foo
/var/run/utmp

jarun commented 9 years ago

Thanks for confirming. I am quite new to sysdig. Can you please explain your example so that I get a clearer picture of what you wanted to convey?

Hefeweizen commented 9 years ago

sure; no problem.

My understanding of the original request was that sysdig be able to dereference file access irrespective of exact file name. The request mentioned inode, as they had some understanding of implementation, but I chose to test the intent.

In my test, I created a symlink to /etc/passwd in my cwd, read its contents, and then checked to see if sysdig tracked the file call on the original file. It did not.

Returning to this topic, I'll comment on implementation: tracking on inode is inexact. Inodes can be repeated across different filesystems. That said, a simple solution would be to combine the inode filter with "fd.directory contains mount-point". However, in this instance, would fd.directory be the path to the original symlink or to the target file?
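The inode-collision point can be seen directly with stat(1): inode numbers are only unique within one filesystem, so the (device, inode) pair is the actual unique key. A quick illustration, assuming a Linux box with /proc mounted (the two paths live on different filesystems, so their device numbers differ):

```shell
# %d = device number, %i = inode number, %n = file name.
# /etc/passwd and /proc/self/status are on different filesystems,
# so even identical inode numbers would refer to different files.
stat -c 'dev=%d ino=%i name=%n' /etc/passwd /proc/self/status
```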

jarun commented 9 years ago

Hardlinks to the same file have the same inode number. However, a symbolic link has a distinct inode number of its own. So how are you guys relating the symlink to the inode of the target file here? Or do you want the implementation to work such that tracking the symlink tracks the target file too?

Hefeweizen commented 9 years ago

@jarun, my understanding of the request is that applying a filter for inode=1234 should capture any access attempts to that chunk of disk, no matter how it was reached: by file name, hard link*, or symlink. Yes, a symlink has its own inode entry, but after its lookup it points to another file name, which has its own inode. It's this second inode that I would expect the filter to observe.

(* - and implementation-wise, there's no difference between original file name and hard link)
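The hard link / symlink distinction above is easy to demonstrate with stat(1). Note that GNU stat does not follow symlinks unless -L is given. This is a standalone illustration, not sysdig-specific:

```shell
# Create a file, a hard link, and a symlink in a scratch directory.
dir=$(mktemp -d)
cd "$dir"
echo data > target
ln target hardlink     # hard link: shares target's inode
ln -s target symlink   # symlink: has an inode of its own
stat -c '%n %i' target hardlink symlink   # symlink line shows its own inode
stat -L -c '%n %i' symlink                # -L dereferences: prints target's inode
```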

jarun commented 9 years ago

Thanks for clarifying.

jarun commented 9 years ago

Started working on it and need some help with the flow. I have traversed the code from sysdig_init() to sinsp_parser::process_event(). What is the event which needs to be parsed for the fd related processing? Or is it multiple events like PPME_SYSCALL_READ_X, PPME_SYSCALL_WRITE_X etc.? In which function is the processing done for fd? parse_rw_exit() or m_inspector->m_filter->run(evt)?

Do I need to make changes in the kernel driver for inode tracking? What are the relevant structures or functions?

Even if the questions seem too simple, I am quite lost tracing through the userspace calls using gdb.

omid-s commented 5 years ago

Does anybody know of any progress on this issue? Or any other way one can simulate an inode?
fd.uid is the closest one, I guess, but it's a chaining of thread id to fd number, which is very process dependent.

gianlucaborello commented 5 years ago

I'm working on a new feature that could heavily benefit from getting the device/inode numbers every time a new file is opened (in my case it's for uniquely identifying memory mapped executables across different processes/containers). So, you can expect that this will come at some point (for the moment I'm resolving it by using /proc/$PID/maps, which contains, among other things, the device/inode combo for any mapped file, but of course is less ideal than if it was coming straight from the kernel events).
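The /proc/$PID/maps workaround mentioned above looks roughly like this. The columns in maps are address, perms, offset, dev, inode, pathname, so the device/inode combo for every file-backed mapping can be pulled out with awk (a sketch of the idea, not the actual feature code):

```shell
# Print the dev and inode columns for each file-backed mapping of the
# current process (file-backed mappings have a pathname starting with "/").
awk '$6 ~ /^\// { print $4, $5, $6 }' /proc/self/maps | sort -u
```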

gianlucaborello commented 5 years ago

(Commenting if someone else ventures into this)

I spent some time today in what I thought would be an easy change to the kernel driver/eBPF probe to export the device/inode numbers for open/mmap operations, in order to have better auditing.

Surprisingly, it's not easy, because we can't just assume that the struct navigation task_struct -> fd -> struct file -> struct inode -> dev/inode works: in most file systems it works that way, but in some cases (e.g. overlayfs, widely used in containers), the file system code has the ability to keep the real inode stored somewhere else, and the inode accessible via the struct navigation above is a more or less meaningless sequential number that can also change at runtime.

So, the correct fix would be to really emulate what the stat() system call does, which is properly querying the vfs subsystem using the right inode_operations installed for the file. It is basically impossible to do such a thing from eBPF, and it might also be unsafe from the kernel driver, so for the moment I'm giving up and sadly patching things up in userspace using stat().

ldegio commented 5 years ago

can we adopt the kernel approach for the majority of the cases where it would work, and only use stat() when strictly required? I assume it would be more efficient?

gianlucaborello commented 5 years ago

The problem is that we don't know when it is "strictly required": there is no obvious way (that I could find, at least) to distinguish an invalid inode from a valid one; everything happens inside the fs-specific code. It definitely took me a while to realize: I had all the code working, with several tests as well, but when I used it inside a container it was not working, and I had to trace what happens with perf.

The only thing I can think of would be heuristics, but they would either suffer too low a hit ratio (e.g. discard every inode coming from a process inside a container, which these days would still mean the majority of system activity) or horrible accuracy (e.g. parse in sysdig all the currently mounted file systems for every container, then somehow resolve all opened/mmapped symlinks against those file systems, and filter out the inodes that don't correspond to a "good" file system).

For my use case I really need bullet-proof accuracy since experiencing a false positive could mean doing dynamic instrumentation of the same binary twice and that's not an idempotent operation (and the whole reason why I moved away from just path names as unique identifiers), so I'll likely be doing those stat() calls in a separate thread (the same one where I do the dynamic instrumentation itself) and evaluate the impact.

Also, for my use case I am only interested in mmapped files with PROT_EXEC, so that should be orders of magnitude lower than doing it for all open/mmap events (usually there are ~5-10 shared libraries mapped in any process).
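As a rough sanity check on that estimate, the executable file-backed mappings of a process can be counted from /proc/$PID/maps, where the perms column contains "x" for PROT_EXEC mappings. A sketch assuming a Linux /proc:

```shell
# Count unique file-backed mappings with the executable bit set,
# i.e. the distinct files that would need a stat() under this scheme.
awk '$2 ~ /x/ && $6 ~ /^\// { print $6 }' /proc/self/maps | sort -u | wc -l
```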

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.