feature request: inspect fd.name and fd.directory path in the initial mount namespace

AkihiroSuda commented 5 years ago

Falco's "Write below binary dir" rule can't detect writes from containers (with bind mounts), because fd.directory is resolved to a path in the container's mount namespace, not to the corresponding path in the initial mount namespace (the "host").

It would be great if Falco/sysdig could also inspect the path in the initial mount namespace.

krisnova commented 5 years ago

I think the keyword here is "also".

In my mind we want to keep the inspection in the unshared namespace in the container, while also inspecting the original mount namespace on the host. I think we should be able to do both (given we can track down the init namespace). We just need a way of separating the two elements if we can find both.

I also think this same pattern could be repeated for other namespaces.

If I were to make a PR, I would start looking here: https://github.com/draios/sysdig/blob/dev/userspace/libscap/scap_fds.c#L1882-L1885

gnosek commented 5 years ago

We can't just make fd.name and fd.directory contain init-ns paths for two reasons:

1. compatibility
2. there might not be a path in the init ns at all (a filesystem may have been mounted directly in the child namespace)

Having said that, if there is a corresponding init ns path, we should be able to get it based on /proc/pid/mountinfo contents in both namespaces. It may be tricky to get right (there are things like overlapping filesystems and mount --move to handle) but should be possible.
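A minimal sketch of that lookup, for illustration only. It assumes the mountinfo text of both namespaces has already been read, matches filesystems by their major:minor device field, and deliberately ignores the tricky cases mentioned above (overlapping mounts, mount --move races, overlayfs layers). The names `Mount`, `parse_mountinfo` and `to_init_ns_path` are hypothetical and not part of sysdig or Falco.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mount:
    dev: str          # "major:minor" of the backing filesystem
    root: str         # path inside that filesystem used as the mount root
    mount_point: str  # where the mount is visible in this namespace


def _unescape(field: str) -> str:
    # mountinfo escapes space, tab, newline and backslash as octal sequences.
    for esc, ch in (("\\040", " "), ("\\011", "\t"), ("\\012", "\n"), ("\\134", "\\")):
        field = field.replace(esc, ch)
    return field


def parse_mountinfo(text: str) -> List[Mount]:
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7 or "-" not in fields:
            continue  # not a well-formed mountinfo line
        mounts.append(Mount(fields[2], _unescape(fields[3]), _unescape(fields[4])))
    return mounts


def _is_under(path: str, prefix: str) -> bool:
    return path == prefix or path.startswith(prefix.rstrip("/") + "/")


def _relative(path: str, prefix: str) -> str:
    return path[len(prefix.rstrip("/")):].lstrip("/")


def _join(base: str, rel: str) -> str:
    return base if not rel else base.rstrip("/") + "/" + rel


def to_init_ns_path(path: str, container: List[Mount], host: List[Mount]) -> Optional[str]:
    covering = [m for m in container if _is_under(path, m.mount_point)]
    if not covering:
        return None
    # Longest mount point wins; on a tie, the later entry (mounted on top) wins.
    best = max(enumerate(covering), key=lambda im: (len(im[1].mount_point), im[0]))[1]
    fs_path = _join(best.root, _relative(path, best.mount_point))
    for m in host:
        if m.dev == best.dev and _is_under(fs_path, m.root):
            return _join(m.mount_point, _relative(fs_path, m.root))
    return None  # no corresponding init ns path (e.g. a tmpfs mounted in the container)


CONTAINER_MOUNTINFO = """\
100 90 0:50 / / rw - overlay overlay rw
101 100 8:1 /srv/app/bin /usr/bin rw shared:1 - ext4 /dev/sda1 rw
102 100 0:60 / /tmp rw - tmpfs tmpfs rw
"""
HOST_MOUNTINFO = "20 1 8:1 / / rw - ext4 /dev/sda1 rw\n"

container = parse_mountinfo(CONTAINER_MOUNTINFO)
host = parse_mountinfo(HOST_MOUNTINFO)
print(to_init_ns_path("/usr/bin/ls", container, host))  # /srv/app/bin/ls
print(to_init_ns_path("/tmp/x", container, host))       # None
```

The second lookup returns None because the tmpfs exists only in the container, which is the "no path in the init ns at all" case from the list above.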

Still, we'd need to cache the mountinfo data for performance reasons and we would need to invalidate the cache at the right time.

For the common case, we can just ignore the invalidation and expect that you cannot mount a disk-based filesystem in a container (you can mount e.g. a tmpfs or maybe an nfs share, but neither of these corresponds to a host path).

The story for well-behaved Docker-like containers ends here. For custom/malicious stuff, it only begins.

We might e.g. hook into mount event processing to detect mount changes; then we'd still have to reread the file periodically in case we lost any events, and there's a balance to strike (too frequent and we kill performance, too rare and there's a larger window in which we can misreport the host paths).
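The caching trade-off could be sketched as follows. This is only an illustration: `MountInfoCache`, its `max_age_s` parameter and the `read_mountinfo` loader are hypothetical names, not existing sysdig code. An entry is dropped explicitly when a mount event is seen and is otherwise reread once it is older than `max_age_s`, which bounds the window for missed events.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def read_mountinfo(pid: int) -> str:
    # Default loader: read the raw mountinfo of one process (Linux only).
    with open(f"/proc/{pid}/mountinfo") as f:
        return f.read()


class MountInfoCache:
    def __init__(
        self,
        loader: Callable[[int], str] = read_mountinfo,
        max_age_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._max_age_s = max_age_s
        self._clock = clock
        self._entries: Dict[int, Tuple[float, str]] = {}

    def get(self, pid: int) -> str:
        now = self._clock()
        entry = self._entries.get(pid)
        if entry is not None and now - entry[0] < self._max_age_s:
            return entry[1]  # still fresh enough
        data = self._loader(pid)  # missing or too old: reread
        self._entries[pid] = (now, data)
        return data

    def invalidate(self, pid: Optional[int] = None) -> None:
        # Call this when a mount/umount event is observed for pid;
        # with no argument, drop everything.
        if pid is None:
            self._entries.clear()
        else:
            self._entries.pop(pid, None)
```

A smaller `max_age_s` narrows the window in which stale host paths can be reported, at the cost of more reads of /proc.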

Ideally, we need to keep the mountinfo per process (or maybe even per thread, not sure about that), not just per container, and reimplement the kernel logic of private/slave/rslave mounts in userspace. If we get this wrong (or just ignore it), a malicious containerized process with enough access will be able to change or hide the host paths by e.g. calling unshare to enter a new mount namespace and using mount --move to shuffle things around.
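One possible way to notice that a process has moved into a different mount namespace is to compare the /proc/<pid>/ns/mnt links, which read as `mnt:[<inode>]`. The sketch below is an assumption about how that check could look, with hypothetical helper names; it does not reimplement mount propagation, and reading another process's link needs sufficient privileges.

```python
import os
import re
from typing import Optional, Union


def parse_ns_link(link: str) -> Optional[int]:
    # The link target looks like "mnt:[4026531840]".
    m = re.fullmatch(r"mnt:\[(\d+)\]", link)
    return int(m.group(1)) if m else None


def mount_ns_id(pid: Union[int, str]) -> Optional[int]:
    try:
        return parse_ns_link(os.readlink(f"/proc/{pid}/ns/mnt"))
    except OSError:
        return None  # process gone, no /proc, or not permitted


def in_init_mount_ns(pid: Union[int, str]) -> bool:
    # Treats pid 1's mount namespace as the initial one, which holds
    # only when this code itself runs in the host pid namespace.
    ns, init_ns = mount_ns_id(pid), mount_ns_id(1)
    return ns is not None and ns == init_ns
```

A process whose id changes between two events has switched mount namespaces, so any cached mountinfo for it would need to be reread.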

We could also try getting the relevant data from the driver but I'd have to check the sources to see if there's a viable kernel API for this.

fntlnz commented 5 years ago

When the concern is executing those binaries rather than writing to them (as in the rule that @AkihiroSuda posted), we have the same kind of issue when dealing with symlinks.


draios / sysdig

feature request: inspect fd.name and fd.directory path in the initial mount namespace #1508