Process containment using FUSE instead of CGROUP

ccaapton commented 10 years ago

Currently cgroups are used in systemd/openrc for process containment. Although cgroups are a very easy way on Linux to handle daemons that double-fork or crash, they seem to be the major obstacle to making systemd/openrc cross-platform.

I'm thinking about an alternative approach to process containment that uses the FUSE interface, which is widely available on the major Unix systems. Below is a brief description:

a). A FUSE daemon provides a special file system, let's say /run/initfs.

b). For every daemon that needs to be contained, we start a helper process first and open a file in the special file system. For instance, to start the apache daemon, we start the helper and create/open the file "/run/initfs/apache". Make sure close-on-exec is NOT set on this file descriptor.

c). Fork-exec to start the daemon. Now we can identify every process holding a reference to "/run/initfs/apache" as part of the apache daemon.

d). The initfs could prevent daemons from accidentally closing this fd by returning an error code to the client's "close()" call via the FUSE API, unless the client process exits.

What do you think?
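Steps b) and c) can be sketched roughly as follows. This is only an illustration under loud assumptions: an ordinary file stands in for the FUSE-backed "/run/initfs/apache", membership is found by scanning the Linux-specific /proc tree (so it is not portable as written), and the helper names `start_contained` and `pids_holding` are made up for this example.

```python
import os
import subprocess


def start_contained(marker_path, argv):
    """Open the marker file and start argv with the descriptor inherited."""
    # Python marks new descriptors close-on-exec by default, so the
    # descriptor is handed to the child explicitly through pass_fds.
    fd = os.open(marker_path, os.O_RDONLY | os.O_CREAT, 0o600)
    try:
        proc = subprocess.Popen(argv, pass_fds=(fd,))
    finally:
        os.close(fd)  # the helper's own copy is not needed for tracking
    return proc


def pids_holding(marker_path):
    """Return the PIDs that have marker_path open (Linux /proc only)."""
    target = os.path.realpath(marker_path)
    holders = set()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        fd_dir = f"/proc/{entry}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we may not inspect it
        for fd in fds:
            try:
                if os.readlink(f"{fd_dir}/{fd}") == target:
                    holders.add(int(entry))
                    break
            except OSError:
                continue  # descriptor closed while we were scanning
    return holders
```

A supervisor could call `pids_holding("/run/initfs/apache")` to list the daemon's processes. Nothing in this sketch stops a process from closing the descriptor, which is the part step d) would have to solve.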

qnikst commented 10 years ago

Just to note, openrc does not depend on cgroups the way systemd does, so it's not a blocker; some functionality will just be missing.

The idea is nice, but is there any implementation of such a FUSE file system? I don't see how to implement it properly without ptrace, Linux-specific solutions, or a solution in the kernel.

b). Some PID 1 implementations, such as s6, are very close to that idea; however, there are difficulties in using them cleanly from openrc.

williamh commented 10 years ago

This patch, https://bugs.gentoo.org/show_bug.cgi?id=501364#C9, is my latest patch to add runit support. The only open question, on which I am waiting for feedback, is when to start the program that monitors /run/openrc/sv/runit.

Once runit is in the tree, it will be easy to follow the same pattern for s6.

CameronNemo commented 10 years ago

Why not just use sessions? If a daemon is a forking daemon, just use getsid() and voila, you know the sid to kill. There is always the issue of a child other than the main PID calling setsid(), but usually you want to keep those alive anyway.
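A minimal sketch of the session idea, under stated assumptions: the daemon is started as a session leader, membership is found by walking the Linux-specific /proc tree and calling getsid() on each PID, and the helper names (`start_in_new_session`, `pids_in_session`, `kill_session`) are hypothetical. As noted, a child that calls setsid() itself escapes this.

```python
import os
import signal
import subprocess


def start_in_new_session(argv):
    """Start argv as the leader of a fresh session; return (proc, sid)."""
    # start_new_session=True makes the child call setsid() before exec.
    proc = subprocess.Popen(argv, start_new_session=True)
    sid = os.getsid(proc.pid)
    return proc, sid


def pids_in_session(sid):
    """Return the PIDs whose session ID is sid (walks Linux /proc)."""
    members = set()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            if os.getsid(int(entry)) == sid:
                members.add(int(entry))
        except OSError:
            continue  # process exited, or we may not query it
    return members


def kill_session(sid, sig=signal.SIGKILL):
    """Send sig to every process currently in the session."""
    killed = set()
    for pid in pids_in_session(sid):
        try:
            os.kill(pid, sig)
            killed.add(pid)
        except OSError:
            pass  # already gone, or not ours to signal
    return killed
```

Because the session ID survives a double fork, `kill_session(sid)` still reaches the daemon after its original parent has exited. The scan and the kill are not atomic, so a process forked in between can be missed.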

jcnelson commented 10 years ago

Hi ccaapton,

Further research on lkml (back during the discussions on the semantics of close() and fsync() with regard to handling EINTR) and on the FUSE mailing list indicates that the kernel releases a file descriptor's resources regardless of whether the VFS implementation's release() method succeeds. So, even if a FUSE filesystem implemented release() to always fail, the kernel will always deallocate the file descriptor on close(), and it will always disappear from /proc.

I confirmed this with a toy FUSE implementation.

ccaapton commented 10 years ago

@jcnelson Thanks for the information, and the toy implementation to verify my thoughts!

So when a process calls close(), does it take effect immediately, or will it block and wait for the fuse daemon to 'release()'? If it blocks, then the fuse daemon can kill this process immediately at this point to avoid escaping.

I'm not sure whether there is a race condition between close() and a sys_clone() from another thread. After close() is called but before the kernel has confirmed it, the process may fork a child. Does this child have the old file descriptor? If yes, then the child process will always be contained.

jcnelson commented 10 years ago

@ccaapton According to Linus, close() takes effect immediately. The VFS release() method can get called well after close() returns, even after the process that called close() dies.

If my understanding of the Linux VFS is correct, the purpose of release() is for the kernel to inform the VFS implementation that it's done with the given file descriptor, so the VFS can go ahead and free any file descriptor-specific state. This is completely decoupled from the process that called close().
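The descriptor semantics in question can be illustrated with a small sketch. It uses an ordinary file and a single-threaded fork(), not FUSE and not threads, so it does not settle the multi-threaded race; it only shows that close() invalidates the descriptor at once in the calling process while a child forked earlier keeps its own copy. The function name is made up for the example.

```python
import os


def close_in_parent_vs_child(path):
    """Return (parent_fd_invalid, child_fd_still_valid) for one open file."""
    fd = os.open(path, os.O_RDONLY)
    result_r, result_w = os.pipe()  # child reports its result here
    go_r, go_w = os.pipe()          # parent signals "I have closed fd"
    pid = os.fork()
    if pid == 0:
        # Child: wait for the parent's close(), then probe our own copy.
        try:
            os.read(go_r, 1)
            os.fstat(fd)  # raises OSError (EBADF) if fd is invalid
            os.write(result_w, b"1")
        except OSError:
            os.write(result_w, b"0")
        finally:
            os._exit(0)
    os.close(fd)  # takes effect immediately in the parent
    try:
        os.fstat(fd)
        parent_fd_invalid = False
    except OSError:
        parent_fd_invalid = True
    os.write(go_w, b"x")
    child_fd_still_valid = os.read(result_r, 1) == b"1"
    os.waitpid(pid, 0)
    for p in (result_r, result_w, go_r, go_w):
        os.close(p)
    return parent_fd_invalid, child_fd_still_valid
```

Since the child still refers to the same open file, the filesystem's release() would not be expected to run until that last reference is dropped as well.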

If you're going to avoid cgroups for process tracking, I think @CameronNemo's approach is sensible. Even with cgroups, a privileged daemon can place its child in a separate cgroup hierarchy, and a privileged child can place itself out of its parent's cgroup hierarchy. So cgroups alone aren't sufficient to prevent child escapes--you're going to have to audit the daemon to ensure it drops the requisite privileges before it fork()s (or make sure it doesn't start with privileges). But if you're going to go that far, you could also verify that its children never call setsid() :)

OpenRC / openrc

Process containment using FUSE instead of CGROUP #19