Investigate running the jailer with reduced set of capabilities

alexandruag commented 5 years ago

We currently start the jailer as the superuser (i.e. using sudo), and rely on the fact the process will deprivilege itself before exec-ing into Firecracker. It would be interesting to know if we can run the jailer using a more restricted set of capabilities instead of full superuser mode.

iulianbarbu commented 5 years ago

Investigated and found the following:

CAP_SYS_ADMIN and CAP_MKNOD capabilities are enough for running jailer without being superuser and without the flag --daemonize (CAP_SYS_ADMIN is needed for setns call from jailer/src/env.rs [1], and CAP_MKNOD for mknod calls from the same file [2])
the flag --daemonize leads to a setsid call, in the context of daemonizing the jailer, before executing firecracker as a child. The setsid call, as it stands in the documentation [3], has to be done by a process which has its PGID != PID, and which has no terminal attached. As it is right now, the jailer process is not entirely daemonized, because, by definition, a daemon process should have init as its parent, and the jailer process, started by a process different than init (e.g. bash shell), will not have init as its parent while its parent process does not call exit [*].

An interesting use case is that when starting the jailer using a session leader process which holds a tty or pts. Every new process started by the session leader (e.g a shell with job control -signaling, send to background, bring to foreground, killing entire process groups), either foreground process or background process, will be the owner of its process group, and it will have PID equal to PGID. [4]

This interferes with the way the jailer process becomes daemon in the current implementation of the jailer, because it tries to setsid when PID is equal with PGID, in the context of a non-superuser, and this leads to error. To change the PGID of a process, to make it different than PID, we can use setpgid [5]. However, setpgid comes with some restrictions, which makes it hard to change the PGID of the jailer process (for example, the new PGID has to be an existing process group from the same session group).

[*] Important notes for starting a daemon is closing standard input, output and err, and changing the session id, but all of these after a fork, in the child process. The parent process should make an exit call after forking, which leads to init picking up the orphaned forked process.

[1] http://man7.org/linux/man-pages/man2/setns.2.html [2] http://man7.org/linux/man-pages/man2/mknod.2.html [3] https://linux.die.net/man/2/setsid [4]http://poincare.matf.bg.ac.rs/~ivana/courses/ps/sistemi_knjige/pomocno/apue/APUE/0201433079/ch09lev1sec6.html [5] http://man7.org/linux/man-pages/man2/setpgid.2.html

alxiord commented 4 years ago

To ship this in production, we could give jailer the necessary capabilities as the last step in devtool build. As easy as

sudo setcap CAP_MKNOD,CAP_SYS_ADMIN+ep $cargo_bin_dir/jailer

The problems start to appear when adapting the integration tests. Now, the main reason why the integ tests are run as root is because jailer needs it. Eliminating this need, we should also "downgrade" pytest to run as an unprivileged user. However, pytest doesn't use the binaries generated by devtool build, instead it builds its own. The setcap invocation should be added here.

But: pytest is run inside the container, which would no longer be privileged, so it couldn't do setcap. Furthermore, the test setup often implies creating tap devices, which also wouldn't work unprivileged. We'd need to prepend each privilege-hungry command with sudo.

But: we don't have sudo in the container. Of course, we could add it in the Docker image, and add an unprivileged user to the image as well, and in the sudoers group.

Final but: the artefacts generated via the container (built binaries, caches, anything else) are currently owned by the user who launches devtool, and that's terribly convenient. If we were to chown everything inside the container with a different uid/gid, all these artefacts would be removed from the ownership of the user who triggered the build. That's cumbersome.

@firecracker-microvm/compute-capsule what are your thoughts on this conundrum?

bodenr commented 4 years ago

There doesn't appear to be much traction on this topic, but a few notes based on my recent experience while getting jailer/firecracker to run as non-root:

(a) The user jailer is run as must have access to the firecracker cgroups (/sys/fs/cgroup/cpu/firecracker, /sys/fs/cgroup/pids/firecracker and /sys/fs/cgroup/cpuset/firecracker) or if these cgroups don't exist the jailer user must have permissions to create them (can be done with cap_dac_override). (b) The jailer binary must have the capabilities mentioned in the previous comment (cap_sys_admin,cap_mknod). (c) The user/group id specified to run firecracker as must be the same as the user jailer is run as; otherwise firecracker does not have permission to chmod the api.socket for vm communication.

While items (a)+(b) are manageable, item (c) is a bit restrictive since it maybe the case that consumers want to run the jailer and firecracker as separate user IDs.

AlexandruCihodaru commented 3 years ago

We moved priority to low, because this is something that will make jailer safer to use by default but does not change threat model of Firecracker

alsrdn commented 2 years ago

some more discussion around this issue can be found here: https://github.com/firecracker-microvm/firecracker/issues/2785

kiranchandrasekhar commented 1 year ago

Hello, I am representing a group of undergraduate UT Austin students aiming to contribute to open-source virtualization projects on behalf of a virtualization course, and this looks like a good issue we could look into. Could we have this issue assigned to us?

firecracker-microvm / firecracker

Investigate running the jailer with reduced set of capabilities #1190