Open alexandruag opened 5 years ago
Investigated and found the following:
CAP_SYS_ADMIN and CAP_MKNOD capabilities are enough for running jailer without being superuser and without the flag --daemonize (CAP_SYS_ADMIN is needed for setns call from jailer/src/env.rs [1], and CAP_MKNOD for mknod calls from the same file [2])
the flag --daemonize leads to a setsid call, in the context of daemonizing the jailer, before executing firecracker as a child. The setsid call, as it stands in the documentation [3], has to be done by a process which has its PGID != PID, and which has no terminal attached. As it is right now, the jailer process is not entirely daemonized, because, by definition, a daemon process should have init as its parent, and the jailer process, started by a process different than init (e.g. bash shell), will not have init as its parent while its parent process does not call exit [*].
An interesting use case is that when starting the jailer using a session leader process which holds a tty or pts. Every new process started by the session leader (e.g a shell with job control -signaling, send to background, bring to foreground, killing entire process groups), either foreground process or background process, will be the owner of its process group, and it will have PID equal to PGID. [4]
This interferes with the way the jailer process becomes daemon in the current implementation of the jailer, because it tries to setsid when PID is equal with PGID, in the context of a non-superuser, and this leads to error. To change the PGID of a process, to make it different than PID, we can use setpgid [5]. However, setpgid comes with some restrictions, which makes it hard to change the PGID of the jailer process (for example, the new PGID has to be an existing process group from the same session group).
[*] Important notes for starting a daemon is closing standard input, output and err, and changing the session id, but all of these after a fork, in the child process. The parent process should make an exit call after forking, which leads to init picking up the orphaned forked process.
[1] http://man7.org/linux/man-pages/man2/setns.2.html [2] http://man7.org/linux/man-pages/man2/mknod.2.html [3] https://linux.die.net/man/2/setsid [4]http://poincare.matf.bg.ac.rs/~ivana/courses/ps/sistemi_knjige/pomocno/apue/APUE/0201433079/ch09lev1sec6.html [5] http://man7.org/linux/man-pages/man2/setpgid.2.html
To ship this in production, we could give jailer
the necessary capabilities as the last step in devtool build
. As easy as
sudo setcap CAP_MKNOD,CAP_SYS_ADMIN+ep $cargo_bin_dir/jailer
The problems start to appear when adapting the integration tests. Now, the main reason why the integ tests are run as root is because jailer
needs it. Eliminating this need, we should also "downgrade" pytest
to run as an unprivileged user. However, pytest
doesn't use the binaries generated by devtool build
, instead it builds its own. The setcap
invocation should be added here.
But: pytest
is run inside the container, which would no longer be privileged, so it couldn't do setcap
. Furthermore, the test setup often implies creating tap devices, which also wouldn't work unprivileged. We'd need to prepend each privilege-hungry command with sudo
.
But: we don't have sudo
in the container. Of course, we could add it in the Docker image, and add an unprivileged user to the image as well, and in the sudoers group.
Final but: the artefacts generated via the container (built binaries, caches, anything else) are currently owned by the user who launches devtool
, and that's terribly convenient. If we were to chown
everything inside the container with a different uid/gid, all these artefacts would be removed from the ownership of the user who triggered the build. That's cumbersome.
@firecracker-microvm/compute-capsule what are your thoughts on this conundrum?
There doesn't appear to be much traction on this topic, but a few notes based on my recent experience while getting jailer/firecracker to run as non-root:
(a) The user jailer is run as must have access to the firecracker cgroups (/sys/fs/cgroup/cpu/firecracker
, /sys/fs/cgroup/pids/firecracker
and /sys/fs/cgroup/cpuset/firecracker
) or if these cgroups don't exist the jailer user must have permissions to create them (can be done with cap_dac_override
).
(b) The jailer binary must have the capabilities mentioned in the previous comment (cap_sys_admin,cap_mknod
).
(c) The user/group id specified to run firecracker as must be the same as the user jailer is run as; otherwise firecracker does not have permission to chmod the api.socket for vm communication.
While items (a)+(b) are manageable, item (c) is a bit restrictive since it maybe the case that consumers want to run the jailer and firecracker as separate user IDs.
We moved priority to low, because this is something that will make jailer safer to use by default but does not change threat model of Firecracker
some more discussion around this issue can be found here: https://github.com/firecracker-microvm/firecracker/issues/2785
Hello, I am representing a group of undergraduate UT Austin students aiming to contribute to open-source virtualization projects on behalf of a virtualization course, and this looks like a good issue we could look into. Could we have this issue assigned to us?
We currently start the
jailer
as the superuser (i.e. usingsudo
), and rely on the fact the process will deprivilege itself beforeexec
-ing into Firecracker. It would be interesting to know if we can run thejailer
using a more restricted set of capabilities instead of full superuser mode.