Open beaufortfrancois opened 6 years ago
Thanks for this feature request!
I think that's because our Docker containers don't have the CAP_SYS_ADMIN capability, for security reasons.
This also prevents Firefox from running with a sandbox (which it apparently does in Debug mode, as @whimboo found out), and it also prevents us from using rr in our containers.
I don't think we'll want to add CAP_SYS_ADMIN to all Janitor containers (because this allows becoming root on the host), but maybe we could grant it to certain trusted containers, on a case-by-case basis, to enable the valuable use cases listed above?
@notriddle what do you think?
Neither Firefox sandboxing nor Chromium's namespace sandbox should need capabilities in the namespace they're launched in (nor any enclosing namespace), but they do need to be able to create new user namespaces.
Normally this is allowed for unprivileged users, but there are concerns about it due to the possibility of exposing exploitable kernel bugs that unprivileged callers normally couldn't reach, so sandboxes usually block those system calls. That seems to be what's going on in Mozilla bug 1430756: unshare(0) is a no-op that's normally allowed unconditionally, but it fails.
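One way to check this from inside a container is to call unshare directly and look at the errno. The following is a minimal sketch, not taken from the Firefox or Chromium sources: the helper names probe_unshare and describe are made up for illustration, and it assumes Linux with a libc that exports unshare. The call runs in a forked child so the probing process itself never changes namespaces.

```python
import ctypes
import errno
import os

CLONE_NEWUSER = 0x10000000  # value from <linux/sched.h>


def probe_unshare(flags):
    """Call unshare(flags) in a forked child.

    Returns 0 on success, the errno value on failure, or a negative
    signal number if the child was killed (e.g. by a seccomp filter).
    """
    libc = ctypes.CDLL(None, use_errno=True)
    unshare = libc.unshare  # resolve the symbol before forking
    pid = os.fork()
    if pid == 0:
        code = 255  # reported if anything unexpected happens in the child
        try:
            code = 0 if unshare(flags) == 0 else ctypes.get_errno()
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


def describe(code):
    """Turn a probe_unshare() result into a short human-readable string."""
    if code == 0:
        return "allowed"
    if code < 0:
        return "killed by signal %d" % -code
    return "failed: %s" % errno.errorcode.get(code, str(code))


if __name__ == "__main__":
    print("unshare(0):            ", describe(probe_unshare(0)))
    print("unshare(CLONE_NEWUSER):", describe(probe_unshare(CLONE_NEWUSER)))
```

If unshare(0) reports EPERM here, that would suggest a syscall filter is rejecting the call outright, since a call with no flags requests nothing that a capability check could deny.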
Docker's documentation mentions a seccomp-bpf policy that would do this. It also links to the policy, in a JSON format, which mentions allowing the syscalls in question in connection with CAP_SYS_ADMIN, and I think what's going on here is that the seccomp-bpf program varies based on the capabilities granted to the container. But, if I'm right about this, it should be possible to edit that profile to allow unshare and clone normally, without capabilities.
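If that reading is right, the edit could be as small as appending one unconditional allow rule to the profile. The sketch below works on the JSON structure used by Docker's seccomp profiles (a top-level "syscalls" list whose entries carry "names", "action" and optional "args"/"includes" keys). The function name allow_unconditionally and the stand-in profile are hypothetical, and whether granting these syscalls to every container is safe is exactly the open question in this thread.

```python
import json


def allow_unconditionally(profile, names):
    """Return a copy of a seccomp profile with an extra rule that allows
    `names` regardless of the container's capabilities."""
    patched = json.loads(json.dumps(profile))  # cheap deep copy of JSON data
    patched.setdefault("syscalls", []).append({
        "names": sorted(names),
        "action": "SCMP_ACT_ALLOW",
        "args": [],
    })
    return patched


if __name__ == "__main__":
    # Tiny stand-in for the real profile: unshare is only allowed
    # when the container has CAP_SYS_ADMIN.
    profile = {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{
            "names": ["unshare"],
            "action": "SCMP_ACT_ALLOW",
            "args": [],
            "includes": {"caps": ["CAP_SYS_ADMIN"]},
        }],
    }
    patched = allow_unconditionally(profile, {"unshare", "clone"})
    print(json.dumps(patched, indent=2))
```

The patched JSON would then be written to a file and passed to the container with docker run --security-opt seccomp=<file>.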
Yeah, unshare and clone are allowed without ADMIN. It's only setns that requires a privileged container.
Thank you for these details! https://github.com/jessfraz/dockerfiles/issues/65#issuecomment-145731454 prompted me to consult man clone, which seems to indicate that CAP_SYS_ADMIN is required for the following flags:
Also, man unshare seems to indicate that some unshare options are associated with some clone flags (although I don't know if that means they need CAP_SYS_ADMIN to work or not):
I guess my questions here are:
1. Does "Failed to move to new namespace: PID namespaces supported, Network namespace supported, but failed: errno = Operation not permitted" mean that it unsuccessfully tried to use setns, clone or unshare?
2. If unshare is really allowed without ADMIN, then why is Firefox's sandbox choking on "unshare nothing"? https://searchfox.org/mozilla-central/source/security/sandbox/linux/SandboxInfo.cpp#168 [0]
3. Can we allow unshare and clone for every container, without requiring CAP_SYS_ADMIN, or is this too dangerous from a security standpoint?

[0] This Docker seccomp profile page linked by @jld mentions that for unshare it will "Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN, with the exception of unshare --user." Maybe Firefox is using unshare (gated by ADMIN) instead of unshare --user (not gated)?
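The gated and ungated cases can be compared with the unshare command-line tool from util-linux, assuming it is installed in the container. This is a rough sketch: try_unshare is a made-up helper, and the flag combinations are only examples of what one might probe.

```python
import shutil
import subprocess


def try_unshare(*flags):
    """Run `unshare <flags> true`.

    Returns None if the unshare tool is not installed, otherwise a
    (exit_code, stderr_text) tuple.
    """
    exe = shutil.which("unshare")
    if exe is None:
        return None
    proc = subprocess.run(
        [exe, *flags, "true"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    return proc.returncode, proc.stderr.strip()


if __name__ == "__main__":
    for flags in (["--user"], ["--net"], ["--pid", "--fork"], ["--user", "--net"]):
        print(" ".join(flags), "->", try_unshare(*flags))
```

A zero exit code for --user alongside "Operation not permitted" for the other flags would be consistent with the gating described in the Docker page quoted above.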
Random note, https://github.com/docker/docker-bench-security and Lynis can help us audit the security of our Docker configurations and dockerfiles.
Other random note, this Docker docs page says:
By default Docker drops all capabilities except those needed, a whitelist instead of a blacklist approach. You can see a full list of available capabilities in Linux manpages.
Now we just need to know which capabilities we need to grant to our containers to support gdb, rr and Firefox/Chromium namespace-changing sandboxes, and if granting them is reasonably secure, or if we should only grant them to a select few containers upon special request.
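To see what a given container actually holds, the effective capability mask can be read from /proc/self/status and tested bit by bit. The sketch below is only an illustration: parse_cap_eff and has_cap are hypothetical helper names, and checking CAP_SYS_PTRACE is my assumption about what ptrace-based tools such as gdb might care about (the thread itself only names CAP_SYS_ADMIN).

```python
CAP_SYS_PTRACE = 19  # bit numbers from <linux/capability.h>
CAP_SYS_ADMIN = 21


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, bit):
    """Return True if capability number `bit` is set in `mask`."""
    return bool((mask >> bit) & 1)


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        mask = parse_cap_eff(f.read())
    print("CAP_SYS_ADMIN: ", has_cap(mask, CAP_SYS_ADMIN))
    print("CAP_SYS_PTRACE:", has_cap(mask, CAP_SYS_PTRACE))
```

Running this inside a container before and after adding a capability with docker run --cap-add shows whether the grant took effect.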
If the Docker image supports namespace changes, I should be able to run Chrome with proper sandboxing.
Background thread: