Closed smcv closed 8 years ago
As a short-term fix for Debian, I've reverted #94 so there is no way to induce the privileged process to issue the sethostname()
syscall, but I suspect a better long-term answer is to isolate the use of PR_SET_DUMPABLE
to precisely where it is needed.
See also http://www.openwall.com/lists/oss-security/2016/10/13/4
For some reason it sets the PR_SET_DUMPABLE flag, as seen below. The comment about it looks strange to me. If thats really true, suid programs shouldn't be forced to play with the dumpable flag to achieve their goal.
I assume the developers of Bubblewrap wouldn't have done this if the kernel (or at least a kernel they care about) didn't require it. But hopefully becoming dumpable can be restricted to smaller sections of the code.
If it's only for write_uid_gid_map(), then one way would be for that function to fork(), with the child making itself dumpable and writing the map files, and the parent just waiting for the child? ... I believe the intention is that none of the operations that can pass over the privilege separation socket are problematic, because they only affect the sandboxed processes, and if you can ptrace bwrap then you can certainly ptrace and subvert the sandboxed processes too. SETUP_SET_HOSTNAME clearly doesn't match that intention, if Bubblewrap didn't unshare the UTS namespace beforehand; Bubblewrap does insist that --hostname can't be used without --unshare-uts, but you're right that a user outside the sandbox can ptrace it and bypass that check by making it behave as though --hostname had been used.
As a quick temporary fix for Debian, I'm reverting the addition of SETUP_SET_HOSTNAME.
Am I right in thinking that this pseudocode would fix it while reinstating the feature?
case PRIV_SEP_OP_SET_HOSTNAME: if (!opt_unshare_uts) die_with_error ("Refusing to set hostname in original namespace"); else if (!sethostname (... as before)) die_with_error (... as before);
The check described above should be fine for that exact exploit, but we should also try to figure out the minimal part that needs DUMPABLE.
So, for the dumpable, if you remove the code that sets this, then you get:
bwrap --unshare-user --bind / / sh
setting up uid map: Permission denied
Because its trying to write to /proc/self/uid_map, which is root owned for a non-dumpable app.
So, here is the issue. When we're using user namespaces we need to have the privileged parent set the uid mapping in the child user namespace, because there is not necessarily a 1-1 mapping. However, the child, as well as the parent are non-dumpable due to the initial setuid bit.
Non-dumpable processes have uid 0 owning /proc/pid/' files though, so the parent (running as the user uid) isn't allowed to set the uid map unless the child is made dumpable.
However, this is only needed in the case when we need user namespaces. If we're not, then being non-dumpable is fine. So, my solution is to only make the child non-dumpable in the case of using user namespaces.
But, i hear you, then we'll be able to ptrace the child in the case of using user namespaces, causing this bug to re-appear! That is not true though, because in a user namespace we lose all capabilities in the parent namespace, so we will not be able to do something affecting the host namespace anyway (i.e. sethostname gets EPERM unless you unshared UTS ns). In fact, there is not much we can do here, because any process in the parent user namespace (i.e. the regular host namespace) have all caps in the user namespace, including CAP_SYS_PTRACE, so we override the value of dumpable anyway.
Here is my approach instead: https://github.com/projectatomic/bubblewrap/pull/110
Let's call #109 the "fsuid" patch and #110 the "move-dumpable" patch.
I'd argue that the fsuid patch is a lot shorter, and more obviously correct than move-dumpable. Why is the latter better?
Because the fsuid patch gives somewhat increases privileges (fsuid 0) which are not needed.
In particular it is doing file writes as uid 0.
So the risk with fsuid is things like - what if somehow the pid got reused and we happened to write to the uid map for another pid? Another aspect is that suddenly we're trusting /proc
to be the real /proc
. But the latter is true of all setuid binaries really.
The risk with dumpable is - now all of a sudden we have to be concerned about the fact that someone could ptrace us. And yes, we are in a user namespace...but in fact, your argument:
In fact any process from the parent user
namespace have ptrace access anyway, due to parent ns having
CAP_SYS_PTRACE (and all other caps) in the child ns.
appears to me to be wrong, because the dumpable flag isn't set, and we need that and the capability to ptrace. If I add in to your patch:
diff --git a/bubblewrap.c b/bubblewrap.c
index 5c7999b..06bc029 100644
--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -1652,11 +1652,14 @@ main (int argc,
if (is_privileged && opt_unshare_user)
{
+ fprintf (stderr, "** BEFORE SETTING DUMPABLE; pid=%u **\n", getpid());
+ sleep (15);
/* We have to be dumpable for the parent to be able to set the
uid map for us. This enables ptracing for the child, but that
is not really a major*/
if (prctl (PR_SET_DUMPABLE, 1, 0, 0, 0) < 0)
die_with_error ("prctl(PR_SET_DUMPABLE) failed");
+ fprintf (stderr, "** AFTER SETTING DUMPABLE **\n");
/* Let parent continue to set the uid map */
val = 1;
And try to strace before setting dumpable, I get EPERM
as expected.
So the aspect of being in a user namespace obviously mitigates things a lot...but we need to consider risks such as having an open file descriptor outside of the namespace. I doubt we have anything open that's writable, but we'd have to be careful in the future.
If you're arguing that we can enable dumpable, it follows that everything else after that could be arbitrary user code, and we could do:
--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -1728,7 +1728,7 @@ main (int argc,
if (chdir ("/") != 0)
die_with_error ("chdir / (base path)");
- if (is_privileged)
+ if (is_privileged && !opt_unshare_user)
{
pid_t child;
int privsep_sockets[2];
Or actually, hypothetically:
@@ -1807,7 +1807,8 @@ main (int argc,
die_with_error ("chdir /");
/* Now we have everything we need CAP_SYS_ADMIN for, so drop it */
- drop_caps ();
+ if (opt_unshare_user && !opt_retain_caps)
+ drop_caps ();
if (opt_block_fd != -1)
{
Which I think is where some of the patches in https://github.com/projectatomic/bubblewrap/pull/101 are going. Basically, in the userns path (whether privileged or not) we rely solely on the kernel for security post-CLONE_NEWUSER
.
Here's a good way to compare the patches - on kernels vulnerable to https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-3135 (the one we cite in README.md
as an example userns vulnerability) - I'm pretty sure your patch exposes the vulnerability, since a caller could inject code to do the iptables syscalls, even though we don't expose that in the binary. But mine does not, because we drop CAP_NET_ADMIN
(really all caps) in the namespace before executing the user code.
Basically, your patch makes bubblewrap's security equivalent to that of the full kernel user namespace feature set, whereas mine keeps things restricted to what we expose on the bwrap command line.
(But on the other hand, exposing the full userns feature set is where #101 is going)
Basically, your patch makes bubblewrap's security equivalent to that of the full kernel user namespace feature set, whereas mine keeps things restricted to what we expose on the bwrap command line. ... (But on the other hand, exposing the full userns feature set is where #101 is going)
This sounds like it's heading towards a sysadmin configuration mechanism for "how much exposure to kernel userns bugs am I willing to tolerate?"...
on kernels vulnerable to https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-3135 (the one we cite in README.md as an example userns vulnerability) - I'm pretty sure your patch exposes the vulnerability, since a caller could inject code to do the iptables syscalls, even though we don't expose that in the binary
This sounds to me like a good argument for taking the fsuid approach, at least short-term.
@cgwalters Your test case is wrong, because getpid() is cached so returns the old value Try it again with this instead: fprintf (stderr, "\ BEFORE SETTING DUMPABLE; pid=%u real pid=%ld **\n", getpid(), syscall(SYS_getpid)); sleep (15);
So, yeah, i get your point, but even with the fsuid approach we are ptraceable.
However, i mentioned this to @eparis and eric biederman, and he said:
ebiederm: alex, eparis That corner case where we let user namespace caps override dumpable looks buggy. I am not certain how that got in there.
So, maybe this will change over time.
Maybe the end result is that we just can't expose user namespaces securely (above full kernel access to user namespaces) at all?
OK, let's just paper over this rabbit hole for now and try to replace it after the release with something stronger. I'm fine going with #110 - I'll close #109 .
Closed via #110
Sebastian Krahmer reported this to the oss-security mailing list.