privilege escalation via ptrace (CVE-2016-8659)

smcv commented 8 years ago

Sebastian Krahmer reported this to the oss-security mailing list.

There is a beautiful, easy to exploit, logical bug within the
bubblewrap program, thats part of the flatpak container-app framework,
but also used with other container solutions.

/usr/bin/bwrap may be installed mode 04755 or with cap_sys_admin and other
file caps. I dont know if there are any dists already shipping it that way,
but the Makefile and some RedHat spec files contain file caps for it.

bubblewrap's aim is to setup a container and seccomp sandbox for programs to be run
as user.

For some reason it sets the PR_SET_DUMPABLE flag, as seen below. The comment about
it looks strange to me. If thats really true, suid programs shouldn't
be forced to play with the dumpable flag to achieve their goal.

Once the dumpable flag is set, there is a chance we could attach to the process,
once the remaining caps are dropped and the whole process runs as user.

Luckily, that happens at line 1707, right after a PrivSep socket has been opened!

Once attached to the (now running as unprived user) process, we can inject
commands into that socket. We could do arbitrary mounts, but won't achieve much, since
the bwrap process is running in its own mount namespace. However, there is
a sethostname() OP, that we can use to affect the hostname of the entire system
(not restricted to UTS namespace). Now, just wait for root or other users to
login and execute bash to use one of the PS expansion bugs to execute code.

(I wonder that has been re-discovered recently, it was already part of the
CVE-2011-0966 attack vector.)

 383 acquire_caps (void)
 384 {

[...]

 422   /* We need the process to be dumpable, or we can't access /proc/self/uid_map */
 423   if (prctl (PR_SET_DUMPABLE, 1, 0, 0, 0) < 0)
 424     die_with_error ("prctl(PR_SET_DUMPABLE) failed");
 425 }

[...]

1422 int
1423 main (int    argc,
1424       char **argv)
1425 {

[...]

1440   /* Get the (optional) capabilities we need, drop root */
1441   acquire_caps ();

[...]

1692   if (is_privileged)
1693     {
1694       pid_t child;
1695       int privsep_sockets[2];
1696
1697       if (socketpair (AF_UNIX, SOCK_SEQPACKET | SOCK_CLOEXEC, 0, privsep_sockets) != 0)
1698         die_with_error ("Can't create privsep socket");
1699
1700       child = fork ();
1701       if (child == -1)
1702         die_with_error ("Can't fork unprivileged helper");
1703
1704       if (child == 0)
1705         {
1706           /* Unprivileged setup process */
1707           drop_caps ();                                                            // BOOM
1708           close (privsep_sockets[0]);
1709           setup_newroot (opt_unshare_pid, privsep_sockets[1]);
1710           exit (0);
1711         }
1712       else
1713         {
1714           uint32_t buffer[2048];  /* 8k, but is int32 to guarantee nice alignment */
1715           uint32_t op, flags;
1716           const char *arg1, *arg2;
1717           cleanup_fd int unpriv_socket = -1;
1718
1719           unpriv_socket = privsep_sockets[0];
1720           close (privsep_sockets[1]);
1721
1722           do
1723             {
1724               op = read_priv_sec_op (unpriv_socket, buffer, sizeof (buffer),
1725                                      &flags, &arg1, &arg2);
1726               privileged_op (-1, op, flags, arg1, arg2);
1727               if (write (unpriv_socket, buffer, 1) != 1)
1728                 die ("Can't write to op_socket");
1729             }
1730           while (op != PRIV_SEP_OP_DONE);
1731
1732           /* Continue post setup */
1733         }
1734     }

smcv commented 8 years ago

As a short-term fix for Debian, I've reverted #94 so there is no way to induce the privileged process to issue the sethostname() syscall, but I suspect a better long-term answer is to isolate the use of PR_SET_DUMPABLE to precisely where it is needed.

For some reason it sets the PR_SET_DUMPABLE flag, as seen below. The comment about it looks strange to me. If thats really true, suid programs shouldn't be forced to play with the dumpable flag to achieve their goal.

I assume the developers of Bubblewrap wouldn't have done this if the kernel (or at least a kernel they care about) didn't require it. But hopefully becoming dumpable can be restricted to smaller sections of the code.

If it's only for write_uid_gid_map(), then one way would be for that function to fork(), with the child making itself dumpable and writing the map files, and the parent just waiting for the child? ... I believe the intention is that none of the operations that can pass over the privilege separation socket are problematic, because they only affect the sandboxed processes, and if you can ptrace bwrap then you can certainly ptrace and subvert the sandboxed processes too. SETUP_SET_HOSTNAME clearly doesn't match that intention, if Bubblewrap didn't unshare the UTS namespace beforehand; Bubblewrap does insist that --hostname can't be used without --unshare-uts, but you're right that a user outside the sandbox can ptrace it and bypass that check by making it behave as though --hostname had been used.

As a quick temporary fix for Debian, I'm reverting the addition of SETUP_SET_HOSTNAME.

Am I right in thinking that this pseudocode would fix it while reinstating the feature?
case PRIV_SEP_OP_SET_HOSTNAME:
  if (!opt_unshare_uts)
    die_with_error ("Refusing to set hostname in original namespace");
  else if (!sethostname (... as before))
    die_with_error (... as before);

alexlarsson commented 8 years ago

The check described above should be fine for that exact exploit, but we should also try to figure out the minimal part that needs DUMPABLE.

alexlarsson commented 8 years ago

So, for the dumpable, if you remove the code that sets this, then you get:

 bwrap --unshare-user  --bind / / sh
setting up uid map: Permission denied

Because its trying to write to /proc/self/uid_map, which is root owned for a non-dumpable app.

cgwalters commented 8 years ago

See https://github.com/projectatomic/bubblewrap/pull/109

alexlarsson commented 8 years ago

So, here is the issue. When we're using user namespaces we need to have the privileged parent set the uid mapping in the child user namespace, because there is not necessarily a 1-1 mapping. However, the child, as well as the parent are non-dumpable due to the initial setuid bit.

Non-dumpable processes have uid 0 owning /proc/pid/' files though, so the parent (running as the user uid) isn't allowed to set the uid map unless the child is made dumpable.

However, this is only needed in the case when we need user namespaces. If we're not, then being non-dumpable is fine. So, my solution is to only make the child non-dumpable in the case of using user namespaces.

But, i hear you, then we'll be able to ptrace the child in the case of using user namespaces, causing this bug to re-appear! That is not true though, because in a user namespace we lose all capabilities in the parent namespace, so we will not be able to do something affecting the host namespace anyway (i.e. sethostname gets EPERM unless you unshared UTS ns). In fact, there is not much we can do here, because any process in the parent user namespace (i.e. the regular host namespace) have all caps in the user namespace, including CAP_SYS_PTRACE, so we override the value of dumpable anyway.

alexlarsson commented 8 years ago

Here is my approach instead: https://github.com/projectatomic/bubblewrap/pull/110

cgwalters commented 8 years ago

Let's call #109 the "fsuid" patch and #110 the "move-dumpable" patch.

I'd argue that the fsuid patch is a lot shorter, and more obviously correct than move-dumpable. Why is the latter better?

alexlarsson commented 8 years ago

Because the fsuid patch gives somewhat increases privileges (fsuid 0) which are not needed.

alexlarsson commented 8 years ago

In particular it is doing file writes as uid 0.

cgwalters commented 8 years ago

So the risk with fsuid is things like - what if somehow the pid got reused and we happened to write to the uid map for another pid? Another aspect is that suddenly we're trusting /proc to be the real /proc. But the latter is true of all setuid binaries really.

The risk with dumpable is - now all of a sudden we have to be concerned about the fact that someone could ptrace us. And yes, we are in a user namespace...but in fact, your argument:

In fact any process from the parent user
namespace have ptrace access anyway, due to parent ns having
CAP_SYS_PTRACE (and all other caps) in the child ns.

appears to me to be wrong, because the dumpable flag isn't set, and we need that and the capability to ptrace. If I add in to your patch:

diff --git a/bubblewrap.c b/bubblewrap.c
index 5c7999b..06bc029 100644
--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -1652,11 +1652,14 @@ main (int    argc,

   if (is_privileged && opt_unshare_user)
     {
+      fprintf (stderr, "** BEFORE SETTING DUMPABLE; pid=%u **\n", getpid());
+      sleep (15);
       /* We have to be dumpable for the parent to be able to set the
          uid map for us. This enables ptracing for the child, but that
          is not really a major*/
       if (prctl (PR_SET_DUMPABLE, 1, 0, 0, 0) < 0)
         die_with_error ("prctl(PR_SET_DUMPABLE) failed");
+      fprintf (stderr, "** AFTER SETTING DUMPABLE **\n");

       /* Let parent continue to set the uid map */
       val = 1;

And try to strace before setting dumpable, I get EPERM as expected.

So the aspect of being in a user namespace obviously mitigates things a lot...but we need to consider risks such as having an open file descriptor outside of the namespace. I doubt we have anything open that's writable, but we'd have to be careful in the future.

If you're arguing that we can enable dumpable, it follows that everything else after that could be arbitrary user code, and we could do:

--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -1728,7 +1728,7 @@ main (int    argc,
   if (chdir ("/") != 0)
     die_with_error ("chdir / (base path)");

-  if (is_privileged)
+  if (is_privileged && !opt_unshare_user)
     {
       pid_t child;
       int privsep_sockets[2];

Or actually, hypothetically:

@@ -1807,7 +1807,8 @@ main (int    argc,
     die_with_error ("chdir /");

   /* Now we have everything we need CAP_SYS_ADMIN for, so drop it */
-  drop_caps ();
+  if (opt_unshare_user && !opt_retain_caps)
+    drop_caps ();

   if (opt_block_fd != -1)
     {

Which I think is where some of the patches in https://github.com/projectatomic/bubblewrap/pull/101 are going. Basically, in the userns path (whether privileged or not) we rely solely on the kernel for security post-CLONE_NEWUSER.

cgwalters commented 8 years ago

Here's a good way to compare the patches - on kernels vulnerable to https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-3135 (the one we cite in README.md as an example userns vulnerability) - I'm pretty sure your patch exposes the vulnerability, since a caller could inject code to do the iptables syscalls, even though we don't expose that in the binary. But mine does not, because we drop CAP_NET_ADMIN (really all caps) in the namespace before executing the user code.

Basically, your patch makes bubblewrap's security equivalent to that of the full kernel user namespace feature set, whereas mine keeps things restricted to what we expose on the bwrap command line.

(But on the other hand, exposing the full userns feature set is where #101 is going)

smcv commented 8 years ago

Basically, your patch makes bubblewrap's security equivalent to that of the full kernel user namespace feature set, whereas mine keeps things restricted to what we expose on the bwrap command line. ... (But on the other hand, exposing the full userns feature set is where #101 is going)

This sounds like it's heading towards a sysadmin configuration mechanism for "how much exposure to kernel userns bugs am I willing to tolerate?"...

on kernels vulnerable to https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-3135 (the one we cite in README.md as an example userns vulnerability) - I'm pretty sure your patch exposes the vulnerability, since a caller could inject code to do the iptables syscalls, even though we don't expose that in the binary

This sounds to me like a good argument for taking the fsuid approach, at least short-term.

alexlarsson commented 8 years ago

@cgwalters Your test case is wrong, because getpid() is cached so returns the old value Try it again with this instead: fprintf (stderr, "\ BEFORE SETTING DUMPABLE; pid=%u real pid=%ld **\n", getpid(), syscall(SYS_getpid)); sleep (15);

alexlarsson commented 8 years ago

So, yeah, i get your point, but even with the fsuid approach we are ptraceable.

However, i mentioned this to @eparis and eric biederman, and he said:

ebiederm: alex, eparis That corner case where we let user namespace caps override dumpable looks buggy. I am not certain how that got in there.

So, maybe this will change over time.

cgwalters commented 8 years ago

Looks like it changed in http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8409cca7056113bee3236cb6a8e4d8d4d1eef102

alexlarsson commented 8 years ago

Maybe the end result is that we just can't expose user namespaces securely (above full kernel access to user namespaces) at all?

cgwalters commented 8 years ago

OK, let's just paper over this rabbit hole for now and try to replace it after the release with something stronger. I'm fine going with #110 - I'll close #109 .

cgwalters commented 8 years ago

Closed via #110

cgwalters commented 8 years ago

https://github.com/projectatomic/bubblewrap/releases/tag/v0.1.3

containers / bubblewrap

privilege escalation via ptrace (CVE-2016-8659) #107