Closed apyrgio closed 4 months ago
Since the subject of Linux user namespaces is very tricky, I'll dump here what I have understood so far. Hopefully this will help in the review process, or when we want to backtrack in case we've made a mistake.
References:
Linux user namespaces were introduced in Linux kernel 3.8. They look similar to PID namespaces, where PID 1 inside the namespace is mapped to a different PID outside the namespace. However, they are trickier than that, as they are also a namespace for user capabilities. Due to their sensitive nature, several OSes kept them disabled for years after their inclusion, until they reached a stable status.
Let's demystify them:
User namespaces are more than just namespaces for UIDs and GIDs. They are also a namespace for user capabilities (see capabilities(7)), i.e., what makes a user root. We won't touch on this subject here.
All users (unless restricted by system configuration) can create a user namespace. User namespaces are basically a mapping between UIDs/GIDs inside the namespace, and UIDs/GIDs outside the namespace:
<UID in namespace> <UID in parent namespace> <range>
<UID in namespace> <UID in parent namespace> <range>
...
Examples:
# UID 0 in the namespace maps to UID 0 in the parent namespace, UID 1000 in the
# namespace maps to UID 1000 in the parent namespace, and that's all.
0 0 1
1000 1000 1
# UID 0 in the namespace maps to UID 0 in the parent namespace, UID 1 in the
# namespace maps to UID 1 in the parent namespace, and so forth up until UID
# 999 -> 999.
0 0 1000
# UID 0 in the namespace maps to UID 100000 in the parent namespace, UID 1 in the
# namespace maps to UID 100001 in the parent namespace, and so forth up until
# UID 65535 -> 165535. This is a pretty typical configuration.
0 100000 65536
# UID 1001 in the namespace maps to UID 1000 in the parent namespace, and that's
# all.
1001 1000 1
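To make the mapping format above concrete, here is a minimal Python sketch that parses mapping lines and translates an ID inside the namespace to the parent namespace. The names `parse_id_map` and `to_parent` are hypothetical helpers made up for illustration; this is not how the kernel implements the lookup, only the same arithmetic.

```python
from typing import List, Optional, Tuple

# (first ID inside the namespace, first ID in the parent namespace, range length)
IdMap = List[Tuple[int, int, int]]


def parse_id_map(text: str) -> IdMap:
    """Parse uid_map/gid_map-style text, skipping blank lines and '#' comments."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        inside, outside, count = (int(field) for field in line.split())
        entries.append((inside, outside, count))
    return entries


def to_parent(id_map: IdMap, inner_id: int) -> Optional[int]:
    """Translate an ID inside the namespace to the parent namespace.

    Returns None when no range covers the ID, i.e., when it is unmapped.
    """
    for inside, outside, count in id_map:
        if inside <= inner_id < inside + count:
            return outside + (inner_id - inside)
    return None


typical = parse_id_map("0 100000 65536")
print(to_parent(typical, 0))      # 100000
print(to_parent(typical, 65535))  # 165535
print(to_parent(typical, 65536))  # None (outside the mapped range)

single = parse_id_map("1001 1000 1")
print(to_parent(single, 1001))    # 1000
```

Returning `None` for an unmapped ID is a choice of this sketch, so that the caller can decide how to present unmapped IDs.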
This mapping is available through `/proc/self/{u,g}id_map`. For the root namespace, this mapping is a dummy one (all UIDs in the namespace map to the same UIDs in the parent namespace), but for the created namespace, the mapping is empty by default:
$ cat /proc/self/{u,g}id_map
0 0 4294967295
0 0 4294967295
$ unshare -U cat /proc/self/{u,g}id_map
For user namespaces with empty mappings, we need to keep some things in mind:

- Unmapped IDs appear as the overflow UID/GID (see `/proc/sys/kernel/overflowuid`), which by default is `nobody`/`65534`. If a user namespace has no mapping, all IDs in that namespace will show up as `nobody`.
- [...] `nobody`. This means that they can see the files that a user in the parent namespace can see.
- [...] `chown`), even though they have a UID of their own, because the kernel cannot translate it to a UID in the parent namespace.

This mapping is writable only by processes with sufficient rights, and only once (see `user_namespaces(7)`).
I think that the simplest mapping that can exist is just assigning a container UID to the user's UID in the parent namespace. Anything more than that essentially requires root permissions.
Once a mapping exists, then:
References:
Let's see how rootless Podman deals with user namespaces.
When Podman creates a new user namespace, it needs to assign a UID mapping to it. Since it's rootless though, it's not easy to do so, because it doesn't have the necessary capabilities. That's where the `new{u,g}idmap` binaries come into play. They are setuid binaries (verify this with either `ls -l $(which newuidmap)` or `getcap $(which newuidmap)`) which consult the `/etc/sub{u,g}id` files (which are writable only by root) and assign the mapping. These files have a different format than `/proc/self/uid_map`:
<username/UID>:<start of subordinate UIDs>:<count>
<username/UID>:<start of subordinate UIDs>:<count>
...
Basically, they define the range of host UIDs (subordinate UIDs) that a user has at their disposal when creating a container. A range like `user:100000:65536` means that the user can specify a UID mapping in the container like `0 100000 65536`.
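As a rough illustration of the file format above, the following Python sketch collects the subordinate ranges that belong to one user. The function name `subordinate_ranges` and the sample entries are made up for this example, and skipping lines that do not have exactly three fields is an assumption of the sketch, not documented behavior of `new{u,g}idmap`.

```python
from typing import List, Tuple


def subordinate_ranges(subid_text: str, username: str, uid: int) -> List[Tuple[int, int]]:
    """Return the (start, count) ranges that /etc/subuid-style text grants to a user.

    An entry matches if its first field is either the username or the numeric UID.
    """
    ranges = []
    for line in subid_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) != 3:
            # Assumption of this sketch: ignore blank or malformed lines.
            continue
        owner, start, count = fields
        if owner in (username, str(uid)):
            ranges.append((int(start), int(count)))
    return ranges


# Hypothetical file contents; note the two entries for the same user.
sample = """\
user:100000:65536
other:165536:65536
user:300000:1000
"""
print(subordinate_ranges(sample, "user", 1000))
# [(100000, 65536), (300000, 1000)]
```

The same function works for `/etc/subgid`-style text, since both files share the format.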
If there are no `/etc/sub{u,g}id` files, then the default mapping is:
$ podman unshare cat /proc/self/uid_map
0 1000 1
That is, the root in the container maps to the user outside the container, which is the most that the Linux kernel allows. If these files do exist though (e.g., with an entry like `user:100000:65536`), the default mapping is:
$ podman unshare cat /proc/self/uid_map
0 1000 1
1 100000 65536
Essentially, the root user in the container maps to the user outside the container, and every other UID in the container maps to UIDs >= 100000 in the host. Also note that Podman will create a single user namespace per user, so these mappings are shared between all rootless containers.
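The two default mappings shown above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not Podman's code: `default_rootless_map` is a hypothetical name, and laying out several subordinate ranges back to back is an assumption of this example.

```python
from typing import List, Tuple


def default_rootless_map(
    user_id: int, sub_ranges: List[Tuple[int, int]]
) -> List[Tuple[int, int, int]]:
    """Build (inside, outside, count) entries for a rootless user namespace.

    ID 0 in the namespace maps to the invoking user. Subordinate ranges, if
    any, are placed right after it, starting at ID 1.
    """
    id_map = [(0, user_id, 1)]
    next_inside = 1
    for start, count in sub_ranges:
        id_map.append((next_inside, start, count))
        next_inside += count
    return id_map


# No /etc/subuid entry for the user: only root in the namespace is mapped.
print(default_rootless_map(1000, []))
# [(0, 1000, 1)]

# With an entry like user:100000:65536.
print(default_rootless_map(1000, [(100000, 65536)]))
# [(0, 1000, 1), (1, 100000, 65536)]
```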
Podman has several options to control the mapping (see https://docs.podman.io/en/latest/markdown/podman-run.1.html#userns-mode). Let's see some in action:
# --userns="" (or no --userns passed)
$ podman run -it --rm docker.io/library/alpine:edge cat /proc/self/uid_map
0 1000 1
1 100000 65536
# --userns keep-id
$ podman run -it --rm --userns keep-id docker.io/library/alpine:edge cat /proc/self/uid_map
0 1 1000
1000 0 1
1001 1001 64536
In the first case, we see that the container root maps to the user that started the container, and all UIDs after that match the subordinate UIDs of the user in `/etc/subuid`.
In the second case, we notice something weird. The root of the container maps to host UID 1, and UID 1000 within the container maps to host UID 0. [...] If `/etc/subuid` contains `user:100000:65536`, the above can be translated to:
# --userns keep-id (translated)
$ podman run -it --rm --userns keep-id docker.io/library/alpine:edge cat /proc/self/uid_map
0 100000 1000 # root in the container maps to 1st subordinate UID (100000) up to 100999
1000 1000 1 # 1000 in the container maps to user in the host (1000)
1001 101000 64536 # 1001 in the container maps to 1000th subordinate UID (101000) up to 165535
To make translation easier, one can check the UID mapping from the parent namespace, where they'll get the proper values.
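The translation above can also be done mechanically, by pushing every range of the container's mapping through the mapping of its parent namespace. The sketch below assumes that the parent is the rootless user namespace with the mapping `0 1000 1` / `1 100000 65536` shown earlier; `translate_map` is a hypothetical helper written for this example.

```python
from typing import List, Tuple

IdMap = List[Tuple[int, int, int]]  # (inside, outside, count)


def translate_map(child_map: IdMap, parent_map: IdMap) -> IdMap:
    """Rewrite child_map so that its 'outside' IDs are expressed as host IDs.

    A child range is split whenever it crosses the end of a parent range.
    """
    result = []
    for inside, outside, count in child_map:
        offset = 0
        while offset < count:
            middle_id = outside + offset  # ID as seen in the parent namespace
            for p_inside, p_outside, p_count in parent_map:
                if p_inside <= middle_id < p_inside + p_count:
                    # Longest run that stays within this parent range.
                    run = min(count - offset, p_inside + p_count - middle_id)
                    result.append(
                        (inside + offset, p_outside + (middle_id - p_inside), run)
                    )
                    offset += run
                    break
            else:
                raise ValueError(f"ID {middle_id} is unmapped in the parent namespace")
    return result


keep_id_map = [(0, 1, 1000), (1000, 0, 1), (1001, 1001, 64536)]
rootless_map = [(0, 1000, 1), (1, 100000, 65536)]
for entry in translate_map(keep_id_map, rootless_map):
    print(*entry)
# 0 100000 1000
# 1000 1000 1
# 1001 101000 64536
```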
In the above examples, we see that either the root or the user within the container maps to the user outside the container (1000). We can circumvent this with `--uidmap 0:1:65536 --gidmap 0:1:65536`, which maps the root of the container to the 1st subordinate UID (e.g., 100000), and the rest of the UIDs follow suit. Alternatively, users can pass `--userns nomap`, but it's only present in recent versions.
Problems with insufficient UID/GID mappings will occur either when pulling an OCI image, or when creating a copy of a layer when attempting to run a container from an image.
Now that we've seen how Linux User Namespaces work, and how Podman handles them, let's see how Dangerzone should handle them.
We'll start with some requirements and how we can cover them for Dangerzone:
The reason is that we don't want any container escape to have any effect on the host. The escaped user should effectively be treated as `nobody`.
The best way to achieve this is to use `--userns nomap`. This will map all the UIDs in the container to the subordinate UIDs in the host (so `root` -> `100000`, `dangerzone` -> `101000`). This is not available in older Podman versions though, so we need to mimic what it does in our code.
Podman's implementation can be found here: https://github.com/containers/podman/blob/67c533b85a80fd40228bedbca89a61912ca8a9a5/pkg/util/utils.go#L404. Basically, what Podman does is:
- Read `/etc/sub{u,g}id` and get the ID ranges (subordinate UID, count). Remember that there can be more than one line for the same user.
- [...] `dangerzone`) within this container

We will take advantage of two facts:

- [...] `podman unshare` maps the root of the user namespace to the user in the host.

This way, we can `chown` directories to the `dangerzone` user in the container, without being `root` in the host.
Note that the containers and the folders that are used in each step are:
- [...] (`tmp/`)
- [...] (`~/input_file`)
- [...] (`tmp/pixels/`)
- [...] (`tmp/pixels/`)
- [...] (`safe/`)
- [...] (`tmp/safe/safe-output-compressed.pdf`) to the destination that the user chose (e.g. `~/output_file`)

- [...] (`tmp/`) for the conversion process, and the necessary subdirectories, as usual.
- [...] `podman unshare chown 1001:1001 tmp/*`.
- [...] `podman unshare`.
- [...] `podman info`.
- [...] `/etc/sub{u,g}id`, because it may differ from the user namespace that Podman has already created (e.g., because the user changed it and forgot to run `podman system migrate`).
- [...] `--userns keep-id`. We don't want this as it maps the user in the container to the user in the host.
- [...] `--uidmap 0:1:<num of sub UIDs> --gidmap 0:1:<num of sub GIDs>`: `root` in the container will map to the 1st subordinate UID in the host, and `dangerzone` in the container will map to the 1001st subordinate UID in the host.
- [...] (`tmp/input_file`), instead of its original path.
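As a sketch of how the `--uidmap`/`--gidmap` arguments from the steps above could be assembled in Python, assuming the subordinate ranges have already been collected as `(start, count)` pairs: the function name `nomap_like_flags` is hypothetical, this is not Dangerzone's actual code, and summing the counts of all ranges is an assumption of the example.

```python
from typing import List, Tuple


def nomap_like_flags(
    sub_uid_ranges: List[Tuple[int, int]],
    sub_gid_ranges: List[Tuple[int, int]],
) -> List[str]:
    """Build Podman arguments that map container IDs only to subordinate IDs.

    ID 0 in the container maps to ID 1 of the rootless user namespace, which
    is the first subordinate ID, so the host user itself is never mapped.
    """
    uid_count = sum(count for _, count in sub_uid_ranges)
    gid_count = sum(count for _, count in sub_gid_ranges)
    if uid_count == 0 or gid_count == 0:
        raise ValueError("no subordinate UIDs/GIDs are available for this user")
    return [
        "--uidmap", f"0:1:{uid_count}",
        "--gidmap", f"0:1:{gid_count}",
    ]


print(nomap_like_flags([(100000, 65536)], [(100000, 65536)]))
# ['--uidmap', '0:1:65536', '--gidmap', '0:1:65536']
```

With a single range starting at 100000, UID 1000 in the container then ends up as UID 1001 in the rootless user namespace and as 101000 in the host, which matches the `podman unshare chown 1001:1001` step.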
An interesting side-effect of user namespaces is that we can mount `tmpfs` within that user namespace, which is not possible for the regular user in the host. This means that we can run `podman unshare mount -t tmpfs tmpfs tmp/` in Step 1 and ensure that the sensitive file will never be written to the disk, during the conversion process at least.
We can close this issue once we merge #590, since gVisor will run rootless, and the host user will not be mapped to the inner container. As a bonus, we will remove the `--userns keep-id` flag from the outer container, and make sure to use `--userns nomap` in platforms that have Podman >= 4.1.
Parent issue: https://github.com/freedomofpress/dangerzone/issues/221
User namespaces are very important, since they ensure that:
By ensuring that the user within the container (`dangerzone`, UID 1000) maps to a non-existing user outside the container, we make things significantly harder for an attacker. The current situation is:

- [...] `--userns keep-id`, which makes the `dangerzone` user within the container have the same UID as the user outside the container.

Linux

- [...] `x > 1000` outside the container) before starting the container.
- [...] `x > 1000` outside the container.
- [...] `podman` and specify the mapping for the container.

Windows/MacOS

Test Podman Desktop and check if it uses user namespaces.
Further reading: