apyrgio commented 2 years ago

Parent issue: https://github.com/freedomofpress/dangerzone/issues/221

User namespaces are very important, since they ensure that:

- Root within the container maps to the parent user outside the container.
- Users within the container map to non-existing users outside the container.

By ensuring that the user within the container (dangerzone, UID 1000) maps to a non-existing user outside the container, we make a container escape significantly less useful to an attacker. The current situation is:

- On Linux, we don't take full advantage of user namespaces, since we run containers with --userns keep-id, which gives the dangerzone user within the container the same UID as the user outside the container.
- On Windows/macOS, Docker Desktop does not support user namespaces (see https://github.com/docker/for-win/issues/6897 and https://github.com/docker/for-mac/issues/3280 respectively).

Linux

1. Decide on a UID mapping (1000 inside the container, x > 1000 outside the container) before starting the container.
2. Create temporary directories for container I/O, owned by x > 1000 outside the container.
3. Copy in the source files to the temporary directory for the first container (will also fix https://github.com/freedomofpress/dangerzone/issues/157)
4. Run podman and specify the mapping for the container.
5. Copy out the converted files.

Windows/MacOS

Test Podman Desktop and check if it uses user namespaces.

Linux User Namespaces

Linux User Namespaces were introduced in Linux Kernel 3.8. They look similar to PID namespaces, where PID 1 inside the namespace is mapped to a different PID outside the namespace. However, they are trickier than that, as they are also a namespace for user capabilities. Due to their sensitive nature, several OSes kept them disabled for years after their inclusion, until they reached a stable status.

Let's demystify them:

User namespaces are more than just namespaces for UIDs and GIDs. They are also a namespace for user capabilities (see capabilities(7)), i.e., what makes a user root. We won't touch on this subject here.

All users (unless restricted by system configuration) can create a user namespace. User namespaces are basically a mapping between UIDs/GIDs inside the namespace, and UIDs/GIDs outside the namespace:

<UID in namespace>  <UID in parent namespace>   <range>
<UID in namespace>  <UID in parent namespace>   <range>
...

Examples:

# UID 0 in the namespace maps to UID 0 in the parent namespace, UID 1000 in the
# namespace maps to UID 1000 in the parent namespace, and that's all.
0       0       1
1000    1000    1

# UID 0 in the namespace maps to UID 0 in the parent namespace, UID 1 in
# namespace maps to UID 1 in the parent namespace, and so forth up until UID
# 999 -> 999.
0       0       1000

# UID 0 in the namespace maps to UID 100000 in the parent namespace, UID 1 in
# namespace maps to UID 100001 in the parent namespace, and so forth up until
# UID 65535 -> 165535. This is a pretty typical configuration.
0       100000  65536

# UID 1001 in the namespace maps to UID 1000 in the parent namespace, and that's
# all.
1001    1000    1
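To make the three-column format concrete, here is a minimal Python sketch that parses such a mapping and translates a UID inside the namespace to the parent namespace. The function names (parse_id_map, to_parent) are made up for this illustration and are not part of any library; unmapped IDs are returned as None, whereas the kernel would show them as the overflow ID.

```python
from typing import List, Optional, Tuple

# One entry: (first ID inside the namespace, first ID in the parent, count)
Mapping = List[Tuple[int, int, int]]


def parse_id_map(text: str) -> Mapping:
    """Parse uid_map/gid_map style text, skipping blank lines and '#' comments."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        inside, parent, count = (int(field) for field in line.split())
        entries.append((inside, parent, count))
    return entries


def to_parent(mapping: Mapping, inside_id: int) -> Optional[int]:
    """Translate an ID inside the namespace to the parent namespace.

    Returns None when no entry covers the ID.
    """
    for inside, parent, count in mapping:
        if inside <= inside_id < inside + count:
            return parent + (inside_id - inside)
    return None


typical = parse_id_map("0 100000 65536")
print(to_parent(typical, 0))      # 100000
print(to_parent(typical, 65535))  # 165535
print(to_parent(parse_id_map("1001 1000 1"), 1000))  # None, UID 1000 is unmapped
```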

This mapping is available through /proc/self/{u,g}id_map. For the root namespace, this is an identity mapping (every UID in the namespace maps to the same UID in the parent namespace), whereas for a newly created namespace the mapping is empty by default:

$ cat /proc/self/{u,g}id_map
0          0 4294967295
0          0 4294967295
$ unshare -U cat /proc/self/{u,g}id_map

For user namespaces with empty mappings, we need to have some things in mind:

- The Linux Kernel has an overflow UID (/proc/sys/kernel/overflowuid), which by default is nobody/65534. If a user namespace has no mapping, all IDs in that namespace will show up as nobody.
- Processes in that namespace inherit the UID of the user that started them in the parent namespace, even if they show up as nobody. This means that they can see the files that a user in the parent namespace can see.
- Until a mapping exists, processes within that namespace cannot perform any UID action (e.g., chown), even though they have a UID of their own, because the kernel cannot translate it to a UID in the parent namespace.

This mapping is writable only by processes with sufficient rights, and only once (see user_namespaces(7)).

I think that the simplest mapping that can exist maps a single container UID to the user's own UID in the parent namespace. Anything more than that essentially requires root permissions.

Once a mapping exists, then:

- There can be a UID 0 process in that namespace.
- Any UID/GID action that processes perform will be translated by the Linux Kernel, e.g., for fs permissions.

Rootless Podman and Linux User Namespaces

Let's see how rootless Podman deals with user namespaces.

When Podman creates a new user namespace, it needs to assign a UID mapping to it. Since it's rootless, though, that's not easy, because it lacks the necessary capabilities. That's where the new{u,g}idmap binaries come into play. They are privileged binaries, either setuid or carrying file capabilities (verify this with ls -l $(which newuidmap) or getcap $(which newuidmap), respectively), which consult the /etc/sub{u,g}id files (writable only by root) and assign the mapping. These files have a different format than /proc/self/uid_map:

<username/UID>:<start of subordinate UIDs>:<count>
<username/UID>:<start of subordinate UIDs>:<count>
...

Basically, they define the range of host UIDs (subordinate UIDs) that a user has at their disposal, when creating a container. A range like user:100000:65536 means that the user can specify a UID mapping in the container like 0 100000 65536.
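As an illustration of the file format only, the following Python sketch extracts the subordinate ranges of one user from /etc/subuid-style text. The function name subordinate_ranges and the sample contents are assumptions made for this example. Lines that don't have exactly three colon-separated fields are skipped.

```python
from typing import List, Tuple


def subordinate_ranges(subid_text: str, username: str, uid: int) -> List[Tuple[int, int]]:
    """Return (start, count) for every line that belongs to the user.

    The first field may be a username or a numeric UID, and a user may
    have more than one line.
    """
    ranges = []
    for line in subid_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) != 3:
            continue
        owner, start, count = fields
        if owner in (username, str(uid)):
            ranges.append((int(start), int(count)))
    return ranges


# Hypothetical file contents, for illustration.
sample = "user:100000:65536\nother:165536:65536\n1000:231072:1000\n"
print(subordinate_ranges(sample, "user", 1000))
# [(100000, 65536), (231072, 1000)]
```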

If there are no /etc/sub{u,g}id files, then the default mapping is:

$ podman unshare cat /proc/self/uid_map
0       1000          1

That is, the root in the container maps to the user outside the container, which is the most that the Linux Kernel allows without extra privileges. If the files do exist and contain an entry for the user (e.g., user:100000:65536), the default mapping is:

$ podman unshare cat /proc/self/uid_map
0       1000        1
1       100000      65536

Essentially, the root user in the container maps to the user outside the container, and every other UID in the container maps to UIDs >= 100000 in the host. Also note that rootless Podman creates a single user namespace per host user, so this mapping is shared between all of that user's rootless containers.

Podman has several options to control the mapping (see https://docs.podman.io/en/latest/markdown/podman-run.1.html#userns-mode). Let's see some in action:

# --userns="" (or no --userns passed)
$ podman run -it --rm docker.io/library/alpine:edge cat /proc/self/uid_map
0       1000          1
1     100000      65536

# --userns keep-id
$ podman run -it --rm --userns keep-id docker.io/library/alpine:edge cat /proc/self/uid_map
0           1           1000
1000        0           1
1001        1001        64536

In the first case, we see that the container root maps to the user that started the container, and all UIDs after that match the subordinate UIDs of the user in /etc/subuid.

In the second case, we notice something weird. The root of the container maps to host UID 1, and UID 1000 within the container maps to host UID 0.

This is not the case, of course. Podman uses intermediate UIDs when it performs its own mapping. In practice, the second column stops being "host UID" and becomes "Nth subordinate UID", with 0 standing for the user's own UID. So if /etc/subuid contains user:100000:65536, the above can be translated to:

# --userns keep-id (translated)
$ podman run -it --rm --userns keep-id docker.io/library/alpine:edge cat /proc/self/uid_map
0           100000      1000   # root in the container maps to 1st subordinate UID (100000) up to 100999
1000        1000        1      # 1000 in the container maps to user in the host (1000)
1001        101000      64536  # 1001 in the container maps to 1001st subordinate UID (101000) up to 165535

To make translation easier, one can check the UID mapping from the parent namespace, where they'll get the proper values.
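The translation can also be done by hand, by passing the second column of the container's mapping through the mapping of Podman's rootless user namespace. Below is a rough Python sketch of that composition, assuming host UID 1000 and user:100000:65536 in /etc/subuid as in the examples above. The name compose is made up for this illustration.

```python
from typing import List, Tuple

# One entry: (first ID inside, first ID outside, count)
Mapping = List[Tuple[int, int, int]]


def compose(inner: Mapping, outer: Mapping) -> Mapping:
    """Translate the second column of `inner` through `outer`.

    `inner` maps container IDs to intermediate IDs and `outer` maps
    intermediate IDs to host IDs. Ranges are split wherever they cross
    an `outer` entry; intermediate IDs that `outer` lacks are dropped.
    """
    result = []
    for c_start, i_start, count in inner:
        for o_inside, o_host, o_count in outer:
            lo = max(i_start, o_inside)
            hi = min(i_start + count, o_inside + o_count)
            if lo < hi:
                result.append(
                    (c_start + (lo - i_start), o_host + (lo - o_inside), hi - lo)
                )
    return sorted(result)


# Rootless user namespace: 0 is the user, 1..65536 are the subordinate UIDs.
rootless = [(0, 1000, 1), (1, 100000, 65536)]
# What the container reports with --userns keep-id.
keep_id = [(0, 1, 1000), (1000, 0, 1), (1001, 1001, 64536)]

for entry in compose(keep_id, rootless):
    print(*entry)
# 0 100000 1000
# 1000 1000 1
# 1001 101000 64536
```

The same helper shows what --uidmap 0:1:65536 amounts to: compose([(0, 1, 65536)], rootless) yields the single entry (0, 100000, 65536).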

In the above examples, we see that either the root or the user within the container maps to the user outside the container (1000). We can avoid this with --uidmap 0:1:65536 --gidmap 0:1:65536, which maps the root of the container to the 1st subordinate UID (e.g., 100000), and the rest of the UIDs follow suit. Alternatively, users can pass --userns nomap, but it's only present in recent Podman versions.

Problems with insufficient UID/GID mappings will occur either when pulling an OCI image, or when creating a copy of a layer when attempting to run a container from an image.

apyrgio commented 1 year ago

Dangerzone and Linux User Namespaces

Now that we've seen how Linux User Namespaces work, and how Podman handles them, let's see how Dangerzone should handle them.

Requirements

We'll start with some requirements and how we can cover them for Dangerzone:

1. The user IDs within the Dangerzone container should not map to any user in the host

The reason is that we don't want any container escape to have any effect on the host. The escaped user should effectively be treated as nobody.

The best way to achieve this is to use --userns nomap. This will map all the UIDs in the container to the subordinate UIDs in the host (so root -> 100000, dangerzone -> 101000). This is not available in older Podman versions though, so we need to mimic what it does in our code.

Podman's implementation can be found here: https://github.com/containers/podman/blob/67c533b85a80fd40228bedbca89a61912ca8a9a5/pkg/util/utils.go#L404. Basically, what Podman does is:

- Read /etc/sub{u,g}id and get the ID ranges (subordinate UID, count). Remember that there can be more than one line for the same user.
- Iterate these ranges and create a mapping that starts with UID 0 in the container -> 1st subordinate UID in the host, until it reaches the max number of allowed subordinate UIDs.
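The two steps above can be sketched roughly as follows. This is not Podman's code, just an illustration of the described behavior in Python: the function name nomap_mapping is hypothetical, the ranges are assumed to be already parsed into (start, count) pairs, and the result is expressed in host UIDs rather than in Podman's intermediate IDs.

```python
from typing import List, Tuple


def nomap_mapping(sub_ranges: List[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Build (container ID, host ID, count) entries from subordinate ranges.

    Container ID 0 maps to the first subordinate ID, and every further
    range continues where the previous one stopped. The host user's own
    ID never appears in the result.
    """
    mapping = []
    next_container_id = 0
    for start, count in sub_ranges:
        mapping.append((next_container_id, start, count))
        next_container_id += count
    return mapping


print(nomap_mapping([(100000, 65536)]))
# [(0, 100000, 65536)]
print(nomap_mapping([(100000, 1000), (200000, 64536)]))
# [(0, 100000, 1000), (1000, 200000, 64536)]
```

With the single range user:100000:65536, the dangerzone user (UID 1000) falls 1000 IDs into the first entry, i.e., host UID 101000, which matches the figures above.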

2. The files/folders mounted to the Dangerzone container should be accessible by UID/GID 1000 (`dangerzone`) within this container

We will take advantage of two facts:

- The root in a user namespace can act on behalf of every UID in that namespace.
- podman unshare maps the root of the user namespace to the user in the host.

This way, we can chown directories to the dangerzone user in the container, without being root in the host.

Note that the containers and the folders that are used in each step are:

Pre-conversion step:
- A temporary dir that will hold the artifacts for the whole conversion (e.g., tmp/)
First container:
- File to get converted (e.g., ~/input_file)
- Directory that will hold the pixel data of the conversion (tmp/pixels/)
Second container:
- Directory that holds the pixel data of the previous conversion (tmp/pixels/)
- Directory that holds the final PDF (tmp/safe/)
Post-conversion step:
- Copy the converted file (tmp/safe/safe-output-compressed.pdf) to the destination that the user chose (e.g. ~/output_file)

Proposed Implementation

1. Create the temporary directory (e.g., tmp/) for the conversion process, and the necessary subdirectories, as usual.
2. Copy the file to be converted into the temporary directory.
3. Run podman unshare chown 1001:1001 tmp/*.
   - This means that these files will be owned by the 1001st subordinate UID in the host.
   - This UID will be UID 1000 in the actual container that will do the conversion process.
   - From this point on, the user outside the container will not be able to affect the chown'ed files and dirs, unless they use podman unshare.
4. Get the number of subordinate UIDs using podman info.
   - We must not read /etc/sub{u,g}id, because it may differ from the user namespace that Podman has already created (e.g., because the user changed it and forgot to run podman system migrate).
5. Run the rest of the Dangerzone containers with the following changes:
   - Ditch --userns keep-id. We don't want this as it maps the user in the container to the user in the host.
   - Use --uidmap 0:1:<num of sub UIDs> --gidmap 0:1:<num of sub GIDs>:
     - This means that root in the container will map to the 1st subordinate UID in the host, and dangerzone in the container will map to the 1001st subordinate UID in the host
   - Mount the file to be converted in the container from the temporary directory (e.g., tmp/input_file), instead of its original path.
     - Also fixes #157.
6. Copy the converted file to the destination that the user chose, as usual.
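A minimal sketch of how the chown step and the mapping flags could be assembled in Python follows. The function names are hypothetical and this is not Dangerzone's actual code. The number of subordinate IDs is taken as a parameter, since extracting it from podman info is left out here. Because the command is built as an argument list, the paths have to be listed explicitly rather than globbed with tmp/*.

```python
from typing import List


def userns_flags(num_sub_uids: int, num_sub_gids: int) -> List[str]:
    """Flags mapping container ID 0 to the 1st subordinate ID (intermediate ID 1)."""
    return [
        "--uidmap", f"0:1:{num_sub_uids}",
        "--gidmap", f"0:1:{num_sub_gids}",
    ]


def chown_command(paths: List[str], container_uid: int = 1000,
                  container_gid: int = 1000) -> List[str]:
    """Build the `podman unshare chown` call that hands `paths` to the container user.

    Inside `podman unshare`, ID 0 is the host user and ID N is the Nth
    subordinate ID. Under `--uidmap 0:1:...`, container ID 1000 is
    therefore ID 1001 in that namespace.
    """
    owner = f"{container_uid + 1}:{container_gid + 1}"
    return ["podman", "unshare", "chown", owner, *paths]


print(chown_command(["tmp/pixels", "tmp/safe"]))
# ['podman', 'unshare', 'chown', '1001:1001', 'tmp/pixels', 'tmp/safe']
print(userns_flags(65536, 65536))
# ['--uidmap', '0:1:65536', '--gidmap', '0:1:65536']
```

Either list can be passed to subprocess.run, with the flags spliced into the podman run arguments in place of --userns keep-id.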

Implementation Details

An interesting side-effect of user namespaces is that we can mount tmpfs within that user namespace, which is not possible for the regular user in the host. This means that we can run podman unshare mount -t tmpfs tmpfs tmp/ in Step 1 and ensure that the sensitive file will never be written to the disk, during the conversion process at least.

apyrgio commented 4 months ago

We can close this issue once we merge #590, since gVisor will run rootless, and the host user will not be mapped to the inner container. As a bonus, we will remove the --userns keep-id flag from the outer container, and make sure to use --userns nomap on platforms that have Podman >= 4.1.

freedomofpress / dangerzone

Defense in Depth - User Namespaces #228

