containers / buildah

A tool that facilitates building OCI images.
https://buildah.io
Apache License 2.0

buildah unshare cannot create user namespace but unshare can #5056

Open Syquel opened 1 year ago

Syquel commented 1 year ago

Discussed in https://github.com/containers/buildah/discussions/5053

Originally posted by **Syquel** September 18, 2023

I am trying to run buildah within a podman container within WSL to simulate a capability-restricted environment. In my scenario the command `buildah unshare` is not able to create a new user namespace to mount a container, but `unshare` is.

## WSL Environment

```sh
# uname -a
Linux f6fc426fc67a 5.15.90.4-microsoft-standard-WSL2 #1 SMP Tue Jul 18 21:28:32 UTC 2023 x86_64 GNU/Linux
```

```sh
# podman --version
podman version 4.6.2
```

## Container Environment

The podman container is started via:

```sh
# podman run --rm -it --cap-drop ALL --cap-add CAP_SETFCAP,CAP_SETUID,CAP_SETGID,CAP_DAC_OVERRIDE --security-opt no-new-privileges --device /dev/fuse registry.fedoraproject.org/fedora-minimal:39 /bin/bash
```

After installing `buildah` and `fuse-overlayfs` in the container:

```sh
# buildah --version
buildah version 1.32.0 (image-spec 1.1.0-rc.4, runtime-spec 1.1.0)
```

```sh
# fuse-overlayfs --version
fuse-overlayfs: version 1.12
FUSE library version 3.16.1
using FUSE kernel interface version 7.38
fusermount3 version: 3.16.1
```

## Running Buildah

The new container is created via:

```sh
# buildah from scratch
working-container
```

### Without User Namespace

Trying to mount the container fails as expected without a new user namespace:

```sh
# buildah mount --log-level debug --storage-opt overlay.mount_program=/usr/bin/fuse-overlayfs working-container
DEBU[0000] [graphdriver] trying provided driver "overlay"
DEBU[0000] overlay: mount_program=/usr/bin/fuse-overlayfs
Error: mount /var/lib/containers/storage/overlay:/var/lib/containers/storage/overlay, flags: 0x1000: operation not permitted
DEBU[0000] [graphdriver] trying provided driver "overlay"
DEBU[0000] overlay: mount_program=/usr/bin/fuse-overlayfs
WARN[0000] failed to shutdown storage: "mount /var/lib/containers/storage/overlay:/var/lib/containers/storage/overlay, flags: 0x1000: operation not permitted"
```

### Buildah Unshare

Trying to create a new user namespace via `buildah unshare` fails unexpectedly:

```sh
# buildah --log-level debug unshare --mount working-container
DEBU[0000] effective capabilities: [audit_control=false audit_read=false audit_write=false block_suspend=false bpf=false checkpoint_restore=false chown=false dac_override=false dac_read_search=false fowner=false fsetid=false ipc_lock=false ipc_owner=false kill=false lease=false linux_immutable=false mac_admin=false mac_override=false mknod=false net_admin=false net_bind_service=false net_broadcast=false net_raw=false perfmon=false setfcap=true setgid=true setpcap=false setuid=true sys_admin=false sys_boot=false sys_chroot=true sys_module=false sys_nice=false sys_pacct=false sys_ptrace=false sys_rawio=false sys_resource=false sys_time=false sys_tty_config=false syslog=false wake_alarm=false]
DEBU[0000] Running [buildah-in-a-user-namespace --log-level debug unshare --mount working-container] with environment [HOSTNAME=f6fc426fc67a DISTTAG=f39container PWD=/build container=oci HOME=/root FGC=f39 TERM=xterm SHLVL=1 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin _=/usr/bin/buildah TMPDIR=/var/tmp _CONTAINERS_USERNS_CONFIGURED=1 BUILDAH_ISOLATION=rootless], UID map [{ContainerID:0 HostID:0 Size:4294967295}], and GID map [{ContainerID:0 HostID:0 Size:4294967295}]
DEBU[0000] effective capabilities: [audit_control=true audit_read=true audit_write=true block_suspend=true bpf=true checkpoint_restore=true chown=true dac_override=true dac_read_search=true fowner=true fsetid=true ipc_lock=true ipc_owner=true kill=true lease=true linux_immutable=true mac_admin=true mac_override=true mknod=true net_admin=true net_bind_service=true net_broadcast=true net_raw=true perfmon=true setfcap=true setgid=true setpcap=true setuid=true sys_admin=true sys_boot=true sys_chroot=true sys_module=true sys_nice=true sys_pacct=true sys_ptrace=true sys_rawio=true sys_resource=true sys_time=true sys_tty_config=true syslog=true wake_alarm=true]
DEBU[0000] Running [buildah-in-a-user-namespace-in-a-user-namespace --log-level debug unshare --mount working-container] with environment [HOSTNAME=f6fc426fc67a DISTTAG=f39container PWD=/build container=oci HOME=/root FGC=f39 TERM=xterm SHLVL=1 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin _=/usr/bin/buildah TMPDIR=/var/tmp BUILDAH_ISOLATION=rootless _CONTAINERS_USERNS_CONFIGURED=1 _CONTAINERS_ROOTLESS_UID=0 _CONTAINERS_ROOTLESS_GID=0], UID map [{ContainerID:0 HostID:0 Size:1} {ContainerID:1 HostID:100000 Size:655361}], and GID map [{ContainerID:0 HostID:0 Size:1} {ContainerID:1 HostID:100000 Size:655361}]
DEBU[0000] effective capabilities: [audit_control=true audit_read=true audit_write=true block_suspend=true bpf=true checkpoint_restore=true chown=true dac_override=true dac_read_search=true fowner=true fsetid=true ipc_lock=true ipc_owner=true kill=true lease=true linux_immutable=true mac_admin=true mac_override=true mknod=true net_admin=true net_bind_service=true net_broadcast=true net_raw=true perfmon=true setfcap=true setgid=true setpcap=true setuid=true sys_admin=true sys_boot=true sys_chroot=true sys_module=true sys_nice=true sys_pacct=true sys_ptrace=true sys_rawio=true sys_resource=true sys_time=true sys_tty_config=true syslog=true wake_alarm=true]
DEBU[0000] Running [buildah-in-a-user-namespace-in-a-user-namespace-in-a-user-namespace --log-level debug unshare --mount working-container] with environment [HOSTNAME=f6fc426fc67a DISTTAG=f39container PWD=/build container=oci HOME=/root FGC=f39 TERM=xterm SHLVL=1 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin _=/usr/bin/buildah TMPDIR=/var/tmp BUILDAH_ISOLATION=rootless _CONTAINERS_USERNS_CONFIGURED=1 _CONTAINERS_ROOTLESS_UID=0 _CONTAINERS_ROOTLESS_GID=0], UID map [{ContainerID:0 HostID:0 Size:1} {ContainerID:1 HostID:100000 Size:655361}], and GID map [{ContainerID:0 HostID:0 Size:1} {ContainerID:1 HostID:100000 Size:655361}]
Error: writing "0 0 1\n1 100000 655361\n" to /proc/165/gid_map: write /proc/165/gid_map: operation not permitted
ERRO[0000] writing "0 0 1\n1 100000 655361\n" to /proc/165/gid_map: write /proc/165/gid_map: operation not permitted
ERRO[0000] (Unable to determine exit status)
DEBU[0000] exit status 1
DEBU[0000] exit status 1
```

### Unshare

But creating a new user namespace via `unshare` works:

```sh
# unshare --user --map-auto --map-root-user --mount buildah mount --log-level debug --storage-opt overlay.mount_program=/usr/bin/fuse-overlayfs working-container
DEBU[0000] [graphdriver] trying provided driver "overlay"
DEBU[0000] overlay: mount_program=/usr/bin/fuse-overlayfs
DEBU[0000] backingFs=overlayfs, projectQuotaSupported=false, useNativeDiff=false, usingMetacopy=false
DEBU[0000] Normalized platform linux/amd64 to {amd64 linux [] }
DEBU[0000] overlay: mount_data=lowerdir=/var/lib/containers/storage/overlay/b8fe927025d03a91347333e2229a980e549a5666081acaf21f70017b9122a53f/empty,upperdir=/var/lib/containers/storage/overlay/b8fe927025d03a91347333e2229a980e549a5666081acaf21f70017b9122a53f/diff,workdir=/var/lib/containers/storage/overlay/b8fe927025d03a91347333e2229a980e549a5666081acaf21f70017b9122a53f/work,,volatile
/var/lib/containers/storage/overlay/b8fe927025d03a91347333e2229a980e549a5666081acaf21f70017b9122a53f/merged
DEBU[0000] shutting down the store
```

## GID / UID Mappings

The following subuids / subgids are defined:

```sh
# cat /etc/subuid /etc/subgid
root:100000:655361
root:100000:655361
```

`unshare` creates the following uid / gid mappings:

```sh
# unshare --user --map-auto --map-root-user --mount cat /proc/self/uid_map /proc/self/gid_map
0 0 1
1 100000 655360
0 0 1
1 100000 655360
```

And according to the `buildah unshare` error message it tries to create the gid mappings `0 0 1` and `1 100000 655361`:

```sh
# buildah unshare --mount working-container
Error: writing "0 0 1\n1 100000 655361\n" to /proc/263/gid_map: write /proc/263/gid_map: operation not permitted
```

If I interpret that correctly `buildah unshare` has an off-by-one error and the writing to the gid_map is rejected because it tries to map more gids than there are subgids?

## Question

My question is if I misunderstand / misuse `buildah unshare`, if this is intended behavior or if this is a bug. My expectation has been that the `unshare` command above would have the same result as `buildah unshare`.

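
One way to test the off-by-one hypothesis is to compare the requested mapping against the subordinate ID range directly. The following is a minimal Go sketch, not buildah's own code: `subIDRange`, `parseSubIDLine`, and `fitsRange` are hypothetical helpers introduced only for illustration, and the numbers are the ones quoted in this report.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// subIDRange is one "name:start:count" entry from /etc/subuid or /etc/subgid.
type subIDRange struct {
	Name  string
	Start uint64
	Count uint64
}

// parseSubIDLine parses a single "name:start:count" line (hypothetical helper).
func parseSubIDLine(line string) (subIDRange, error) {
	parts := strings.Split(strings.TrimSpace(line), ":")
	if len(parts) != 3 {
		return subIDRange{}, fmt.Errorf("malformed entry %q", line)
	}
	start, err := strconv.ParseUint(parts[1], 10, 64)
	if err != nil {
		return subIDRange{}, err
	}
	count, err := strconv.ParseUint(parts[2], 10, 64)
	if err != nil {
		return subIDRange{}, err
	}
	return subIDRange{Name: parts[0], Start: start, Count: count}, nil
}

// fitsRange reports whether host IDs [hostID, hostID+size) lie inside r.
func fitsRange(r subIDRange, hostID, size uint64) bool {
	return hostID >= r.Start && hostID+size <= r.Start+r.Count
}

func main() {
	r, err := parseSubIDLine("root:100000:655361")
	if err != nil {
		panic(err)
	}
	// Second line of the mapping buildah tried to write: "1 100000 655361".
	fmt.Println("buildah mapping fits:", fitsRange(r, 100000, 655361))
	// Second line of the mapping unshare wrote: "1 100000 655360".
	fmt.Println("unshare mapping fits:", fitsRange(r, 100000, 655360))
}
```

With these values both mappings stay inside the 655361 subordinate IDs listed in `/etc/subgid`, so the mapping size alone may not explain the rejected write.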
flouthoc commented 1 year ago

Hi @Syquel, it seems there is an unanswered question in the original discussion. Could you continue the discussion there before opening an issue? If there is consensus in the discussion that this is a bug, then please feel free to re-open.

Syquel commented 1 year ago

I converted the discussion to an issue because I am sure that this behavior is not intended.

I don't see any relevant open questions in the discussion, only one that asked about something I had already described extensively at the beginning of the discussion.

flouthoc commented 1 year ago

@Syquel Sure, we can reopen this issue. I saw this question unanswered: https://github.com/containers/buildah/discussions/5053#discussioncomment-7047374

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

Syquel commented 8 months ago

Had some time to dig further through the source code.

Buildah calls unshare.MaybeReexecUsingUserNamespace(true):
https://github.com/containers/buildah/blob/540a73296f945aacbfaed40fa98e2fe86c7e52ac/cmd/buildah/unshare.go#L106C10-L106C39

This method is defined in containers/storage:
https://github.com/containers/storage/blob/91725e06f6f8eb4d1115fba7348c2e9b03874225/pkg/unshare/unshare_linux.go#L475

Re-exec and appending of -in-a-user-namespace to the command name occur here:
https://github.com/containers/storage/blob/91725e06f6f8eb4d1115fba7348c2e9b03874225/pkg/unshare/unshare_linux.go#L549

There are only two points where this method returns without re-execing the current executable:

  1. In case we are already root (os.Geteuid() == 0) and the UID in the parent user namespace is not root (GetRootlessUID() > 0).
    Here we don't return because I was already root in the parent namespace in the podman container (just without the necessary capabilities like CAP_SYS_ADMIN).
    https://github.com/containers/storage/blob/91725e06f6f8eb4d1115fba7348c2e9b03874225/pkg/unshare/unshare_linux.go#L477-L479
  2. In case we are already root (uidNum == 0), the evenForRoot flag is not set and we have the capability CAP_SYS_ADMIN.
    Here we don't return because Buildah explicitly sets the evenForRoot flag to true, when calling unshare.MaybeReexecUsingUserNamespace(true).
    https://github.com/containers/storage/blob/91725e06f6f8eb4d1115fba7348c2e9b03874225/pkg/unshare/unshare_linux.go#L531-L533

So the issue seems to be that there is a recursive loop of buildah unshare calling unshare.MaybeReexecUsingUserNamespace(true) calling buildah unshare again without any applicable terminating condition.
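
The two return points can be modelled in a few lines of Go to show why neither applies here. This is a simplified sketch of the conditions as described above, not the actual containers/storage code; `procState` and `shouldReexec` are hypothetical names, and the state values are taken from the debug log (root in the container, `_CONTAINERS_ROOTLESS_UID=0`, all capabilities reported as true after the first re-exec).

```go
package main

import "fmt"

// procState is a hypothetical, simplified view of what the process observes.
type procState struct {
	EUID        int  // effective UID inside the current namespace
	RootlessUID int  // UID in the parent user namespace
	HasSysAdmin bool // whether CAP_SYS_ADMIN is effective
}

// shouldReexec models the two early-return conditions described above.
// It returns false when one of them applies and true otherwise.
func shouldReexec(s procState, evenForRoot bool) bool {
	// Return point 1: already root, and not root in the parent namespace.
	if s.EUID == 0 && s.RootlessUID > 0 {
		return false
	}
	// Return point 2: already root, evenForRoot unset, CAP_SYS_ADMIN present.
	if s.EUID == 0 && !evenForRoot && s.HasSysAdmin {
		return false
	}
	return true
}

func main() {
	// Root inside the podman container, without CAP_SYS_ADMIN.
	s := procState{EUID: 0, RootlessUID: 0, HasSysAdmin: false}
	for depth := 1; depth <= 3; depth++ {
		fmt.Printf("depth %d: re-exec=%v\n", depth, shouldReexec(s, true))
		// Inside the new user namespace the log reports sys_admin=true,
		// while the rootless UID stays 0, so the state barely changes.
		s.HasSysAdmin = true
	}
}
```

In this model every depth still decides to re-exec, because the rootless UID is 0 and `evenForRoot` is true, which matches the growing `-in-a-user-namespace` suffix in the log.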

I am not sure what the correct fix would be, but I see the following possibilities:

  1. Buildah should not set the evenForRoot flag unconditionally, but based on the result of unshare.isRootless():
    https://github.com/containers/storage/blob/91725e06f6f8eb4d1115fba7348c2e9b03874225/pkg/unshare/unshare_linux.go#L409
  2. Buildah should check the environment variable _CONTAINERS_USERNS_CONFIGURED:
    https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L386
  3. containers/storage should check the environment variable _CONTAINERS_USERNS_CONFIGURED in unshare.MaybeReexecUsingUserNamespace(bool):
    https://github.com/containers/storage/blob/91725e06f6f8eb4d1115fba7348c2e9b03874225/pkg/unshare/unshare_linux.go#L477-L479

The third option seems to be the cleanest solution, but I have no idea whether it would impact other use cases.
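
As a rough illustration of the third option, the guard could look something like the Go sketch below. It is not a patch against containers/storage: `alreadyConfigured` and `wouldReexec` are hypothetical names, and treating any non-empty value of `_CONTAINERS_USERNS_CONFIGURED` as "already configured" is an assumption based on the `=1` value seen in the log.

```go
package main

import (
	"fmt"
	"os"
)

// usernsConfiguredEnv is the variable seen in the debug log environment.
const usernsConfiguredEnv = "_CONTAINERS_USERNS_CONFIGURED"

// alreadyConfigured reports whether an earlier re-exec marked the user
// namespace as set up. Any non-empty value counts (an assumption).
func alreadyConfigured(getenv func(string) string) bool {
	return getenv(usernsConfiguredEnv) != ""
}

// wouldReexec is a hypothetical stand-in for the decision made at the top
// of the re-exec routine: stop as soon as the marker variable is present.
func wouldReexec(getenv func(string) string) bool {
	return !alreadyConfigured(getenv)
}

func main() {
	fmt.Println("would re-exec:", wouldReexec(os.Getenv))
}
```

The lookup function is passed in so the decision can be exercised without touching the real environment. Whether stopping after the first re-exec is acceptable for other callers is exactly the open question raised above.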