[Rootless] --privileged does not grant SYS_ADMIN to non-UID-0

ubergeek77 commented 2 years ago

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

In rootless mode, --privileged does not grant SYS_ADMIN to the container unless the container is running as UID 0.

Steps to reproduce the issue:

Build an image from the following Containerfile by running podman build -t issue-demo . in the same directory:

FROM docker.io/ubuntu:20.04

RUN apt-get update && \
    apt-get install -y fuse2fs && \
    useradd -m demo

USER demo

WORKDIR /home/demo

RUN truncate -s 20M test.img && \
    mkfs.ext4 test.img && \
    mkdir -p mnt

Use --privileged and try to mount test.img as a non-root user with FUSE (which requires SYS_ADMIN). Observe that this fails:

$ podman run --rm -it --privileged --device /dev/fuse issue-demo
demo@8b901f739b3f:~$ fuse2fs -o ro test.img mnt/
Mounting read-only.
fuse: failed to exec fusermount: No such file or directory
demo@8b901f739b3f:~$ ls mnt
demo@8b901f739b3f:~$

Use --cap-add SYS_ADMIN and try to mount test.img as a non-root user with FUSE. Observe that this is successful:

$ podman run --rm -it --cap-add SYS_ADMIN --device /dev/fuse issue-demo
demo@162ab2946b21:~$ fuse2fs -o ro test.img mnt/
Mounting read-only.
demo@162ab2946b21:~$ ls mnt
lost+found
demo@162ab2946b21:~$

Use --privileged once more, but specify -u 0 to run the container as "root". Try to mount the image; observe that this is successful:

$ podman run --rm -it --privileged --device /dev/fuse -u 0 issue-demo
root@62e0f5540ad4:/home/demo# fuse2fs -o ro test.img mnt/
Mounting read-only.
root@62e0f5540ad4:/home/demo# ls mnt
lost+found
root@62e0f5540ad4:/home/demo#

Describe the results you received:

Using the --privileged flag does not grant SYS_ADMIN to non-root container users. It only grants SYS_ADMIN to UID 0.

Using --cap-add SYS_ADMIN properly grants SYS_ADMIN to any container user, regardless of UID.

Describe the results you expected:

I expected the --privileged flag to grant SYS_ADMIN to all container users, regardless of UID.

Additional information you deem important (e.g. issue happens only occasionally):

I am running podman in rootless mode. Unfortunately I am not equipped to test this in root mode. The behavior I described also happens with --userns=keep-id.
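
One way to tell the cases above apart is to check which capability sets actually contain SYS_ADMIN, since a capability that sits only in the bounding set cannot be used directly by the process. The following is a minimal sketch and not part of the original report: the helper name cap_sets_holding and the SAMPLE text are hypothetical, and the sample values are chosen only to illustrate a process that has SYS_ADMIN in its bounding set and nowhere else. Bit 21 is CAP_SYS_ADMIN in linux/capability.h.

```python
CAP_SYS_ADMIN = 21  # bit number of CAP_SYS_ADMIN in <linux/capability.h>
CAP_FIELDS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def cap_sets_holding(status_text, cap_bit):
    """Map each Cap* field of a /proc/<pid>/status dump to whether cap_bit is set."""
    result = {}
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name in CAP_FIELDS:
            mask = int(value.strip(), 16)  # the kernel prints each set as a hex bitmask
            result[name] = bool((mask >> cap_bit) & 1)
    return result


# Hypothetical sample: every set is empty except the bounding set.
SAMPLE = """\
CapInh:\t0000000000000000
CapPrm:\t0000000000000000
CapEff:\t0000000000000000
CapBnd:\t000001ffffffffff
CapAmb:\t0000000000000000
"""

for field, present in cap_sets_holding(SAMPLE, CAP_SYS_ADMIN).items():
    print(f"{field}: {'has' if present else 'lacks'} CAP_SYS_ADMIN")
```

Inside a container, the same function can be fed the contents of /proc/self/status. A direct mount attempt needs the capability in CapEff, not just in CapBnd.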

Output of podman version:

Client:       Podman Engine
Version:      4.0.1
API Version:  4.0.1
Go Version:   go1.17.7
Git Commit:   c8b9a2e3ec3630e9172499e15205c11b823c8107
Built:        Thu Feb 24 16:44:27 2022
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.24.1
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: /usr/bin/conmon is owned by conmon 1:2.1.0-1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: bdb4f6e56cd193d40b75ffc9725d4b74a18cb33c'
  cpus: 12
  distribution:
    distribution: artix
    version: unknown
  eventLogger: file
  hostname: [redacted]
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.16.12-artix1-1
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 13220433920
  memTotal: 50465959936
  networkBackend: cni
  ociRuntime:
    name: crun
    package: /usr/bin/crun is owned by crun 1.4.3-1
    path: /usr/bin/crun
    version: |-
      crun version 1.4.3
      commit: 61c9600d1335127eba65632731e2d72bc3f0b9e8
      spec: 1.0.0
      +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: /usr/bin/slirp4netns is owned by slirp4netns 1.1.12-1
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 8556638208
  swapTotal: 8589930496
  uptime: 17h 30m 8.06s (Approximately 0.71 days)
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/[redacted]/.config/containers/storage.conf # Note - this file does not exist, and I did not create this file, I am using the default config
  containerStore:
    number: 1
    paused: 0
    running: 0
    stopped: 1
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/[redacted]/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 25
  runRoot: /run/user/1000/containers
  volumePath: /home/[redacted]/.local/share/containers/storage/volumes
version:
  APIVersion: 4.0.1
  Built: 1645721067
  BuiltTime: Thu Feb 24 16:44:27 2022
  GitCommit: c8b9a2e3ec3630e9172499e15205c11b823c8107
  GoVersion: go1.17.7
  OsArch: linux/amd64
  Version: 4.0.1

Package info (e.g. output of rpm -q podman or apt list podman):

$ pacman -Q --info podman

Name            : podman
Version         : 4.0.1-1
Description     : Tool and library for running OCI-based containers in pods
Architecture    : x86_64
URL             : https://github.com/containers/podman
Licenses        : Apache
Groups          : None
Provides        : None
Depends On      : conmon  containers-common  crun  fuse-overlayfs  iptables  libdevmapper.so=1.02-64  libgpgme.so=11-64  libseccomp.so=2-64  slirp4netns
Optional Deps   : apparmor: for AppArmor support
                  btrfs-progs: support btrfs backend devices
                  catatonit: --init flag support
                  netavark: for a new container-network-stack implementation
                  podman-compose: for docker-compose compatibility
                  podman-docker: for Docker-compatible CLI
Required By     : None
Optional For    : None
Conflicts With  : None
Replaces        : None
Installed Size  : 79.54 MiB
Packager        : Artix Build Bot <jenkins@artixlinux.org>
Build Date      : Thu 24 Feb 2022 04:44:27 PM UTC
Install Date    : Sun 06 Mar 2022 04:31:11 AM UTC
Install Reason  : Explicitly installed
Install Script  : No
Validated By    : Signature

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes - this is the latest version available to my distribution. I have checked the troubleshooting guide, and a maintainer commented in another issue suggesting I file this issue.

Additional environment details (AWS, VirtualBox, physical, etc.):

Physical headless server

rhatdan commented 2 years ago

Looks like the cap is set to me.

podman run --cap-add SYS_ADMIN fedora capsh --print | grep sys_admin
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_sys_admin,cap_setfcap=eip
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_sys_admin,cap_setfcap
Current IAB: cap_chown,cap_dac_override,!cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,!cap_linux_immutable,cap_net_bind_service,!cap_net_broadcast,!cap_net_admin,!cap_net_raw,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,cap_sys_chroot,!cap_sys_ptrace,!cap_sys_pacct,cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_mknod,!cap_lease,!cap_audit_write,!cap_audit_control,cap_setfcap,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore

I have a feeling masking or some other security feature like SELinux is blocking the access?

ubergeek77 commented 2 years ago

--cap-add SYS_ADMIN was never an issue. That works fine, see step 3 of my issue report.

Can you try your test command again using --privileged? That's where I'm having problems.

mheon commented 2 years ago

@rhatdan You don't have a --user in there.

I recall this being related to Docker compat - Docker does not grant certain capabilities to containers when a non-root user is set, even if the container is privileged. I'm on vacation so I can't chase down specific bugs related to this, but I'm 90% sure this was changed to make our behavior closer to Docker's (and because defaulting to giving less caps is generally more secure, which is itself a strong argument).

rhatdan commented 2 years ago

You are right, we don't give all caps to the default user. We just add them to the bounding set.

$ podman run --privileged --user 1 fedora capsh --print 
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
Ambient set =
Current IAB: 
Securebits: 00/0x0/1'b0 (no-new-privs=0)
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=1(bin) euid=1(bin)
gid=1(bin)
groups=
Guessed mode: UNCERTAIN (0)

rhatdan commented 2 years ago

# docker run --privileged --user 1 fedora capsh --print 
Unable to find image 'fedora:latest' locally
latest: Pulling from library/fedora
edad61c68e67: Pull complete 
Digest: sha256:40ba585f0e25c096a08c30ab2f70ef3820b8ea5a4bdd16da0edbfc0a6952fa57
Status: Downloaded newer image for fedora:latest
Current: =i cap_perfmon,cap_bpf,cap_checkpoint_restore-i
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
Ambient set =
Current IAB: cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1'b0 (no-new-privs=0)
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=1(bin) euid=1(bin)
gid=1(bin)
groups=
Guessed mode: UNCERTAIN (0)

rhatdan commented 2 years ago

Docker is slightly different but the user definitely does not get CAP_SYS_ADMIN.

If running with cap-add, Docker and Podman also differ.

# docker run --cap-add SYS_ADMIN --user 1 fedora capsh --print | grep Current:
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap=i

$ podman run --cap-add SYS_ADMIN --user 1 fedora capsh --print | grep Current:
Current: cap_sys_admin=eip
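
The Current: lines above use libcap's textual capability format, which can be hard to read at a glance. As a rough aid, the sketch below parses a subset of that format into effective (e), inheritable (i) and permitted (p) sets. The function name parse_cap_text and the small universe of capability names are hypothetical; the real tools know the full capability list, and this sketch does not claim to handle every form that cap_from_text(3) accepts.

```python
import re

CLAUSE = re.compile(r"([a-z0-9_,]*)((?:[=+-][eip]*)+)")


def parse_cap_text(text, universe):
    """Parse e.g. 'cap_sys_admin=eip' into {'e': {...}, 'i': {...}, 'p': {...}}."""
    sets = {"e": set(), "i": set(), "p": set()}
    for clause in text.split():
        match = CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        listed, actions = match.groups()
        # An empty list or the word 'all' refers to every known capability.
        names = set(universe) if listed in ("", "all") else set(listed.split(","))
        for op, flags in re.findall(r"([=+-])([eip]*)", actions):
            if op == "=":
                # '=' resets all three flags for the listed caps, then raises the given ones.
                for members in sets.values():
                    members.difference_update(names)
            for flag in flags:
                if op == "-":
                    sets[flag].difference_update(names)
                else:
                    sets[flag].update(names)
    return sets


# Hypothetical three-capability universe, just to keep the demo short.
UNIVERSE = {"cap_chown", "cap_sys_admin", "cap_bpf"}

print(parse_cap_text("cap_sys_admin=eip", UNIVERSE))
print(parse_cap_text("=i cap_bpf-i", UNIVERSE))
```

Read this way, Podman's cap_sys_admin=eip puts SYS_ADMIN in the effective set, while Docker's trailing =i leaves the listed capabilities inheritable only.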

giuseppe commented 2 years ago

From what I can see Docker doesn't have any special handling for uid != 0 and --privileged. I've looked into the generated OCI configuration file and they configure all the capabilities but they do not set Ambient capabilities, so on the exec into the container all the caps are effectively lost.

It seems that a lot of code in our capabilities handling is just there to emulate the issue in Docker. We should do the right thing and set all the caps without any special handling for uid != 0.

I suggest we do something like:

diff --git a/pkg/specgen/generate/security.go b/pkg/specgen/generate/security.go
index 988c29832..c643fde92 100644
--- a/pkg/specgen/generate/security.go
+++ b/pkg/specgen/generate/security.go
@@ -124,7 +124,7 @@ func securityConfigureGenerator(s *specgen.SpecGenerator, g *generate.Generator,
                capsRequiredRequested = strings.Split(val, ",")
            }
        }
-       if !s.Privileged && len(capsRequiredRequested) > 0 {
+       if len(capsRequiredRequested) > 0 {
            // Pass capRequiredRequested in CapAdd field to normalize capabilities names
            capsRequired, err := capabilities.MergeCapabilities(nil, capsRequiredRequested, nil)
            if err != nil {
@@ -158,9 +158,14 @@ func securityConfigureGenerator(s *specgen.SpecGenerator, g *generate.Generator,
        configSpec.Process.Capabilities.Effective = caplist
        configSpec.Process.Capabilities.Permitted = caplist
    } else {
-       mergedCaps, err := capabilities.MergeCapabilities(nil, s.CapAdd, nil)
+       var startingCaps []string
+       if s.Privileged {
+           startingCaps = caplist
+       }
+
+       mergedCaps, err := capabilities.MergeCapabilities(startingCaps, s.CapAdd, s.CapDrop)
        if err != nil {
-           return errors.Wrapf(err, "capabilities requested by user are not valid: %q", strings.Join(s.CapAdd, ","))
+           return err
        }
        boundingSet, err := capabilities.BoundingSet()
        if err != nil {

mheon commented 2 years ago

I believe the first time this came up it was considered a CVE because we were granting "excess capabilities" - so I'm not opposed, but we should be cautious given potential security implications here.

Will try to find the CVE once I'm done reviewing issues and PRs.

giuseppe commented 2 years ago

That would be done only with --privileged. Isn't that the expectation when you use that flag?

mheon commented 2 years ago

Evidently, it is not. I found the CVE:

https://access.redhat.com/security/cve/CVE-2021-20188

It's not specifically SYS_ADMIN (I believe it's DAC_OVERRIDE) but it's definitely a too-many-caps issue.

giuseppe commented 2 years ago

Maybe it is too late to fix it, but I disagree with that CVE and the analysis. The current behavior is quite confusing as it looks like there is a separation between the "container capabilities" and the "PID 1 capabilities". To me, they are the same thing.

The command line I specify, IMO, should apply to the process that is launched. Instead, it seems that --privileged affects future exec sessions when they are running as root.

I think we just depend on buggy behavior from Docker: --privileged sets all the capabilities, but they forget to set the Ambient capabilities, so the end result is that they are lost once the kernel execs the container process.

rhatdan commented 2 years ago

I would prefer we figure out a way for users to specify it, perhaps in containers.conf. I like the separation between the root user having the caps and rootless requiring a setuid app to get them.

giuseppe commented 2 years ago

So is --privileged used only to enable all capabilities in the bounding set? If it is meant to affect exec sessions, there is always the possibility of specifying --privileged for the exec itself:

$ podman run --rm -d --user 100 fedora sleep 100
b063e53126adcfc33dad5d3a1dd88c9a69a3f6e44d48c91d65848b1567193590
$ podman exec -l --user 0 --privileged grep ^Cap /proc/self/status
CapInh: 000001ffffffffff
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 000001ffffffffff

jan-hudec commented 2 years ago

The bounding set should also be usable by setuid binaries, right?

So with --privileged and non-0 uid the main process can't use the capabilities, but any setuid helpers it calls can. This allows restricting how the process can use the capabilities. Basically a --privileged container behaves exactly as the host system in this regard, which I think is a useful use-case.
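
The behavior discussed above follows from the kernel's rules for transforming capability sets across execve(), as documented in capabilities(7). The sketch below models those rules in simplified form. The names CapState and execve_caps are hypothetical, the starting sets are assumptions about what a runtime might configure just before the exec, and details such as securebits and no_new_privs are ignored.

```python
from dataclasses import dataclass

ALL_CAPS = (1 << 41) - 1  # bits 0..40, i.e. the 000001ffffffffff mask seen in this thread


@dataclass(frozen=True)
class CapState:
    inheritable: int
    permitted: int
    effective: int
    bounding: int
    ambient: int


def execve_caps(p, *, runs_as_root=False, setuid_root=False,
                file_inh=0, file_prm=0, file_eff=False):
    """Simplified capabilities(7) execve() transformation (default securebits assumed)."""
    # A file with capabilities or a set-UID bit is "privileged" and clears ambient caps.
    privileged_file = setuid_root or bool(file_inh or file_prm)
    if runs_as_root or setuid_root:
        # For UID 0 the file sets are notionally all ones and the effective bit is on.
        file_inh = file_prm = ALL_CAPS
        file_eff = True
    ambient = 0 if privileged_file else p.ambient
    permitted = (p.inheritable & file_inh) | (file_prm & p.bounding) | ambient
    effective = permitted if file_eff else ambient
    return CapState(p.inheritable, permitted, effective, p.bounding, ambient)


# Assumed starting points, not taken from any real runtime configuration dump.
no_ambient = CapState(ALL_CAPS, ALL_CAPS, ALL_CAPS, ALL_CAPS, 0)
with_ambient = CapState(ALL_CAPS, ALL_CAPS, ALL_CAPS, ALL_CAPS, ALL_CAPS)
bounding_only = CapState(0, 0, 0, ALL_CAPS, 0)

print("non-root, no ambient:   %011x" % execve_caps(no_ambient).effective)
print("non-root, with ambient: %011x" % execve_caps(with_ambient).effective)
print("non-root, plain binary: %011x" % execve_caps(bounding_only).effective)
print("setuid-root helper:     %011x" % execve_caps(bounding_only, setuid_root=True).effective)
```

Under these assumptions, a non-root process keeps its capabilities across exec only when they are in the ambient set, while a setuid-root helper can regain everything that is still in the bounding set.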

giuseppe commented 2 years ago

Thanks for the feedback. Since it is working as expected, I am closing the issue.

ubergeek77 commented 2 years ago

Sorry, I'm not sure I see how this is expected behavior?

How am I supposed to give an arbitrary process, and any process it calls/forks, any permissions they need using only --privileged?

giuseppe commented 2 years ago

You can add them individually with --cap-add. Unfortunately there is a check in podman now that prevents it, but I've opened a PR to make it possible:

https://github.com/containers/podman/pull/13744

$ bin/podman  run --user 100 --cap-add=DAC_OVERRIDE --privileged --rm fedora grep ^Cap /proc/self/status
CapInh: 0000000000000002
CapPrm: 0000000000000002
CapEff: 0000000000000002
CapBnd: 000001ffffffffff
CapAmb: 0000000000000002

$ bin/podman  run --user 100 --cap-add=ALL --privileged --rm fedora grep ^Cap /proc/self/status
CapInh: 000001ffffffffff
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 000001ffffffffff
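
The hex masks in the outputs above are bitmasks indexed by capability number. As a reading aid, the sketch below decodes such a mask into names. The function name decode_caps is hypothetical; the name table follows the numbering in linux/capability.h up to CAP_CHECKPOINT_RESTORE (bit 40), so a newer kernel may define bits this table does not know.

```python
# Capability names in bit order (bit 0 = chown ... bit 40 = checkpoint_restore).
CAP_NAMES = """
chown dac_override dac_read_search fowner fsetid kill setgid setuid setpcap
linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock
ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin
sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore
""".split()


def decode_caps(mask_hex):
    """Return the capability names whose bits are set in a hex mask string."""
    mask = int(mask_hex, 16)
    names = ["cap_" + name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]
    # Report bits beyond the table by number instead of silently dropping them.
    bit = len(CAP_NAMES)
    while mask >> bit:
        if (mask >> bit) & 1:
            names.append(f"bit_{bit}")
        bit += 1
    return names


print(decode_caps("0000000000000002"))
print(len(decode_caps("000001ffffffffff")), "capabilities set")
```

Decoded this way, 0000000000000002 is cap_dac_override alone, and 000001ffffffffff has bits 0 through 40 set, which is all 41 capabilities in the table.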

ubergeek77 commented 2 years ago

Perfect! Thanks for making that PR. That will solve the main problem that led me to make this issue in the first place.

containers/podman issue #13449