containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Mounts not recursive anymore with '--userns=keep-id' #14183

Closed ensc closed 2 years ago

ensc commented 2 years ago

/kind bug

Description

When using --userns=keep-id, mounts are no longer recursive. This seems to be a regression because it worked in previous podman versions (I think podman-3.4.7-1.fc35.x86_64 broke it). I have not tested it with podman 4.

NOTE: this can be a security issue because it makes it possible to break recursive mounts and reveal the original content.

Steps to reproduce the issue:

  1. create some deeply mounted filesystem structure
# mkdir -p /tmp/foo/bar
# mount --bind /proc /tmp/foo/bar
# setenforce 0        # (only for this test...)
  2. create a rootless container and mount the parent directory (/tmp/foo) into it
$ podman run --rm --mount=type=bind,src=/tmp/foo,dst=/mnt fedora:35 ls -la /mnt/bar/self
lrwxrwxrwx. 1 nobody nobody 0 May  6 10:39 /mnt/bar/self -> 707487
  3. do the same again, but now with --userns=keep-id
$ podman run --rm --userns=keep-id --mount=type=bind,src=/tmp/foo,dst=/mnt fedora:35 ls -la /mnt/bar/self
ls: cannot access '/mnt/bar/self': No such file or directory
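
The two podman invocations above differ only in the --userns option, so they can be wrapped in a small comparison script. This is a minimal sketch and not part of the original report: the function names podman_ls_cmd and compare_userns and the DRY_RUN variable are made up for illustration, and the script assumes the bind mount at /tmp/foo/bar already exists. By default it only prints the commands.

```shell
#!/bin/sh
# Print the podman command from the report, optionally with a --userns value.
#   podman_ls_cmd          -> default user namespace
#   podman_ls_cmd keep-id  -> --userns=keep-id
podman_ls_cmd() {
    userns_opt=""
    if [ -n "${1:-}" ]; then
        userns_opt=" --userns=$1"
    fi
    printf 'podman run --rm%s --mount=type=bind,src=/tmp/foo,dst=/mnt fedora:35 ls -la /mnt/bar/self\n' "$userns_opt"
}

# Show (DRY_RUN=1, the default) or execute (DRY_RUN=0) both variants in turn.
compare_userns() {
    for mode in "" keep-id; do
        cmd=$(podman_ls_cmd "$mode")
        echo "== userns: ${mode:-default} =="
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "$cmd"
        else
            sh -c "$cmd" || echo "(command failed)"
        fi
    done
}
```

Calling compare_userns with DRY_RUN=0 as the rootless user runs both commands back to back, which makes the difference in the ls result easy to see side by side.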

Describe the results you received:

In step 3, the procfs is not mounted.

Output of podman version:

Version:      3.4.7
API Version:  3.4.7
Go Version:   go1.16.15
Built:        Thu Apr 21 15:14:26 2022
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.0-2.fc35.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: '
  cpus: 8
  distribution:
    distribution: fedora
    version: "35"
  eventLogger: journald
  hostname: sinclair.bigo.ensc.de
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 1000000000
      size: 1000000
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 1000000000
      size: 1000000
  kernel: 5.17.5-200.fc35.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 9175855104
  memTotal: 33301999616
  ociRuntime:
    name: crun
    package: crun-1.4.4-1.fc35.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.4.4
      commit: 6521fcc5806f20f6187eb933f9f45130c86da230
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc35.x86_64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 2147479552
  swapTotal: 2147479552
  uptime: 106h 7m 40.69s (Approximately 4.42 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/ensc/.config/containers/storage.conf
  containerStore:
    number: 4
    paused: 0
    running: 4
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.7.1-2.fc35.x86_64
      Version: |-
        fusermount3 version: 3.10.5
        fuse-overlayfs: version 1.7.1
        FUSE library version 3.10.5
        using FUSE kernel interface version 7.31
  graphRoot: /.local/home/ensc/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 272
  runRoot: /run/user/1000/containers
  volumePath: /.local/home/ensc/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.7
  Built: 1650546866
  BuiltTime: Thu Apr 21 15:14:26 2022
  GitCommit: ""
  GoVersion: go1.16.15
  OsArch: linux/amd64
  Version: 3.4.7

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.4.7-1.fc35.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

No

Additional environment details (AWS, VirtualBox, physical, etc.):

rhatdan commented 2 years ago

@giuseppe PTAL

giuseppe commented 2 years ago

Rootless always uses recursive because that is enforced from the kernel.

Have you made the mount shared or slave? Please show the output of findmnt -R -o PROPAGATION,TARGET /.
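
On a host with many mounts, the findmnt output can be long. The following is a minimal sketch of a filter that keeps only the entries whose propagation is something other than plain "shared"; the function name non_shared_mounts is hypothetical, and the filter assumes the two-column PROPAGATION,TARGET layout requested above.

```shell
#!/bin/sh
# Read `findmnt -R -o PROPAGATION,TARGET /` output on stdin and print every
# mount whose propagation column is not exactly "shared".
# Output format: <propagation><TAB><target>
non_shared_mounts() {
    awk '
        $1 == "PROPAGATION" { next }        # skip the header row
        NF >= 2 && $1 != "shared" {
            target = $0
            sub("^[^/]*", "", target)       # drop the first column and tree-drawing characters
            print $1 "\t" target
        }
    '
}
```

Usage would be `findmnt -R -o PROPAGATION,TARGET / | non_shared_mounts`. An empty result means every listed mount is reported as shared.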

ensc commented 2 years ago

@giuseppe it is shared; from the example above

shared      /
...
shared      ├─/tmp
shared      │ └─/tmp/foo/bar

When I mount it with rbind, it works as before (mount --rbind /proc /tmp/foo/bar). Probably the kernel prevents the unwanted behaviour only in this case.

But there is still a difference between --userns=keep-id and the default userns which did not exist before podman 3.4.7.

giuseppe commented 2 years ago

Could it be a timing issue?

If you create the bind mount after the podman user+mount namespace was created, then it might not be propagated.

Could you try running podman system migrate after you create the bind mount? Does it make any difference?
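
The suggested check can be written down as a short sequence. This is a sketch under assumptions: the helper names run and check_after_migrate and the DRY_RUN variable are invented for illustration, and the podman command line is the one from the original report. It prints the commands by default. Note that, as far as I know, podman system migrate stops running containers along with the pause process, so it should not be run casually on a host with containers in use.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 to actually run the commands.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Recreate podman's rootless namespaces after the bind mount exists,
# then repeat the failing command from the report.
check_after_migrate() {
    run podman system migrate
    run podman run --rm --userns=keep-id \
        --mount=type=bind,src=/tmp/foo,dst=/mnt fedora:35 ls -la /mnt/bar/self
}
```

If the ls succeeds after the migrate step, that would point to the timing explanation rather than to a change in how the mount itself is set up.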

giuseppe commented 2 years ago

I think the difference is this: when the mount namespace is created, all the mounts in it are rbind mounts, so it is not possible to look beneath a mount. If the mount is created later and then propagated into the namespace, hiding what is beneath it no longer matters, because an open file descriptor to the previous path could already have been obtained.

I am closing the issue because I can reproduce the same behavior with 4.0, 3.4.7, and 3.4.1, but feel free to comment further.
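
One way to see which mounts exist on the host but not inside podman's rootless namespace is to compare two mount listings. This is a minimal sketch, with the function name missing_in_namespace and the file names host.txt and ns.txt made up for illustration. It assumes that podman unshare runs its command in the same user and mount namespace that rootless containers start from, which matches the description above but is not confirmed in this thread.

```shell
#!/bin/sh
# Print mount targets listed in file $1 (host view) that are absent from
# file $2 (view from inside podman's rootless namespace).
#
# The two files could be produced like this:
#   findmnt -R -n -l -o TARGET / > host.txt
#   podman unshare findmnt -R -n -l -o TARGET / > ns.txt
#   missing_in_namespace host.txt ns.txt
missing_in_namespace() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
    # grep exits non-zero when nothing is printed; treat that as success.
    grep -F -x -v -f "$2" "$1" || true
}
```

A mount that was created after the namespace and did not propagate into it would show up in the output of missing_in_namespace.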

debarshiray commented 2 years ago

I suspect that this is the same as https://github.com/containers/toolbox/issues/1073

This is how I usually run my containers:

$ podman run -it --rm --security-opt label=disable --userns keep-id -v $HOME:$HOME:rslave -v /run/media:/run/media:rslave registry.fedoraproject.org/fedora:36 bash

If I mount something on the host after starting the container, then everything is good. However, if I mount something on the host before starting the container, then things get interesting.

For example, I plugged in a USB stick with the Fedora 36 Workstation Live ISO before starting the container, and then once I ran it:

$ ls /run/media/rishi/Fedora-WS-Live-36-1-5/
ls: cannot open directory '/run/media/rishi/Fedora-WS-Live-36-1-5/': Permission denied
$ ls -l /run/media/rishi
total 0
drwx------. 2 nobody nobody 40 Oct 27 16:55 Fedora-WS-Live-36-1-5

I don't know why the Fedora-WS-Live-36-1-5 directory is owned by nobody:nobody, because on the host it's owned by me:

$ ls -l /run/media/rishi/
total 2
drwxr-xr-x. 1 rishi rishi 2048 May  4 23:36 Fedora-WS-Live-36-1-5

I exited the container, and tried a simpler mount:

$ mkdir ~/tmp
$ sudo mount -t tmpfs none ~/tmp
$ echo "hello world" >~/tmp/hello-world
$ ls -ld ~/tmp
drwxrwxrwt. 2 root root 60 Oct 27 19:04 /home/rishi/tmp
$ ls -l ~/tmp/hello-world 
-rw-r--r--. 1 rishi rishi 12 Oct 27 19:04 /home/rishi/tmp/hello-world

Then I ran the container again:

$ cat ~/tmp/hello-world
cat: /home/rishi/tmp/hello-world: No such file or directory
$ ls ~/tmp
$ ls -ld ~/tmp
drwxr-xr-x. 2 rishi rishi 4096 Oct 27 16:31 /home/rishi/tmp
$ id
uid=1000(rishi) gid=1000(rishi) groups=1000(rishi)

It looks like the container doesn't know about the mount on the host at ~/tmp at all.

However, as mentioned before, things work as expected if I run the container before the mounts are created.

It starts to work if I drop --userns keep-id and run my containers as:

$ podman run -it --rm --security-opt label=disable -v $HOME:$HOME:rslave -v /run/media:/run/media:rslave registry.fedoraproject.org/fedora:36 bash

With the USB stick:

# ls /run/media/rishi/Fedora-WS-Live-36-1-5/
EFI  Fedora-Legal-README.txt  LICENSE  LiveOS  images  isolinux
# ls -l /run/media/rishi/
total 2
drwxr-xr-x. 1 root root 2048 May  4 21:36 Fedora-WS-Live-36-1-5

With the tmpfs mount at ~/tmp:

# cat /home/rishi/tmp/hello-world 
hello world
# ls -l /home/rishi/tmp
total 4
-rw-r--r--. 1 root root 12 Oct 27 17:14 hello-world
# ls -ld /home/rishi/tmp
drwxrwxrwt. 2 nobody nobody 60 Oct 27 17:14 /home/rishi/tmp

With both mounts present on the host:

$ findmnt --submounts --output PROPAGATION,TARGET /
PROPAGATION TARGET
shared      /
shared      ├─/run
shared      │ └─/run/media/rishi/Fedora-WS-Live-36-1-5
shared      ├─/home
shared      │ └─/home/rishi/tmp

debarshiray commented 2 years ago

I tried to reproduce this with bind mounts in the host's mount and user namespaces, without any containers or child namespaces around, with:

$ sudo mount --rbind -o rslave /run/media ~/devel/foo

Things work as expected regardless of whether I plug in the USB stick before I create the bind mount or after:

$ ls ~/devel/foo/rishi/Fedora-WS-Live-36-1-5/
EFI  Fedora-Legal-README.txt  images  isolinux  LICENSE  LiveOS