containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

First mount of NFS volume fails with "lchown... operation not permitted" error #20801

Closed bcat closed 11 months ago

bcat commented 11 months ago

Issue Description

I'm trying to convert some self-hosted Docker apps (nothing fancy, just a few services in a single homelab VM) to Podman, as I like the greater flexibility it provides around user namespaces. In the process, I noticed what seems to be a regression of #14766. Since that bug is locked, I figured I'd file a new one.

Steps to reproduce the issue

  1. On server example.com, export an NFS share at /path that's owned by a non-root user (e.g., 3000 in the example output below).
  2. On a different machine, create an NFS volume. (Use rootful Podman to avoid issues mounting NFS as an unprivileged user.)
     $ sudo podman volume create -o type=nfs -o device=example.com:/path share
  3. On the same machine as Step 2, run a container mounting the volume.
     $ sudo podman run --rm -v share:/mnt/share docker.io/library/alpine ls -al /mnt/share

Describe the results you received

The first time I run a container mounting the NFS volume, I receive the following error and the container fails to start:

$ sudo podman run --rm -v share:/mnt/share docker.io/library/alpine ls -al /mnt/share
Error: lchown /var/lib/containers/storage/volumes/share/_data: operation not permitted

Subsequent podman run commands using the same volume run successfully and yield the expected output (e.g., listing files in the NFS share in the example above).

Describe the results you expected

I expect the container to run and list files in the mounted NFS volume. For the example above, this should look something like the following:

$ sudo podman run --rm -v share:/mnt/share docker.io/library/alpine ls -al /mnt/share
total 5
drwxr-xr-x    2 3000     3000             2 Nov 27 20:18 .
drwxr-xr-x    1 root     root          4096 Nov 27 22:03 ..

podman info output

host:
  arch: amd64
  buildahVersion: 1.32.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon_2.1.6+ds1-1_amd64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.6, commit: unknown'
  cpuUtilization:
    idlePercent: 99.64
    systemPercent: 0.11
    userPercent: 0.26
  cpus: 2
  databaseBackend: boltdb
  distribution:
    codename: bookworm
    distribution: debian
    version: "12"
  eventLogger: journald
  freeLocks: 2047
  hostname: ivy-test
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.1.0-13-amd64
  linkmode: dynamic
  logDriver: journald
  memFree: 105242624
  memTotal: 2056781824
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns_1.4.0-5_amd64
      path: /usr/lib/podman/aardvark-dns
      version: aardvark-dns 1.4.0
    package: netavark_1.4.0-4_amd64
    path: /usr/lib/podman/netavark
    version: netavark 1.4.0
  ociRuntime:
    name: crun
    package: crun_1.11.1-1_amd64
    path: /usr/bin/crun
    version: |-
      crun version 1.11.1
      commit: 1084f9527c143699b593b44c23555fb3cc4ff2f3
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt_0.0~git20231107.74e6f48-1_amd64
    version: |
      pasta unknown version
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: true
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_1.2.1-1_amd64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.4
  swapFree: 1022869504
  swapTotal: 1023406080
  uptime: 7h 30m 33.00s (Approximately 0.29 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 15262949376
  graphRootUsed: 2983059456
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "true"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 3
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.7.2
  Built: 0
  BuiltTime: Wed Dec 31 18:00:00 1969
  GitCommit: ""
  GoVersion: go1.21.3
  Os: linux
  OsArch: linux/amd64
  Version: 4.7.2

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

No

Additional environment details

I'm running Debian stable (12, "bookworm") with Podman packages from testing (13, "trixie"). This gets me Podman 4.7.2. Version 4.8.0 was just released today and isn't packaged for Debian testing yet, but I don't see anything in the changelog to indicate this behavior has changed.

Side note: Are there plans for an official Podman apt repo like Docker offers? That would be quite handy since Debian releases infrequently, and while it's possible to get newer binaries from testing, it seems like it'd be cleaner to have a dedicated repo.

Additional information

This bug isn't a showstopper since the NeedsChown flag on the volume is still cleared after the first failed mount attempt, but I feel like Podman shouldn't be trying to chown network volumes in the first place. Maybe NeedsChown should always be false if a mount type is specified at volume creation? (When a volume creates a new directory in the host's filesystem, the initial chown makes sense, but when the volume just mounts an existing device, it seems unexpected.)
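
To make the suggestion concrete, here is a minimal sketch of the proposed rule. It is a Python model, not Podman's Go code: the `Volume` class and `needs_chown_at_creation` function are hypothetical names invented for illustration, and the only detail taken from this report is that an NFS volume is created with a `type` option.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Hypothetical stand-in for a volume record (not Podman's real type)."""
    name: str
    options: dict = field(default_factory=dict)


def needs_chown_at_creation(volume: Volume) -> bool:
    """Proposed rule: only chown volumes whose directory the engine creates."""
    # A "type" option means the volume mounts an existing device or export
    # (for example NFS), so its ownership is managed somewhere else.
    return not volume.options.get("type")


local = Volume("data")
nfs = Volume("share", {"type": "nfs", "device": "example.com:/path"})

print(needs_chown_at_creation(local))  # True
print(needs_chown_at_creation(nfs))    # False
```

Under this rule, the NFS volume from the reproduction steps would never be chowned, while a plain local volume would keep the current first-use chown.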

bcat commented 11 months ago

Also, in case the magical chowning is just for Docker compatibility, I verified that Docker doesn't have the same issue. The NFS volume works the first (and every other) time:

$ sudo docker volume create -o type=nfs -o o=addr=example.com -o device=:/path share
share

$ sudo docker run --rm -v share:/mnt/share docker.io/library/alpine ls -al /mnt/share
total 5
drwxr-xr-x    2 3000     3000             2 Nov 27 20:18 .
drwxr-xr-x    1 root     root          4096 Nov 28 00:38 ..

bcat commented 11 months ago

It occurs to me that since --volume already accepts option :U to recursively chown the volume to the container's user, maybe there could be an option :u to never chown the volume. Then there would be three modes to consider and document: unspecified, :u, and :U.

The "unspecified" behavior could be made Docker compatible (e.g., fixing this issue, and #19652 as well), but folks fully integrated into the Podman ecosystem could use :u and :U to get explicit (and arguably more useful) ownership handling. WDYT?
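
A rough sketch of how the three modes could be told apart when parsing a --volume argument. This is a Python model under stated assumptions: `ChownMode` and `parse_chown_mode` are hypothetical names, `:u` is only the option proposed in this comment and does not exist in Podman, and treating `:u` and `:U` as mutually exclusive is my own assumption.

```python
from enum import Enum


class ChownMode(Enum):
    DEFAULT = "unspecified"  # no option given: the behavior under discussion
    NEVER = "u"              # proposed ":u" (hypothetical, not a real option)
    RECURSIVE = "U"          # existing ":U": recursively chown to the container's user


def parse_chown_mode(volume_spec: str) -> ChownMode:
    """Parse SOURCE:DEST[:OPTIONS] and return the requested ownership mode."""
    parts = volume_spec.split(":")
    options = parts[2].split(",") if len(parts) > 2 else []
    if "u" in options and "U" in options:
        # Assumption: asking for both makes no sense, so reject it.
        raise ValueError("':u' and ':U' are mutually exclusive")
    if "U" in options:
        return ChownMode.RECURSIVE
    if "u" in options:
        return ChownMode.NEVER
    return ChownMode.DEFAULT


print(parse_chown_mode("share:/mnt/share"))       # ChownMode.DEFAULT
print(parse_chown_mode("share:/mnt/share:U"))     # ChownMode.RECURSIVE
print(parse_chown_mode("share:/mnt/share:ro,u"))  # ChownMode.NEVER
```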

Also, as an aside: fixing #19652 without also fixing the network volume issue (or adding some option like :u to completely disable the chown behavior) would make this issue much more severe, as every attempt to start a container with an NFS volume would fail, not just the first one. :)

rhatdan commented 11 months ago

The first time we use a volume, we attempt to chown the underlying directory to match the destination. In this case, this seems like a bug. We must not be checking if the volume is already set correctly. I.e., if it is already root, then we should not care that the chown failed.
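
The check described here could look roughly like the following. This is a Python sketch of the idea only, not Podman's actual Go code path: `chown_unless_owned` is a hypothetical helper, and the injectable `lchown` parameter exists purely so the example can simulate a filesystem that refuses chown.

```python
import os
import tempfile


def chown_unless_owned(path, uid, gid, lchown=None):
    """Chown path to uid:gid, skipping the call when ownership already matches.

    Returns True if lchown was called, False if it was skipped.
    """
    st = os.lstat(path)
    if st.st_uid == uid and st.st_gid == gid:
        # Nothing to change, so a filesystem that forbids chown cannot fail here.
        return False
    (lchown or os.lchown)(path, uid, gid)
    return True


def deny(path, uid, gid):
    """Simulate a mount (such as the NFS share above) that rejects chown."""
    raise PermissionError("operation not permitted")


with tempfile.TemporaryDirectory() as d:
    st = os.lstat(d)
    # The owner already matches, so the failing lchown is never invoked.
    print(chown_unless_owned(d, st.st_uid, st.st_gid, lchown=deny))  # False
```

Note that the call still raises when the requested owner differs from the current one, which is the situation in the original report.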

bcat commented 11 months ago

We must not be checking if the volume is already set correctly. I.e., if it is already root, then we should not care that the chown failed.

In my example:

  1. There is no underlying directory in the container image. (I'm mounting a volume on /mnt/share, which is not a directory in the container image.)
  2. The container's initial UID is 0, which (intentionally) differs from the exported NFS share's owner on the remote server. (I am not using any user namespace remapping in this example, for simplicity.)

So I think just skipping (or making optional) the chown if container_owner == volume_owner wouldn't help in this case.

For a more realistic use case, consider the Syncthing container. The container's entrypoint starts as user 0, then drops privileges to run the Syncthing binary as an unprivileged user, say, 2998.

On the remote host, the exported NFS share intentionally has owner 2998, not 0. The idea is that the unprivileged user in the container (2998) should be able to write to the NFS share. So it's intended that the network volume owner (2998) and the container's initial user differ (0). No chown should be attempted even though the two differ.

On Docker, this exact workflow works correctly (Compose file). I am not positive why... maybe Docker doesn't try to chown network volumes at all?

bcat commented 11 months ago

@rhatdan I dug a bit more into the behavioral differences that seem to cause my test case to work in Docker but fail in Podman.

When mounting into a target directory that already exists in the container image, Docker will by default (unless overridden by the nocopy mount option) copy contents of the image's target directory into the mount source directory on the host. This operation includes chmod and chown.

So in Docker, when you mount an NFS share into a directory that already exists in the image (e.g., /mnt in the alpine image), it fails for a similar reason as in Podman:

$ sudo docker volume create -o type=nfs -o o=addr=example.com -o device=:/path share
share

$ sudo docker run --rm -v share:/mnt docker.io/library/alpine ls -al /mnt
docker: Error response from daemon: failed to chmod on /var/lib/docker/volumes/share/_data: chmod /var/lib/docker/volumes/share/_data: operation not permitted.
See 'docker run --help'.

But in Docker, when the mount target path does not exist in the container (e.g., /mnt/share in the alpine image), no chmod or chown on the source path happens, and the operation succeeds:

$ sudo docker volume create -o type=nfs -o o=addr=example.com -o device=:/path share
share

$ sudo docker run --rm -v share:/mnt/share docker.io/library/alpine ls -al /mnt/share
total 5
drwxr-xr-x    2 3000     3000             2 Nov 27 20:18 .
drwxr-xr-x    1 root     root          4096 Nov 28 05:18 ..

Importantly, in Docker, mounting a source directory (including an NFS mount) into a target path that doesn't exist in the container image succeeds, even if the source directory isn't owned by root. If I understand correctly, SafeLchown would still fail in that case.
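
To summarize the Docker behavior as I understand it from the experiments above, here is a small Python model. It is only a sketch of my observations, not Docker's implementation: `docker_modifies_source` and `ALPINE_DIRS` are hypothetical names, and the directory set is an illustrative subset containing just the paths used in this thread.

```python
import posixpath

# Illustrative subset: /mnt exists in the alpine image, /mnt/share does not.
ALPINE_DIRS = {"/", "/mnt"}


def docker_modifies_source(image_dirs, target, nocopy=False):
    """Return True if mounting at target would copy image content into the
    volume source (which includes chmod and chown), per the behavior above."""
    target = posixpath.normpath(target)
    return target in image_dirs and not nocopy


print(docker_modifies_source(ALPINE_DIRS, "/mnt"))               # True
print(docker_modifies_source(ALPINE_DIRS, "/mnt/share"))         # False
print(docker_modifies_source(ALPINE_DIRS, "/mnt", nocopy=True))  # False
```

In this model, only the first case touches ownership on the NFS share, which matches the one Docker run that failed.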