containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Mount point ownership not consistent with Docker's behaviour #19652

Open matejvasek opened 1 year ago

matejvasek commented 1 year ago

Issue Description

When mounting a volume into a container, the mountpoint directory should preserve its ownership. This seems to work only for the very first run/mount; on subsequent mounts the mountpoint directory keeps the ownership set by the first mounter instead of the ownership expected by the new container.

This happens on at least Podman v4.3 through v4.6.

Steps to reproduce the issue

Run the following script against Podman's Docker-compatible socket:

#!/bin/sh

set -e

cat <<EOF > Dockerfile.usera
FROM alpine
USER root
RUN mkdir -p /workspace
RUN chown 1001:1002 /workspace
USER 1001:1002
EOF

cat <<EOF > Dockerfile.userb
FROM alpine
USER root
RUN mkdir -p /workspace
RUN chown 1003:1004 /workspace
USER 1003:1004
EOF

docker build -q . -f Dockerfile.usera -t alpine-usera
docker build -q . -f Dockerfile.userb -t alpine-userb
docker volume rm test-volume || true
docker run --rm -v test-volume:/workspace alpine-usera sh -c 'echo done'
# command below fails on podman because of permissions
docker run --rm -v test-volume:/workspace alpine-userb sh -c 'touch /workspace/b'
docker volume rm test-volume || true

Describe the results you received

The script exits with non-zero exit code and error message.

touch: /workspace/b: Permission denied

Describe the results you expected

The script exits with 0 exit code.

podman info output

host:
  arch: amd64
  buildahVersion: 1.31.2
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc37.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 96.82
    systemPercent: 0.74
    userPercent: 2.44
  cpus: 16
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: workstation
    version: "37"
  eventLogger: journald
  freeLocks: 2011
  hostname: rigel
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.4.9-100.fc37.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 942694400
  memTotal: 67101196288
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.7.0-1.fc37.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.7.0
    package: netavark-1.7.0-1.fc37.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.7.0
  ociRuntime:
    name: crun
    package: crun-1.8.6-1.fc37.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.6
      commit: 73f759f4a39769f60990e7d225f561b4f4f06bcf
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20230625.g32660ce-1.fc37.x86_64
    version: |
      pasta 0^20230625.g32660ce-1.fc37.x86_64
      Copyright Red Hat
      GNU Affero GPL version 3 or later <https://www.gnu.org/licenses/agpl-3.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-8.fc37.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 1392640
  swapTotal: 8589930496
  uptime: 89h 60m 31.00s (Approximately 3.71 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  docker.io:
    Blocked: false
    Insecure: false
    Location: docker.io
    MirrorByDigestOnly: false
    Mirrors:
    - Insecure: false
      Location: quay.io/mvasek
      PullFromMirror: ""
    Prefix: docker.io
    PullFromMirror: ""
  localhost:50000:
    Blocked: false
    Insecure: true
    Location: localhost:50000
    MirrorByDigestOnly: false
    Mirrors: null
    Prefix: localhost:50000
    PullFromMirror: ""
  search:
  - example.com
store:
  configFile: /home/mvasek/.config/containers/storage.conf
  containerStore:
    number: 10
    paused: 0
    running: 1
    stopped: 9
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/mvasek/.local/share/containers/storage
  graphRootAllocated: 1022488477696
  graphRootUsed: 409338372096
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 65
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/mvasek/.local/share/containers/storage/volumes
version:
  APIVersion: 4.6.2-dev
  Built: 1692205486
  BuiltTime: Wed Aug 16 19:04:46 2023
  GitCommit: 8183ba8b256442910154d4d264deac9d12242eae
  GoVersion: go1.20.2
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.2-dev

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

I tested this rootless, but I believe the same thing happens for privileged too.

Additional information

Happens always.

giuseppe commented 1 year ago

I see the ownership is kept if you delete the volume before running the second command.

Does docker automatically delete the volume when the first container exits?

matejvasek commented 1 year ago

@giuseppe no the volume persists.

matejvasek commented 1 year ago

Another way to reproduce: try building an app using the pack CLI with Podman and an untrusted builder.

matejvasek commented 1 year ago

@giuseppe but you might be onto something: the ownership behaves differently the moment I try to write something into the volume.

matejvasek commented 1 year ago

Maybe I isolated the bug in the wrong way, but there are definitely some issues with volume mounting. The pack CLI builds an application in multiple containers that share some data via volumes. With Docker it works; with Podman it fails because of ownership issues.

matejvasek commented 1 year ago
  1. Create a simple Go app (e.g. hello world).
  2. Run pack build my-go-app -Bghcr.io/knative/builder-jammy-full:latest --docker-host=inherit --trust-builder=0.
  3. The build fails because of permissions on the shared volume.

matejvasek commented 1 year ago

@giuseppe try running:

#!/bin/sh

set -e

cat <<EOF > Dockerfile.usera
FROM alpine
USER root
RUN mkdir -p /workspace
RUN chown 1001:1002 /workspace
USER 1001:1002
EOF

cat <<EOF > Dockerfile.userb
FROM alpine
USER root
RUN mkdir -p /workspace
RUN chown 1003:1004 /workspace
USER 1003:1004
EOF

docker build -q . -f Dockerfile.usera -t alpine-usera
docker build -q . -f Dockerfile.userb -t alpine-userb
docker volume rm test-volume || true
docker run --rm -v test-volume:/workspace alpine-usera sh -c 'echo done'
docker run --rm -v test-volume:/workspace alpine-userb sh -c 'touch /workspace/b'
docker volume rm test-volume || true

With Docker it works, but on Podman it fails.

matejvasek commented 1 year ago

@giuseppe note that if the first container actually tried to write to /workspace/, it would fail with Moby too. But in our use case the first container uses the volume as read-only, although it may not declare that via :ro.
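For what it's worth, the read-only intent could be declared explicitly in the repro script (a hypothetical variant; whether declaring :ro changes Podman's chown behaviour here has not been verified):

```shell
# first container only reads the volume, so declare the mount read-only
docker run --rm -v test-volume:/workspace:ro alpine-usera sh -c 'echo done'
# second container still needs write access
docker run --rm -v test-volume:/workspace alpine-userb sh -c 'touch /workspace/b'
```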

giuseppe commented 1 year ago

@mheon do we need to change ownership every time we use the volume in a container?

mheon commented 1 year ago

I have to assume we added that code for a reason, but I can't recall exactly why. Almost certainly a bugfix, but exactly what was being fixed is unclear. Our exact on-mount behavior for volumes, versus Docker's, has been a persistent problem.

matejvasek commented 1 year ago

FYI, in the past even the very first mounting container had the wrong ownership; see https://github.com/containers/podman/pull/10905

mheon commented 1 year ago

I actually cannot find an explicit chown of the volume mountpoint anywhere in the mount code. So I'm actually not 100% on where this is being done; it may be an unintentional side-effect of another chown doing something else?

matejvasek commented 1 year ago

Looks like the chown is called only when the volume is brand new -- created together with a new container.

matejvasek commented 1 year ago

https://github.com/containers/podman/blob/5ea019419cd78457230cf4d15ee459bf4288a1bd/libpod/container_internal_common.go#L2873-L2881

matejvasek commented 1 year ago

wrt:

docker run --rm -v test-volume:/workspace alpine-usera sh -c 'echo done'
docker run --rm -v test-volume:/workspace alpine-userb sh -c 'touch /workspace/b'

It appears that chown is called only for the first container.

matejvasek commented 1 year ago

there is some state, vol.state.NeedsChown -- I assume this ensures that the chown is done only once?

matejvasek commented 1 year ago

The vol.state.NeedsChown flag seems to be cleared by the first chown, done for the first container, so subsequent containers won't chown it.
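The once-only behaviour described above can be sketched as a tiny state machine (a pure-shell simulation; mount_volume and needs_chown are hypothetical stand-ins for libpod's volume handling, not real Podman code):

```shell
# needs_chown starts out true when the volume is created
needs_chown=true

# hypothetical stand-in for the chown done at mount time
mount_volume() {
  if [ "$needs_chown" = "true" ]; then
    needs_chown=false                 # flag cleared after the first mount
    result="chown mountpoint for $1"
  else
    result="skip chown for $1"        # later containers keep the first owner
  fi
  echo "$result"
}

mount_volume alpine-usera   # the only container that gets the chown
mount_volume alpine-userb   # no chown, so 1003:1004 cannot write later
```

Making the chown unconditional would mean always taking the first branch, regardless of the flag.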

matejvasek commented 1 year ago

@giuseppe how important is vol.state.NeedsChown?

matejvasek commented 1 year ago

https://github.com/containers/podman/blob/5ea019419cd78457230cf4d15ee459bf4288a1bd/libpod/container_internal_common.go#L2837-L2838

mheon commented 1 year ago

Ah, OK -- per @matejvasek it's fixVolumePermissions()

Looking further, it's tied to a bool, NeedsChown, in the volume config: set to true at volume create, set to false once permissions have been fixed during the first mount into a container. Dropping the bool entirely and making the chown unconditional ought to fix this?

matejvasek commented 1 year ago

@mheon I believe it will fix the issue, but I don't know if it could have any adverse effects.

rhatdan commented 1 year ago

It is not doing a recursive chown, correct? I think the goal there was to make sure the volume is owned by the primary user of the container. I think I had a PR on this code at one point to attempt to change it, but I gave up: https://github.com/containers/podman/pull/16782

matejvasek commented 1 year ago

make sure the volume is owned by the primary user of the container.

Small correction: the primary user's uid/gid is used only if the mount point does not already exist in the container image. If the mount point exists (as a directory), then the uid/gid of that directory is used.

matejvasek commented 1 year ago

Setting ownership just once makes sense if you assume that the volume will always be used by just a single container. However, that's not my case: the pack CLI runs multiple containers in sequence on one volume.

mheon commented 1 year ago

If there's a reason we added this originally, it's Docker compat. If Docker doesn't do the same thing, then that reason is not valid.

matejvasek commented 1 year ago

If there's a reason we added this originally, it's Docker compat. If Docker doesn't do the same thing, then that reason is not valid.

What do you mean by this? The fact that we do the chown, or the fact that we do it only once?

mheon commented 1 year ago

Only once. There's no reason we'd add such complexity other than to match Docker.

matejvasek commented 1 year ago

Looks like NeedsChown was introduced in https://github.com/containers/podman/pull/6747 which was fixing https://github.com/containers/podman/issues/5698.

matejvasek commented 1 year ago

The issue does not directly mention Docker.

matejvasek commented 1 year ago

Also, the --userns option that was supposed to be fixed does not even exist on Docker. So I believe that NeedsChown has nothing to do with Docker compatibility.

matejvasek commented 1 year ago

@giuseppe do you recall why NeedsChown was needed?

giuseppe commented 1 year ago

I don't remember more than what is in the original issue.

Also, how does it work on Docker if multiple containers use the same volume? Is it chown'ed to the last user?

matejvasek commented 1 year ago

Also, how does it work on Docker if multiple containers use the same volume? Is it chown'ed to the last user?

Yes, it appears so. It is chowned to the uid/gid of the mountpoint directory for each container.
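One way to observe this directly (a sketch using the images from the repro script; requires a running container daemon, and under Docker each stat should report the uid:gid baked into that image):

```shell
docker volume rm test-volume || true
# Docker re-chowns the mountpoint for each container:
docker run --rm -v test-volume:/workspace alpine-usera stat -c '%u:%g' /workspace
docker run --rm -v test-volume:/workspace alpine-userb stat -c '%u:%g' /workspace
# Podman only chowns on the first mount, so both runs report the first uid:gid
```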

matejvasek commented 1 year ago

See the test in the bug description: the second container creates a file in the mountpoint directory. It works on Docker, but not on Podman.

giuseppe commented 1 year ago

but what happens if you keep the first container running?

Does the second container override the first chown?

matejvasek commented 1 year ago

It runs in series, not in parallel.

matejvasek commented 1 year ago

My use case is sequential.

matejvasek commented 1 year ago

Does the second container override the first chown?

But yes, that's what appears to happen.

giuseppe commented 1 year ago

sure, but if we are going to change the current behavior to match Docker for compatibility, we need to address all the cases, not just one.

What I am arguing is that the Docker mechanism of always chowning is subject to race conditions:

# (sleep 0.5; docker run --rm -v test-volume:/workspace alpine-userb true) & docker run  --rm -v test-volume:/workspace alpine-usera sh -c 'stat -c %u:%g /workspace ; sleep 1; stat -c %u:%g /workspace' 
1001:1002
1003:1004

The ownership of a volume can change while a container is using it.

The Podman behavior makes more sense; we can probably just document it and not aim for compatibility.

matejvasek commented 1 year ago

is subject to race conditions:

Well, if somebody runs two containers at the same time on the same volume, then they had it coming. Also, even in the current form this is a problem: only one of the containers "wins" and forces its ownership.

matejvasek commented 1 year ago

@giuseppe I'll try to modify the pack CLI to work even with Podman's behaviour. We'll see if it's possible.

matejvasek commented 1 year ago

The thing is that the pack CLI first runs a root-owned container on the volume. All subsequent containers then cannot create files in the volume as a consequence.

mheon commented 1 year ago

I think that we should probably change to match Docker here -- i.e., an unconditional chown on mount into a container, even if the volume is in use by another container. It's racy, but that's sort of expected: using containers with different UIDs/GIDs on the same volume is always going to be questionable.

Luap99 commented 1 year ago

If we do it, then we should add some option to preserve the current behaviour. I can easily see this being a big performance hit if we start to chown on each container start; consider volumes with many files.

matejvasek commented 1 year ago

If we do it, then we should add some option to preserve the current behaviour. I can easily see this being a big performance hit if we start to chown on each container start; consider volumes with many files.

Isn't the chown non-recursive?

matejvasek commented 1 year ago

I believe in the past it was recursive, but now it is not, right?

Luap99 commented 1 year ago

Ah, you are right, the normal volume chown is not recursive. Then this should not be a problem, although it seems very weird not to chown recursively. If the first container created files in the volume and we just chown the parent directory, how is the second container supposed to read the files?

mheon commented 1 year ago

I don't think that's expected to work. We want the new container to at least be able to write to the mounted volume. If the goal is for multiple containers using different UIDs and GIDs to read from the same volume, I don't know if that's possible right now, nor do I think it's necessarily expected.

rhatdan commented 1 year ago

It should not recursively chown, although we do have :U, which I think does chown recursively.
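As a possible workaround along those lines, the :U volume option could be sketched like this (hypothetical for this repro; :U asks Podman to recursively chown the volume contents to match the container's user):

```shell
# recursively chown the volume to userb's uid:gid at mount time
podman run --rm -v test-volume:/workspace:U alpine-userb sh -c 'touch /workspace/b'
```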

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.