containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0
22.37k stars 2.31k forks

newuidmap Fails with “Operation not permitted” When Running Podman Inside amd64 Podman container on macOS with Rosetta #23041

Open samsaket7 opened 2 weeks ago

samsaket7 commented 2 weeks ago

Issue Description

I am hitting an error when running an amd64 Podman container inside another amd64 Podman container on macOS with Rosetta. The failure occurs while newuidmap sets up the user namespace:

    time="2024-06-19T00:52:32Z" level=error msg="running /usr/bin/newuidmap 12 0 1000 1 1 1 999 1000 1001 64535: newuidmap: write to uid_map failed: Operation not permitted\n"
    Error: cannot set up namespace using "/usr/bin/newuidmap": exit status 1

Steps to reproduce the issue

  1. Install Podman (5.1.1) on macOS M1.

  2. Run the following command: podman run --arch=amd64 --user podman --privileged quay.io/podman/stable podman run --security-opt label=disable --arch=amd64 ubi8 echo hello

Describe the results you received

  1. When running the nested Podman command with --arch=amd64, the operation fails with an “Operation not permitted” error during the newuidmap setup.

    time="2024-06-19T00:52:32Z" level=error msg="running `/usr/bin/newuidmap 12 0 1000 1 1 1 999 1000 1001 64535`: newuidmap: write to uid_map failed: Operation not permitted\n"
    Error: cannot set up namespace using "/usr/bin/newuidmap": exit status 1
  2. The same setup works correctly when the outer container uses --arch=arm64:

    podman run --arch=arm64 --user podman --privileged quay.io/podman/stable podman run --security-opt label=disable --arch=amd64 ubi8 echo hello
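
For readers unfamiliar with the newuidmap argument list in the log above, here is a minimal sketch that prints it as mapping triples. It assumes the documented newuidmap usage (a target pid followed by groups of uid, loweruid, count); the function name explain_uidmap is hypothetical and not part of Podman or shadow-utils.

```shell
# Print each uid mapping triple from a newuidmap-style argument list.
# Usage: explain_uidmap PID INSIDE OUTSIDE COUNT [INSIDE OUTSIDE COUNT ...]
# "inside" is the uid in the new user namespace, "outside" is the uid in
# the namespace of the process that calls newuidmap.
explain_uidmap() {
    pid=$1
    shift
    echo "target pid: $pid"
    # Consume the remaining arguments three at a time.
    while [ "$#" -ge 3 ]; do
        echo "inside $1 -> outside $2, count $3"
        shift 3
    done
}

# The arguments from the failing call in this report.
explain_uidmap 12 0 1000 1 1 1 999 1000 1001 64535
```

Read this way, the failing call asks for three ranges to be written to the uid_map of pid 12, and it is that write that returns "Operation not permitted".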

Describe the results you expected

newuidmap should map user IDs without permission errors, allowing Podman to run nested containers with --arch=amd64 on macOS with Rosetta.

podman info output

host:
  arch: arm64
  buildahVersion: 1.36.0
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.10-1.fc40.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.10, commit: '
  cpuUtilization:
    idlePercent: 98.82
    systemPercent: 0.25
    userPercent: 0.93
  cpus: 8
  databaseBackend: sqlite
  distribution:
    distribution: fedora
    variant: coreos
    version: "40"
  eventLogger: journald
  freeLocks: 1800
  hostname: localhost.localdomain
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
    uidmap:
    - container_id: 0
      host_id: 502
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
  kernel: 6.8.11-300.fc40.aarch64
  linkmode: dynamic
  logDriver: journald
  memFree: 5679370240
  memTotal: 16703303680
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.11.0-1.20240531102943328308.main.4.g6838c50.fc40.aarch64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.12.0-dev
    package: netavark-1.11.0-1.20240606174759319307.main.8.gfebe31a.fc40.aarch64
    path: /usr/libexec/podman/netavark
    version: netavark 1.12.0-dev
  ociRuntime:
    name: crun
    package: crun-1.15-1.20240607090105650503.main.32.gea54402.fc40.aarch64
    path: /usr/bin/crun
    version: |-
      crun version UNKNOWN
      commit: 7cfd0aeb40e4605b6b0ee0afd9cfca80f9c5f68a
      rundir: /run/user/502/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20240510.g7288448-1.fc40.aarch64
    version: |
      pasta 0^20240510.g7288448-1.fc40.aarch64-pasta
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/502/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.2-2.fc40.aarch64
    version: |-
      slirp4netns version 1.2.2
      commit: 0ee2d87523e906518d34a6b423271e4826f71faf
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 0
  swapTotal: 0
  uptime: 3h 18m 33.00s (Approximately 0.12 days)
  variant: v8
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/home/core/.config/containers/storage.conf
  containerStore:
    number: 83
    paused: 0
    running: 1
    stopped: 82
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/home/core/.local/share/containers/storage
  graphRootAllocated: 106769133568
  graphRootUsed: 35398475776
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 266
  runRoot: /run/user/502/containers
  transientStore: false
  volumePath: /var/home/core/.local/share/containers/storage/volumes
version:
  APIVersion: 5.1.1
  Built: 1717459200
  BuiltTime: Mon Jun  3 17:00:00 2024
  GitCommit: ""
  GoVersion: go1.22.3
  Os: linux
  OsArch: linux/arm64
  Version: 5.1.1

Podman in a container

Yes

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

baude commented 1 week ago

Any ideas here, @giuseppe?

giuseppe commented 1 week ago

Some weird interaction between binfmt and a file with capabilities (newuidmap).

Could you please share the output of grep . /proc/sys/fs/binfmt_misc/*?
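
If the raw grep output is hard to scan, the relevant fields can be condensed per handler. The following is a rough sketch only: it assumes each handler file contains lines starting with "interpreter " and "flags: ", as current kernels print them, and the function name binfmt_summary is hypothetical.

```shell
# Summarize every registered binfmt_misc handler: name, interpreter, flags.
# The directory is a parameter so the function can be tried on sample files.
binfmt_summary() {
    dir=${1:-/proc/sys/fs/binfmt_misc}
    for entry in "$dir"/*; do
        name=${entry##*/}
        # "register" and "status" are control files, not handlers.
        case $name in register|status) continue ;; esac
        [ -f "$entry" ] || continue
        interp=$(sed -n 's/^interpreter //p' "$entry")
        flags=$(sed -n 's/^flags: *//p' "$entry")
        echo "$name interpreter=$interp flags=${flags:-none}"
    done
}
```

Calling binfmt_summary with no argument inside the machine VM would list the live handlers. A handler whose flags lack C is one where credentials are computed from the interpreter rather than from the binary.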

Luap99 commented 6 days ago

The default binfmt setup doesn't allow setuid binaries: https://docs.kernel.org/admin-guide/binfmt-misc.html

C - credentials

Currently, the behavior of binfmt_misc is to calculate the credentials and security token of the new process according to the interpreter. When this flag is included, these attributes are calculated according to the binary. It also implies the O flag. This feature should be used with care as the interpreter will run with root permissions when a setuid binary owned by root is run with binfmt_misc.

We could of course change the binfmt_misc configs in the machine to set this flag.
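
To make the flag concrete, here is a minimal sketch of a registration string with the C flag set, following the :name:type:offset:magic:mask:interpreter:flags format from the kernel documentation linked above. The handler name, interpreter path, and the shortened magic and mask are hypothetical placeholders, not the values used by podman-machine-os or qemu-user-static.

```shell
# Build a binfmt_misc registration string in the kernel's documented format
#   :name:type:offset:magic:mask:interpreter:flags
# using type M (match on magic bytes) and the default offset.
binfmt_rule() {
    # $1 name, $2 magic, $3 mask, $4 interpreter, $5 flags
    printf ':%s:M::%s:%s:%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Succeed if the flags field (the text after the last colon) contains C.
rule_has_credentials_flag() {
    case ${1##*:} in *C*) return 0 ;; *) return 1 ;; esac
}

# Hypothetical example values; a real handler needs the full ELF magic and mask.
rule=$(binfmt_rule example-x86_64 '\x7fELF' '\xff\xff\xff\xff' /usr/bin/example-emulator OCF)
printf '%s\n' "$rule"
```

On a real system such a string is written as root to /proc/sys/fs/binfmt_misc/register. Since C implies O per the documentation quoted above, the flag sets CF and OCF should behave the same.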

rhatdan commented 6 days ago

If this capability exists, we should take advantage of it. Podman Machines are not expected to have network-facing connections, so the risk of turning something like this on by default is mitigated, and the downsides of people hitting it are big. One question, though: would this affect Rosetta-based systems?

Luap99 commented 6 days ago

Yeah, changing it for Rosetta (x86_64) is simple: https://github.com/containers/podman-machine-os/blob/a52eab8a5fa6790495d90180d69ef94c09f6150e/podman-image-daily/rosetta-activation.sh#L8

We can add the flag there. What I am not sure about is how to configure the qemu-user-static scripts for the other arches to make use of it. I would like it to be consistent, not just working with Rosetta.