containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Permission denied when container process executes `close_range` syscall #10337

Closed smac89 closed 3 years ago

smac89 commented 3 years ago

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

I have an application that uses the close_range syscall and runs inside a container. When I run the container and the application makes that syscall, it fails with an "Operation not permitted" (EPERM) error.

At first I was thinking this was a problem with the application, but after some investigating, I am starting to think this may be a podman issue and may have something to do with how it handles seccomp profiles.

Steps to reproduce the issue:

walk.c

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Show the contents of the symbolic links in /proc/self/fd */
static void show_fds(void)
{
    DIR *dirp = opendir("/proc/self/fd");
    if (dirp == NULL) {
        perror("opendir");
        exit(EXIT_FAILURE);
    }

    for (;;) {
        struct dirent *dp = readdir(dirp);
        if (dp == NULL)
            break;

        if (dp->d_type == DT_LNK) {
            char path[PATH_MAX], target[PATH_MAX];
            snprintf(path, sizeof(path), "/proc/self/fd/%s", dp->d_name);

            ssize_t len = readlink(path, target, sizeof(target));
            if (len == -1) {
                /* Skip entries that vanish or cannot be read */
                perror(path);
                continue;
            }
            printf("%s ==> %.*s\n", path, (int) len, target);
        }
    }

    closedir(dirp);
}

int main(int argc, char *argv[])
{
    /* Open every file named on the command line so there is something to close */
    for (int j = 1; j < argc; j++) {
        int fd = open(argv[j], O_RDONLY);
        if (fd == -1) {
            perror(argv[j]);
            exit(EXIT_FAILURE);
        }
        printf("%s opened as FD %d\n", argv[j], fd);
    }

    show_fds();
    printf("========= About to call close_range() =======\n");

    /* Close every descriptor from 3 upwards via the raw syscall */
    if (syscall(__NR_close_range, 3, ~0U, 0) == -1) {
        perror("close_range");
        exit(EXIT_FAILURE);
    }

    show_fds();
    exit(EXIT_SUCCESS);
}
```

  1. Save the program above as /tmp/walk.c on your host machine

  2. Using buildah:

buildah bud --no-cache --platform linux/amd64 -f - /tmp <<'EOF'
FROM alpine:edge
RUN apk update && apk add --upgrade build-base libc-dev linux-headers
COPY walk.c /app/walk.c
RUN gcc -o /app/walk /app/walk.c
ENTRYPOINT ["/app/walk"]
EOF
  3. Run the resulting image with podman (replace 7bd46f9814bb with the ID of the built image):
podman run --rm -it 7bd46f9814bb /app/walk.c

Describe the results you received:

The result will look something like:

/app/walk.c opened as FD 3
/proc/self/fd/0 ==> /dev/pts/0
/proc/self/fd/1 ==> /dev/pts/0
/proc/self/fd/2 ==> /dev/pts/0
/proc/self/fd/3 ==> /app/walk.c
/proc/self/fd/4 ==> /proc/1/fd
========= About to call close_range() =======
close_range: Operation not permitted

Describe the results you expected:

Now repeat the same process directly on your Linux host (assuming it runs at least kernel version 5.9, the first release with close_range).

The program should run successfully with an output similar to:

/tmp/walk.c opened as FD 3
/proc/self/fd/0 ==> /dev/pts/1
/proc/self/fd/1 ==> /dev/pts/1
/proc/self/fd/2 ==> /dev/pts/1
/proc/self/fd/3 ==> /tmp/walk.c
/proc/self/fd/4 ==> /proc/547032/fd
========= About to call close_range() =======
/proc/self/fd/0 ==> /dev/pts/1
/proc/self/fd/1 ==> /dev/pts/1
/proc/self/fd/2 ==> /dev/pts/1
/proc/self/fd/3 ==> /proc/547032/fd

This is what I expected to see inside the container as well.

Additional information you deem important (e.g. issue happens only occasionally):

If you run the image with the option --security-opt seccomp=unconfined, everything works fine.

Does that mean podman is simply blocking the close_range syscall? Where does podman's default seccomp.json file live? I was under the impression that they use the default one from docker, which whitelists close_range syscall.

Output of podman version:

Version:      3.1.2
API Version:  3.1.2
Go Version:   go1.16.3
Git Commit:   51b8ddbc22cf5b10dd76dd9243924aa66ad7db39
Built:        Wed Apr 21 15:34:03 2021
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.20.1
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: /usr/bin/conmon is owned by conmon 1:2.0.27-1
    path: /usr/bin/conmon
    version: 'conmon version 2.0.27, commit: 65fad4bfcb250df0435ea668017e643e7f462155'
  cpus: 12
  distribution:
    distribution: arcolinux
    version: unknown
  eventLogger: journald
  hostname: ArcoB
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
  kernel: 5.11.16-arch1-1
  linkmode: dynamic
  memFree: 18868236288
  memTotal: 41711120384
  ociRuntime:
    name: crun
    package: /usr/bin/crun is owned by crun 0.19.1-1
    path: /usr/bin/crun
    version: |-
      crun version 0.19.1
      commit: 1535fedf0b83fb898d449f9680000f729ba719f5
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    selinuxEnabled: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: /usr/bin/slirp4netns is owned by slirp4netns 1.1.9-1
    version: |-
      slirp4netns version 1.1.9
      commit: 4e37ea557562e0d7a64dc636eff156f64927335e
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.1
  swapFree: 32211202048
  swapTotal: 32211202048
  uptime: 6h 42m 15.48s (Approximately 0.25 days)
registries:
  search:
  - docker.io
  - ghcr.io
store:
  configFile: /home/chigozirim/.config/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 1
    stopped: 1
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: /usr/bin/fuse-overlayfs is owned by fuse-overlayfs 1.5.0-1
      Version: |-
        fusermount3 version: 3.10.3
        fuse-overlayfs: version 1.5
        FUSE library version 3.10.3
        using FUSE kernel interface version 7.31
  graphRoot: /home/chigozirim/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 5
  runRoot: /run/user/1000/containers
  volumePath: /home/chigozirim/.local/share/containers/storage/volumes
version:
  APIVersion: 3.1.2
  Built: 1619040843
  BuiltTime: Wed Apr 21 15:34:03 2021
  GitCommit: 51b8ddbc22cf5b10dd76dd9243924aa66ad7db39
  GoVersion: go1.16.3
  OsArch: linux/amd64
  Version: 3.1.2

Package info (e.g. output of rpm -q podman or apt list podman):

Name                  : podman
Version               : 3.1.2-1
Description           : Tool and library for running OCI-based containers in
                        pods
URL                   : https://github.com/containers/libpod
Licenses              : Apache
Repository            : community
Installed Size        : 76.0 MB
Depends On            : cni-plugins conmon containers-common device-mapper
                        iptables libseccomp runc slirp4netns libsystemd
                        fuse-overlayfs libgpgme.so=11-64
Optional Dependencies : podman-docker: for Docker-compatible CLI [Installed]
                        btrfs-progs: support btrfs backend devices [Installed]
                        catatonit: --init flag support [Installed]
                        crun: support for unified cgroupsv2 [Installed]
Make Dependencies     : btrfs-progs go go-md2man git gpgme systemd
Packager              : Morten Linderud <foxboron@archlinux.org>
Build Date            : 2021-04-21
Install Date          : 2021-04-21
Install Reason        : Explicitly installed
Signatures            : Yes
Backup files          : /etc/cni/net.d/87-podman-bridge.conflist

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

mheon commented 3 years ago

Doesn't look like Seccomp. Our default profile lives at https://github.com/containers/common/blob/master/pkg/seccomp/seccomp.json#L80 and you can see that close_range is in the list of allowed calls.

smac89 commented 3 years ago

@mheon Do you have any other explanation for this behavior?

The reason I brought up seccomp is that, as I said, using --security-opt seccomp=unconfined allows the container to run just fine. So why does this flag work if the problem has nothing to do with seccomp?

I've used strace in the real container: once with the flag and once without. With the flag, the strace log shows that the close_range syscall succeeds:

574   close_range(0, -1, CLOSE_RANGE_CLOEXEC <unfinished ...>
557   <... poll resumed>)               = 0 (Timeout)
559   futex(0x55b1c61a09b0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
574   <... close_range resumed>)        = 0

Without the flag, we get the following:

571   close_range(0, -1, CLOSE_RANGE_CLOEXEC) = -1 EPERM (Operation not permitted)
571   +++ exited with 127 +++

(The number beside each syscall is the process ID.)
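For longer strace logs, a small filter can pull out only the completed calls that failed. The following is a minimal sketch, assuming log lines shaped like the ones quoted above; `failed_call` and `FAILED_CALL` are hypothetical names chosen for this illustration, not part of strace or podman.

```python
import re

# Matches completed calls that returned -1, in either of these forms:
#   571   close_range(0, -1, CLOSE_RANGE_CLOEXEC) = -1 EPERM (Operation not permitted)
#   [pid   706] close_range(0, -1, CLOSE_RANGE_CLOEXEC) = -1 EPERM (Operation not permitted)
FAILED_CALL = re.compile(
    r"^\s*(?:\[pid\s+)?(\d+)\]?\s+(\w+)\(.*\)\s+=\s+-1\s+(E[A-Z0-9]+)"
)


def failed_call(line):
    """Return (pid, syscall, errno_name) for a failed call, else None."""
    match = FAILED_CALL.match(line)
    if match is None:
        return None
    pid, name, err = match.groups()
    return int(pid), name, err


log = [
    "574   close_range(0, -1, CLOSE_RANGE_CLOEXEC <unfinished ...>",
    "574   <... close_range resumed>)        = 0",
    "571   close_range(0, -1, CLOSE_RANGE_CLOEXEC) = -1 EPERM (Operation not permitted)",
    "571   +++ exited with 127 +++",
]
for entry in log:
    hit = failed_call(entry)
    if hit is not None:
        print(hit)  # (571, 'close_range', 'EPERM')
```

Unfinished and resumed fragments are deliberately ignored here, so a call that fails after being interrupted by another thread's output would need extra handling.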

mheon commented 3 years ago

Can you verify which profile is in use in the container you're running? The default Podman profile does allow the syscall, so I have to assume your system may not be using the default.

mheon commented 3 years ago

The default profile should live at /usr/share/containers/seccomp.json. However, if an alternative is present at /etc/containers/seccomp.json we will use that one instead.
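As a rough way to check this without reading the JSON by eye, the sketch below picks whichever of the two paths mentioned above exists (preferring the override in /etc) and tests whether a syscall is allowed with no conditions attached. The helper names `pick_profile` and `unconditionally_allowed` are made up for this example, and the rule layout is assumed to match the profile format shown elsewhere in this thread.

```python
import json
from pathlib import Path

# Paths quoted in the thread; the copy in /etc takes precedence when present.
DEFAULT_PROFILE = Path("/usr/share/containers/seccomp.json")
OVERRIDE_PROFILE = Path("/etc/containers/seccomp.json")


def pick_profile(default=DEFAULT_PROFILE, override=OVERRIDE_PROFILE):
    """Return the profile path that would be preferred, or None if neither exists."""
    if override.is_file():
        return override
    if default.is_file():
        return default
    return None


def unconditionally_allowed(profile, syscall):
    """True if a rule allows `syscall` with no arg, arch, or capability conditions."""
    for rule in profile.get("syscalls", []):
        if syscall not in rule.get("names", []):
            continue
        if rule.get("action") != "SCMP_ACT_ALLOW":
            continue
        # Empty args/includes/excludes means the rule applies without checks.
        if not rule.get("args") and not rule.get("includes") and not rule.get("excludes"):
            return True
    return False


# Tiny in-memory profile (illustrative data, not the real default profile).
sample = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["close", "close_range"], "action": "SCMP_ACT_ALLOW",
         "args": [], "includes": {}, "excludes": {}},
        {"names": ["chroot"], "action": "SCMP_ACT_ALLOW",
         "args": [], "includes": {"caps": ["CAP_SYS_CHROOT"]}, "excludes": {}},
    ],
}
print(unconditionally_allowed(sample, "close_range"))  # True
print(unconditionally_allowed(sample, "chroot"))       # False
```

To run it against an installed profile, something like `json.loads(pick_profile().read_text())` can replace `sample`, after checking that `pick_profile()` did not return `None`.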

smac89 commented 3 years ago

> Can you verify what profile is in use in the container you're running in? The default Podman profile does allow the syscall, so I have to assume your system may not be using the default

How do I do this, please?

I did:

podman create <image_hash>
podman inspect <container_name>
The output:

``` [ { "Id": "03803ce5d0f2421e3e4a0778c9262d834f91ad3df96ff962147cb767667d4478", "Created": "2021-04-25T17:29:52.519720505-06:00", "Path": "/app/walk", "Args": [ "/app/walk" ], "State": { "OciVersion": "1.0.2-dev", "Status": "configured", "Running": false, "Paused": false, "Restarting": false, "OOMKilled": false, "Dead": false, "Pid": 0, "ExitCode": 0, "Error": "", "StartedAt": "0001-01-01T00:00:00Z", "FinishedAt": "0001-01-01T00:00:00Z", "Healthcheck": { "Status": "", "FailingStreak": 0, "Log": null } }, "Image": "b8adaf3fbdcf539038216f0e061b638003ac8708cc6933177dd9f8dba0c4cd4e", "ImageName": "b8adaf3fbdc", "Rootfs": "", "Pod": "", "ResolvConfPath": "", "HostnamePath": "", "HostsPath": "", "StaticDir": "/home/chigozirim/.local/share/containers/storage/overlay-containers/03803ce5d0f2421e3e4a0778c9262d834f91ad3df96ff962147cb767667d4478/userdata", "OCIRuntime": "crun", "ConmonPidFile": "/run/user/1000/containers/overlay-containers/03803ce5d0f2421e3e4a0778c9262d834f91ad3df96ff962147cb767667d4478/userdata/conmon.pid", "Name": "priceless_meninsky", "RestartCount": 0, "Driver": "overlay", "MountLabel": "", "ProcessLabel": "", "AppArmorProfile": "", "EffectiveCaps": [ "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_NET_BIND_SERVICE", "CAP_SETFCAP", "CAP_SETGID", "CAP_SETPCAP", "CAP_SETUID", "CAP_SYS_CHROOT" ], "BoundingCaps": [ "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_NET_BIND_SERVICE", "CAP_SETFCAP", "CAP_SETGID", "CAP_SETPCAP", "CAP_SETUID", "CAP_SYS_CHROOT" ], "ExecIDs": [], "GraphDriver": { "Name": "overlay", "Data": { "LowerDir": "/home/chigozirim/.local/share/containers/storage/overlay/899938f8a7d4f906eda9dda6f1a413cd792177f6cb2af01d18fd215eab659cd5/diff:/home/chigozirim/.local/share/containers/storage/overlay/30d61bb737bb9be7178afce441d0ca5098909a59001a0301d3b50544e659ace1/diff", "UpperDir": 
"/home/chigozirim/.local/share/containers/storage/overlay/dbc15944f329eec9343405100a0d3095cffd6b0ed5885f365cdfbb7e327817fc/diff", "WorkDir": "/home/chigozirim/.local/share/containers/storage/overlay/dbc15944f329eec9343405100a0d3095cffd6b0ed5885f365cdfbb7e327817fc/work" } }, "Mounts": [], "Dependencies": [], "NetworkSettings": { "EndpointID": "", "Gateway": "", "IPAddress": "", "IPPrefixLen": 0, "IPv6Gateway": "", "GlobalIPv6Address": "", "GlobalIPv6PrefixLen": 0, "MacAddress": "", "Bridge": "", "SandboxID": "", "HairpinMode": false, "LinkLocalIPv6Address": "", "LinkLocalIPv6PrefixLen": 0, "Ports": {}, "SandboxKey": "" }, "ExitCommand": [ "/usr/bin/podman", "--root", "/home/chigozirim/.local/share/containers/storage", "--runroot", "/run/user/1000/containers", "--log-level", "warning", "--cgroup-manager", "systemd", "--tmpdir", "/run/user/1000/libpod/tmp", "--runtime", "crun", "--storage-driver", "overlay", "--storage-opt", "overlay.mount_program=/usr/bin/fuse-overlayfs", "--events-backend", "journald", "container", "cleanup", "03803ce5d0f2421e3e4a0778c9262d834f91ad3df96ff962147cb767667d4478" ], "Namespace": "", "IsInfra": false, "Config": { "Hostname": "03803ce5d0f2", "Domainname": "", "User": "", "AttachStdin": false, "AttachStdout": false, "AttachStderr": false, "Tty": false, "OpenStdin": false, "StdinOnce": false, "Env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "TERM=xterm", "container=podman" ], "Cmd": null, "Image": "b8adaf3fbdc", "Volumes": null, "WorkingDir": "/", "Entrypoint": "/app/walk", "OnBuild": null, "Labels": { "io.buildah.version": "1.20.1" }, "Annotations": { "io.kubernetes.cri-o.TTY": "false", "io.podman.annotations.autoremove": "FALSE", "io.podman.annotations.init": "FALSE", "io.podman.annotations.privileged": "FALSE", "io.podman.annotations.publish-all": "FALSE" }, "StopSignal": 15, "CreateCommand": [ "podman", "create", "b8adaf3fbdc" ], "Umask": "0022" }, "HostConfig": { "Binds": [], "CgroupManager": "systemd", 
"CgroupMode": "private", "ContainerIDFile": "", "LogConfig": { "Type": "k8s-file", "Config": null, "Path": "/home/chigozirim/.local/share/containers/storage/overlay-containers/03803ce5d0f2421e3e4a0778c9262d834f91ad3df96ff962147cb767667d4478/userdata/ctr.log", "Tag": "", "Size": "0B" }, "NetworkMode": "slirp4netns", "PortBindings": {}, "RestartPolicy": { "Name": "", "MaximumRetryCount": 0 }, "AutoRemove": false, "VolumeDriver": "", "VolumesFrom": null, "CapAdd": [], "CapDrop": [ "CAP_AUDIT_WRITE", "CAP_MKNOD", "CAP_NET_RAW" ], "Dns": [], "DnsOptions": [], "DnsSearch": [], "ExtraHosts": [], "GroupAdd": [], "IpcMode": "private", "Cgroup": "", "Cgroups": "default", "Links": null, "OomScoreAdj": 0, "PidMode": "private", "Privileged": false, "PublishAllPorts": false, "ReadonlyRootfs": false, "SecurityOpt": [], "Tmpfs": {}, "UTSMode": "private", "UsernsMode": "", "ShmSize": 65536000, "Runtime": "oci", "ConsoleSize": [ 0, 0 ], "Isolation": "", "CpuShares": 0, "Memory": 0, "NanoCpus": 0, "CgroupParent": "user.slice", "BlkioWeight": 0, "BlkioWeightDevice": null, "BlkioDeviceReadBps": null, "BlkioDeviceWriteBps": null, "BlkioDeviceReadIOps": null, "BlkioDeviceWriteIOps": null, "CpuPeriod": 0, "CpuQuota": 0, "CpuRealtimePeriod": 0, "CpuRealtimeRuntime": 0, "CpusetCpus": "", "CpusetMems": "", "Devices": [], "DiskQuota": 0, "KernelMemory": 0, "MemoryReservation": 0, "MemorySwap": 0, "MemorySwappiness": 0, "OomKillDisable": false, "PidsLimit": 2048, "Ulimits": [], "CpuCount": 0, "CpuPercent": 0, "IOMaximumIOps": 0, "IOMaximumBandwidth": 0, "CgroupConf": null } } ] ```
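The inspect output is long, so it may help to extract only the field that relates to seccomp. This is a sketch under one loudly stated assumption: that an explicit `--security-opt seccomp=...` flag is recorded as a `seccomp=<value>` string in `HostConfig.SecurityOpt`. The function name `seccomp_setting` is hypothetical.

```python
import json


def seccomp_setting(inspect_json):
    """Return the explicit seccomp option of the first container, or None.

    Assumption: explicit options appear as "seccomp=<value>" strings in
    HostConfig.SecurityOpt. None means no explicit option was recorded.
    """
    containers = json.loads(inspect_json)
    options = containers[0].get("HostConfig", {}).get("SecurityOpt") or []
    for option in options:
        key, _, value = option.partition("=")
        if key == "seccomp":
            return value
    return None


# Trimmed-down stand-ins for `podman inspect` output.
default_case = '[{"HostConfig": {"SecurityOpt": []}}]'
unconfined_case = '[{"HostConfig": {"SecurityOpt": ["seccomp=unconfined"]}}]'
print(seccomp_setting(default_case))     # None
print(seccomp_setting(unconfined_case))  # unconfined
```

For the container above, `SecurityOpt` is empty, which suggests that no profile was selected explicitly and that the default lookup applies.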

smac89 commented 3 years ago

I've also checked the installed profile (both /usr/share/containers/seccomp.json and /etc/containers/seccomp.json are the same), and here it is:

seccomp.json

``` { "defaultAction": "SCMP_ACT_ERRNO", "archMap": [ { "architecture": "SCMP_ARCH_X86_64", "subArchitectures": [ "SCMP_ARCH_X86", "SCMP_ARCH_X32" ] }, { "architecture": "SCMP_ARCH_AARCH64", "subArchitectures": [ "SCMP_ARCH_ARM" ] }, { "architecture": "SCMP_ARCH_MIPS64", "subArchitectures": [ "SCMP_ARCH_MIPS", "SCMP_ARCH_MIPS64N32" ] }, { "architecture": "SCMP_ARCH_MIPS64N32", "subArchitectures": [ "SCMP_ARCH_MIPS", "SCMP_ARCH_MIPS64" ] }, { "architecture": "SCMP_ARCH_MIPSEL64", "subArchitectures": [ "SCMP_ARCH_MIPSEL", "SCMP_ARCH_MIPSEL64N32" ] }, { "architecture": "SCMP_ARCH_MIPSEL64N32", "subArchitectures": [ "SCMP_ARCH_MIPSEL", "SCMP_ARCH_MIPSEL64" ] }, { "architecture": "SCMP_ARCH_S390X", "subArchitectures": [ "SCMP_ARCH_S390" ] } ], "syscalls": [ { "names": [ "_llseek", "_newselect", "accept", "accept4", "access", "adjtimex", "alarm", "bind", "brk", "capget", "capset", "chdir", "chmod", "chown", "chown32", "clock_adjtime", "clock_adjtime64", "clock_getres", "clock_getres_time64", "clock_gettime", "clock_gettime64", "clock_nanosleep", "clock_nanosleep_time64", "clone", "close", "close_range", "connect", "copy_file_range", "creat", "dup", "dup2", "dup3", "epoll_create", "epoll_create1", "epoll_ctl", "epoll_ctl_old", "epoll_pwait", "epoll_pwait2", "epoll_wait", "epoll_wait_old", "eventfd", "eventfd2", "execve", "execveat", "exit", "exit_group", "faccessat", "faccessat2", "fadvise64", "fadvise64_64", "fallocate", "fanotify_mark", "fchdir", "fchmod", "fchmodat", "fchown", "fchown32", "fchownat", "fcntl", "fcntl64", "fdatasync", "fgetxattr", "flistxattr", "flock", "fork", "fremovexattr", "fsconfig", "fsetxattr", "fsmount", "fsopen", "fspick", "fstat", "fstat64", "fstatat64", "fstatfs", "fstatfs64", "fsync", "ftruncate", "ftruncate64", "futex", "futimesat", "get_robust_list", "get_thread_area", "getcpu", "getcwd", "getdents", "getdents64", "getegid", "getegid32", "geteuid", "geteuid32", "getgid", "getgid32", "getgroups", "getgroups32", "getitimer", "getpeername", 
"getpgid", "getpgrp", "getpid", "getppid", "getpriority", "getrandom", "getresgid", "getresgid32", "getresuid", "getresuid32", "getrlimit", "getrusage", "getsid", "getsockname", "getsockopt", "gettid", "gettimeofday", "getuid", "getuid32", "getxattr", "inotify_add_watch", "inotify_init", "inotify_init1", "inotify_rm_watch", "io_cancel", "io_destroy", "io_getevents", "io_setup", "io_submit", "ioctl", "ioprio_get", "ioprio_set", "ipc", "keyctl", "kill", "lchown", "lchown32", "lgetxattr", "link", "linkat", "listen", "listxattr", "llistxattr", "lremovexattr", "lseek", "lsetxattr", "lstat", "lstat64", "madvise", "memfd_create", "mincore", "mkdir", "mkdirat", "mknod", "mknodat", "mlock", "mlock2", "mlockall", "mmap", "mmap2", "mount", "move_mount", "mprotect", "mq_getsetattr", "mq_notify", "mq_open", "mq_timedreceive", "mq_timedsend", "mq_unlink", "mremap", "msgctl", "msgget", "msgrcv", "msgsnd", "msync", "munlock", "munlockall", "munmap", "name_to_handle_at", "nanosleep", "newfstatat", "open", "openat", "openat2", "open_tree", "pause", "pidfd_getfd", "pidfd_open", "pidfd_send_signal", "pipe", "pipe2", "pivot_root", "poll", "ppoll", "ppoll_time64", "prctl", "pread64", "preadv", "preadv2", "prlimit64", "pselect6", "pselect6_time64", "pwrite64", "pwritev", "pwritev2", "read", "readahead", "readlink", "readlinkat", "readv", "reboot", "recv", "recvfrom", "recvmmsg", "recvmsg", "remap_file_pages", "removexattr", "rename", "renameat", "renameat2", "restart_syscall", "rmdir", "rt_sigaction", "rt_sigpending", "rt_sigprocmask", "rt_sigqueueinfo", "rt_sigreturn", "rt_sigsuspend", "rt_sigtimedwait", "rt_tgsigqueueinfo", "sched_get_priority_max", "sched_get_priority_min", "sched_getaffinity", "sched_getattr", "sched_getparam", "sched_getscheduler", "sched_rr_get_interval", "sched_setaffinity", "sched_setattr", "sched_setparam", "sched_setscheduler", "sched_yield", "seccomp", "select", "semctl", "semget", "semop", "semtimedop", "send", "sendfile", "sendfile64", "sendmmsg", "sendmsg", 
"sendto", "setns", "set_robust_list", "set_thread_area", "set_tid_address", "setfsgid", "setfsgid32", "setfsuid", "setfsuid32", "setgid", "setgid32", "setgroups", "setgroups32", "setitimer", "setpgid", "setpriority", "setregid", "setregid32", "setresgid", "setresgid32", "setresuid", "setresuid32", "setreuid", "setreuid32", "setrlimit", "setsid", "setsockopt", "setuid", "setuid32", "setxattr", "shmat", "shmctl", "shmdt", "shmget", "shutdown", "sigaltstack", "signalfd", "signalfd4", "sigreturn", "socketcall", "socketpair", "splice", "stat", "stat64", "statfs", "statfs64", "statx", "symlink", "symlinkat", "sync", "sync_file_range", "syncfs", "sysinfo", "syslog", "tee", "tgkill", "time", "timer_create", "timer_delete", "timer_getoverrun", "timer_gettime", "timer_gettime64", "timer_settime", "timerfd_create", "timerfd_gettime", "timerfd_gettime64", "timerfd_settime", "timerfd_settime64", "times", "tkill", "truncate", "truncate64", "ugetrlimit", "umask", "umount", "umount2", "uname", "unlink", "unlinkat", "unshare", "utime", "utimensat", "utimensat_time64", "utimes", "vfork", "wait4", "waitid", "waitpid", "write", "writev" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": {}, "excludes": {} }, { "names": [ "personality" ], "action": "SCMP_ACT_ALLOW", "args": [ { "index": 0, "value": 0, "valueTwo": 0, "op": "SCMP_CMP_EQ" } ], "comment": "", "includes": {}, "excludes": {} }, { "names": [ "personality" ], "action": "SCMP_ACT_ALLOW", "args": [ { "index": 0, "value": 8, "valueTwo": 0, "op": "SCMP_CMP_EQ" } ], "comment": "", "includes": {}, "excludes": {} }, { "names": [ "personality" ], "action": "SCMP_ACT_ALLOW", "args": [ { "index": 0, "value": 131072, "valueTwo": 0, "op": "SCMP_CMP_EQ" } ], "comment": "", "includes": {}, "excludes": {} }, { "names": [ "personality" ], "action": "SCMP_ACT_ALLOW", "args": [ { "index": 0, "value": 131080, "valueTwo": 0, "op": "SCMP_CMP_EQ" } ], "comment": "", "includes": {}, "excludes": {} }, { "names": [ "personality" ], 
"action": "SCMP_ACT_ALLOW", "args": [ { "index": 0, "value": 4294967295, "valueTwo": 0, "op": "SCMP_CMP_EQ" } ], "comment": "", "includes": {}, "excludes": {} }, { "names": [ "sync_file_range2" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "arches": [ "ppc64le" ] }, "excludes": {} }, { "names": [ "arm_fadvise64_64", "arm_sync_file_range", "sync_file_range2", "breakpoint", "cacheflush", "set_tls" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "arches": [ "arm", "arm64" ] }, "excludes": {} }, { "names": [ "arch_prctl" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "arches": [ "amd64", "x32" ] }, "excludes": {} }, { "names": [ "modify_ldt" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "arches": [ "amd64", "x32", "x86" ] }, "excludes": {} }, { "names": [ "s390_pci_mmio_read", "s390_pci_mmio_write", "s390_runtime_instr" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "arches": [ "s390", "s390x" ] }, "excludes": {} }, { "names": [ "open_by_handle_at" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "caps": [ "CAP_DAC_READ_SEARCH" ] }, "excludes": {} }, { "names": [ "bpf", "fanotify_init", "lookup_dcookie", "perf_event_open", "quotactl", "setdomainname", "sethostname", "setns" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "caps": [ "CAP_SYS_ADMIN" ] }, "excludes": {} }, { "names": [ "chroot" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "caps": [ "CAP_SYS_CHROOT" ] }, "excludes": {} }, { "names": [ "delete_module", "init_module", "finit_module", "query_module" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "caps": [ "CAP_SYS_MODULE" ] }, "excludes": {} }, { "names": [ "get_mempolicy", "mbind", "set_mempolicy" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "caps": [ "CAP_SYS_NICE" ] }, "excludes": {} }, { "names": [ "acct" ], 
"action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "caps": [ "CAP_SYS_PACCT" ] }, "excludes": {} }, { "names": [ "kcmp", "process_madvise", "process_vm_readv", "process_vm_writev", "ptrace" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "caps": [ "CAP_SYS_PTRACE" ] }, "excludes": {} }, { "names": [ "iopl", "ioperm" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "caps": [ "CAP_SYS_RAWIO" ] }, "excludes": {} }, { "names": [ "settimeofday", "stime", "clock_settime", "clock_settime64" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "caps": [ "CAP_SYS_TIME" ] }, "excludes": {} }, { "names": [ "vhangup" ], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": { "caps": [ "CAP_SYS_TTY_CONFIG" ] }, "excludes": {} }, { "names": [ "socket" ], "action": "SCMP_ACT_ERRNO", "args": [ { "index": 0, "value": 16, "valueTwo": 0, "op": "SCMP_CMP_EQ" }, { "index": 2, "value": 9, "valueTwo": 0, "op": "SCMP_CMP_EQ" } ], "comment": "", "includes": {}, "excludes": { "caps": [ "CAP_AUDIT_WRITE" ] }, "errnoRet": 22 }, { "names": [ "socket" ], "action": "SCMP_ACT_ALLOW", "args": [ { "index": 2, "value": 9, "valueTwo": 0, "op": "SCMP_CMP_NE" } ], "comment": "", "includes": {}, "excludes": { "caps": [ "CAP_AUDIT_WRITE" ] } }, { "names": [ "socket" ], "action": "SCMP_ACT_ALLOW", "args": [ { "index": 0, "value": 16, "valueTwo": 0, "op": "SCMP_CMP_NE" } ], "comment": "", "includes": {}, "excludes": { "caps": [ "CAP_AUDIT_WRITE" ] } }, { "names": [ "socket" ], "action": "SCMP_ACT_ALLOW", "args": [ { "index": 2, "value": 9, "valueTwo": 0, "op": "SCMP_CMP_NE" } ], "comment": "", "includes": {}, "excludes": { "caps": [ "CAP_AUDIT_WRITE" ] } }, { "names": [ "socket" ], "action": "SCMP_ACT_ALLOW", "args": null, "comment": "", "includes": { "caps": [ "CAP_AUDIT_WRITE" ] }, "excludes": {} } ] } ```

mheon commented 3 years ago

Your Seccomp profile does include close_range in the list of allowed calls, so Podman and Libseccomp should not be generating profiles that block it. It's not conditional in any way, either - allowed without any checks.

rhatdan commented 3 years ago

You should see the denied seccomp call in /var/log/audit/audit.log

ausearch -m seccomp -i

smac89 commented 3 years ago

> You should see the denied seccomp call in /var/log/audit/audit.log
>
> ausearch -m seccomp -i

@rhatdan

I do:

----
type=SECCOMP msg=audit(2021-04-27 14:04:24.740:425) : auid=chigozirim uid=unknown(10099) gid=unknown(10099) ses=2 subj==unconfined pid=190649 comm=xfce4-terminal exe=/usr/bin/xfce4-terminal sig=SIG0 arch=x86_64 syscall=close_range compat=0 ip=0x7f76336d8a9d code=errno

As I said, this only happens inside the container. On my host machine, the problem never occurs.
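A record like this is a flat list of key=value pairs, so the interesting fields can be pulled out mechanically. The sketch below is a minimal parser written for illustration only; `parse_audit_fields` is a hypothetical helper, and the regular expression assumes values are either double-quoted or free of spaces.

```python
import re


def parse_audit_fields(record):
    """Split an audit record into a dict of its key=value fields."""
    fields = {}
    for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', record):
        # Raw (uninterpreted) records quote some values; drop the quotes.
        fields[key] = value.strip('"')
    return fields


# Shortened version of the interpreted record shown above.
record = (
    "type=SECCOMP pid=190649 comm=xfce4-terminal exe=/usr/bin/xfce4-terminal "
    "sig=SIG0 arch=x86_64 syscall=close_range compat=0 code=errno"
)
fields = parse_audit_fields(record)
print(fields["comm"], fields["syscall"], fields["code"])
# xfce4-terminal close_range errno
```

The fields of interest are `syscall` (which call the filter caught) and `code` (what the filter did; `errno` means the call was rejected with an error code rather than killing the process).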

rhatdan commented 3 years ago

Something is going wrong then: there must be some kind of mismatch between what the OCI runtime understands as close_range and what the kernel does. You see close_range in /usr/share/containers/seccomp.json, correct?

I just wrote a quick patch to podman info to show what seccomp.json file the tool is using.

smac89 commented 3 years ago

> You see close_range in /usr/share/containers/seccomp.json correct?

Indeed I do

➜ grep -C4 'close_range' /usr/share/containers/seccomp.json
                "clock_nanosleep",
                "clock_nanosleep_time64",
                "clone",
                "close",
                "close_range",
                "connect",
                "copy_file_range",
                "creat",
                "dup",

rhatdan commented 3 years ago

Are you using runc or crun?

@giuseppe ideas?

smac89 commented 3 years ago

I am using crun. I can switch back to runc and test it. Let me do that.

The same issue occurs with runc.

smac89 commented 3 years ago

Also when I switch to runc, the error is not detected by auditd (i.e. I don't see it in the logs), but when I strace the command, I see that it still ends at close_range:

[pid   706] close_range(0, -1, CLOSE_RANGE_CLOEXEC) = -1 EPERM (Operation not permitted)

giuseppe commented 3 years ago

> @giuseppe ideas?

close_range is used by crun.

This is again the same EPERM vs. ENOSYS issue we already faced a few months ago.

I think it is time we switch to ENOSYS by default. The only issue, AFAIK, is that runc doesn't support it yet (https://github.com/opencontainers/runtime-spec/pull/1087).
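To illustrate why the error code matters, here is a sketch of the pattern a caller might follow: try close_range, fall back to closing descriptors one by one when the kernel reports ENOSYS ("syscall does not exist"), and treat any other error as fatal. This is an assumed, simplified caller, not code from crun or any specific program; `close_fds_from` and `blocked_with` are hypothetical names, and the syscall is simulated by a callable so the example runs anywhere.

```python
import errno
import os


def close_fds_from(first_fd, close_range, close_one, open_fds):
    """Close every descriptor >= first_fd, falling back only on ENOSYS."""
    try:
        close_range(first_fd, 0xFFFFFFFF, 0)
        return "close_range"
    except OSError as exc:
        if exc.errno != errno.ENOSYS:
            # EPERM looks like a genuine failure, so this caller gives up.
            raise
    # ENOSYS: behave as on an older kernel and close descriptors one by one.
    for fd in open_fds:
        if fd >= first_fd:
            close_one(fd)
    return "fallback"


def blocked_with(code):
    """Build a fake close_range that always fails with the given errno."""
    def fake_close_range(first, last, flags):
        raise OSError(code, os.strerror(code))
    return fake_close_range


closed = []
print(close_fds_from(3, blocked_with(errno.ENOSYS), closed.append, [0, 1, 2, 3, 4]))
# fallback
print(closed)
# [3, 4]

try:
    close_fds_from(3, blocked_with(errno.EPERM), closed.append, [0, 1, 2, 3, 4])
except OSError as exc:
    print("gave up:", errno.errorcode[exc.errno])
# gave up: EPERM
```

With ENOSYS the caller degrades gracefully; with EPERM it has no way to tell a seccomp filter that does not know the syscall from a real permission problem.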

CC @kolyshkin

paravz commented 3 years ago

Seeing same issue on F33, starting container with --security-opt=seccomp=unconfined solves it.


$  grep -C4 'close_range' /usr/share/containers/seccomp.json
                                "clock_nanosleep",
                                "clock_nanosleep_time64",
                                "clone",
                                "close",
                                "close_range",
                                "connect",
                                "copy_file_range",
                                "creat",
                                "dup",

$  rpm -qf /usr/share/containers/seccomp.json
containers-common-1-10.fc33.noarch

$  rpm -q podman runc crun
podman-3.1.0-3.fc33.x86_64
runc-1.0.0-377.rc93.fc33.x86_64
crun-0.19.1-2.fc33.x86_64

#Edit: seccomp audit message:
audit[1112353]: SECCOMP auid=1000 uid=1000 gid=1000 ses=3 subj=system_u:system_r:container_init_t:s0:c344,c914 pid=1112353 comm="xfce4-terminal" exe="/usr/bin/xfce4-terminal" sig=0 arch=c000003e syscall=436 compat=0 ip=0x7f6184e4f15d code=0x50000

paravz commented 3 years ago

@smac89 love your bug report, so easy to reproduce!

giuseppe commented 3 years ago

We also need an updated libseccomp that knows about close_range, and apparently that support is not present even upstream at the moment.

rhatdan commented 3 years ago

@giuseppe Did you open a PR with libseccomp to add this?

giuseppe commented 3 years ago

I think at this point it is easier to fix it for good in our default seccomp profile, now that runc rc95 is out with the feature we need. Also, libseccomp uses some scripts to read all the syscalls from the kernel sources, so it is not necessary to update it manually.
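The interaction between the profile and an outdated libseccomp can be modelled in a few lines. This is a deliberately simplified toy model, not how crun, runc, or libseccomp are implemented. It assumes that the runtime silently skips syscall names its libseccomp build does not recognise, that the default action is SCMP_ACT_ERRNO with EPERM unless overridden, and that the override is expressed through the OCI runtime-spec field `defaultErrnoRet`. `filter_result` is a hypothetical function.

```python
import errno


def filter_result(profile, syscall, known_to_libseccomp):
    """Return "allow" or the errno name the modelled filter would hand back."""
    for rule in profile.get("syscalls", []):
        # Assumption: names unknown to libseccomp never make it into the filter.
        if syscall in rule.get("names", []) and syscall in known_to_libseccomp:
            if rule.get("action") == "SCMP_ACT_ALLOW":
                return "allow"
    # No rule matched, so the default action decides (assumed SCMP_ACT_ERRNO).
    code = profile.get("defaultErrnoRet", errno.EPERM)
    return errno.errorcode[code]


profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["close", "close_range"], "action": "SCMP_ACT_ALLOW"}],
}
old_lib = {"close"}                 # libseccomp without close_range
new_lib = {"close", "close_range"}  # libseccomp that knows close_range

print(filter_result(profile, "close_range", new_lib))  # allow
print(filter_result(profile, "close_range", old_lib))  # EPERM

# Same profile with the default errno switched to ENOSYS.
patched = dict(profile, defaultErrnoRet=errno.ENOSYS)
print(filter_result(patched, "close_range", old_lib))  # ENOSYS
```

In this model the allow-list entry has no effect when the library is too old, which matches the symptom in this issue: the profile lists close_range, yet the call still fails with EPERM.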

rhatdan commented 3 years ago

OK, what are our next steps then? Do we need a new PR to Podman? To containers-common?

giuseppe commented 3 years ago

PR opened here: https://github.com/containers/common/pull/573

giuseppe commented 3 years ago

this is fixed in c/common

rmsc commented 3 years ago

I'm still facing this same issue with containers/common-0.40.1:

$ pacman -Ss containers-common
community/containers-common 0.40.1-2 [installed]
    Configuration files and manpages for containers
$ podman run --rm -it a83749b0c3fdecb23737bcbc591262cbd8fc91f517b5d61106273d1965658320 /app/walk.c
/app/walk.c opened as FD 3
/proc/self/fd/0 ==> /dev/pts/0
/proc/self/fd/1 ==> /dev/pts/0
/proc/self/fd/2 ==> /dev/pts/0
/proc/self/fd/3 ==> /app/walk.c
/proc/self/fd/4 ==> /proc/1/fd
========= About to call close_range() =======
close_range: Operation not permitted

rmsc commented 3 years ago

In my case it seems that a stale config file was probably to blame. Removing and reinstalling the files in /etc/containers fixed this for me. Sorry for the noise.

EDIT: the problem is actually still here.

rmsc commented 3 years ago

I just triple checked, and I'm now in a very weird situation:

I'm now hitting another error (Error: capset: Operation not permitted: OCI permission denied), but that is in a podman-in-podman situation, and it is easy to work around for now with --cap-drop all.

erikarvstedt commented 3 years ago

Support for close_range was only recently added to libseccomp: https://github.com/seccomp/libseccomp/commit/ac849e7960547d418009a783da654d5917dbfe2d

idleroamer commented 3 years ago

~~In the dunfell Yocto build, even on podman 3.4.2 with https://github.com/seccomp/libseccomp/commit/ac849e7960547d418009a783da654d5917dbfe2d, I still observe the same defect. Error: OCI runtime error: invalid seccomp syscall 'close_range'~~ Never mind: updating crun from 0.10 to 0.19 fixed the issue.