containers / youki

A container runtime written in Rust
https://containers.github.io/youki/
Apache License 2.0

`runc` differences given the same `config.json` #2756

Open jeromegn opened 2 months ago

jeromegn commented 2 months ago

I've been troubleshooting some issues exec'ing into a running container. When I tried the same thing with runc for comparison, it worked as I expected.

Here's the setup procedure I have for my test:

$ mkdir mycontainer

$ cd mycontainer

$ mkdir rootfs

$ docker export $(docker create debian:bookworm-slim) | tar -C rootfs -xvf -
Spec config.json:

```json
{
  "ociVersion": "1.0.2-dev",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": [ "sleep", "100000" ],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm"
    ],
    "cwd": "/",
    "capabilities": {
      "bounding": [
        "CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER", "CAP_FSETID",
        "CAP_KILL", "CAP_MKNOD", "CAP_NET_BIND_SERVICE", "CAP_NET_RAW", "CAP_SETFCAP",
        "CAP_SETGID", "CAP_SETPCAP", "CAP_SETUID", "CAP_SYS_ADMIN", "CAP_SYS_CHROOT"
      ],
      "effective": [
        "CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER", "CAP_FSETID",
        "CAP_KILL", "CAP_MKNOD", "CAP_NET_BIND_SERVICE", "CAP_NET_RAW", "CAP_SETFCAP",
        "CAP_SETGID", "CAP_SETPCAP", "CAP_SETUID", "CAP_SYS_ADMIN", "CAP_SYS_CHROOT"
      ],
      "permitted": [
        "CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER", "CAP_FSETID",
        "CAP_KILL", "CAP_MKNOD", "CAP_NET_BIND_SERVICE", "CAP_NET_RAW", "CAP_SETFCAP",
        "CAP_SETGID", "CAP_SETPCAP", "CAP_SETUID", "CAP_SYS_ADMIN", "CAP_SYS_CHROOT"
      ],
      "ambient": [
        "CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER", "CAP_FSETID",
        "CAP_KILL", "CAP_MKNOD", "CAP_NET_BIND_SERVICE", "CAP_NET_RAW", "CAP_SETFCAP",
        "CAP_SETGID", "CAP_SETPCAP", "CAP_SETUID", "CAP_SYS_ADMIN", "CAP_SYS_CHROOT"
      ]
    },
    "rlimits": [
      { "type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024 }
    ],
    "noNewPrivileges": true
  },
  "root": { "path": "rootfs", "readonly": false },
  "hostname": "runc",
  "mounts": [
    { "destination": "/proc", "type": "proc", "source": "proc" },
    {
      "destination": "/dev", "type": "tmpfs", "source": "tmpfs",
      "options": [ "nosuid", "strictatime", "mode=755", "size=65536k" ]
    },
    {
      "destination": "/dev/pts", "type": "devpts", "source": "devpts",
      "options": [ "nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5" ]
    },
    {
      "destination": "/dev/shm", "type": "tmpfs", "source": "shm",
      "options": [ "nosuid", "noexec", "nodev", "mode=1777", "size=65536k" ]
    },
    {
      "destination": "/dev/mqueue", "type": "mqueue", "source": "mqueue",
      "options": [ "nosuid", "noexec", "nodev" ]
    },
    {
      "destination": "/sys", "source": "/sys",
      "options": [ "rbind", "nosuid", "noexec", "nodev", "ro" ]
    },
    {
      "destination": "/sys/fs/cgroup", "type": "cgroup", "source": "cgroup",
      "options": [ "nosuid", "noexec", "nodev", "relatime", "ro" ]
    }
  ],
  "linux": {
    "resources": {
      "devices": [ { "allow": false, "access": "rwm" } ]
    },
    "namespaces": [
      { "type": "pid" },
      { "type": "ipc" },
      { "type": "uts" },
      { "type": "mount" },
      { "type": "cgroup" }
    ],
    "maskedPaths": [
      "/proc/acpi",
      "/proc/asound",
      "/proc/kcore",
      "/proc/keys",
      "/proc/latency_stats",
      "/proc/timer_list",
      "/proc/timer_stats",
      "/proc/sched_debug",
      "/sys/firmware",
      "/proc/scsi"
    ],
    "readonlyPaths": [
      "/proc/bus",
      "/proc/fs",
      "/proc/irq",
      "/proc/sys",
      "/proc/sysrq-trigger"
    ]
  }
}
```

Running with runc:

```
$ sudo runc --debug create runctest
DEBU[0000] nsexec[458717]: => nsexec container setup
DEBU[0000] nsexec-0[458717]: ~> nsexec stage-0
DEBU[0000] nsexec-0[458717]: spawn stage-1
DEBU[0000] nsexec-0[458717]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[458720]: ~> nsexec stage-1
DEBU[0000] nsexec-1[458720]: unshare remaining namespaces (except cgroupns)
DEBU[0000] nsexec-1[458720]: spawn stage-2
DEBU[0000] nsexec-1[458720]: request stage-0 to forward stage-2 pid (458721)
DEBU[0000] nsexec-0[458717]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[458717]: forward stage-1 (458720) and stage-2 (458721) pids to runc
DEBU[0000] nsexec-1[458720]: signal completion to stage-0
DEBU[0000] nsexec-1[458720]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-0[458717]: stage-1 complete
DEBU[0000] nsexec-0[458717]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[458717]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[458717]: signalling stage-2 to run
DEBU[0000] nsexec-2[1]: unshare cgroup namespace
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] nsexec-0[458717]: stage-2 complete
DEBU[0000] nsexec-0[458717]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[458717]: <~ nsexec stage-0
DEBU[0000] child process in init()
DEBU[0000] init: closing the pipe to signal completion
$ sudo runc --debug start runctest
$ sudo runc --debug exec -t runctest /bin/bash
DEBU[0000] nsexec[458594]: => nsexec container setup
DEBU[0000] nsexec[458594]: set process as non-dumpable
DEBU[0000] nsexec-0[458594]: ~> nsexec stage-0
DEBU[0000] nsexec-0[458594]: spawn stage-1
DEBU[0000] nsexec-0[458594]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[458597]: ~> nsexec stage-1
DEBU[0000] nsexec-1[458597]: setns(0x8000000) into ipc namespace (with path /proc/458113/ns/ipc)
DEBU[0000] nsexec-1[458597]: setns(0x4000000) into uts namespace (with path /proc/458113/ns/uts)
DEBU[0000] nsexec-1[458597]: setns(0x20000000) into pid namespace (with path /proc/458113/ns/pid)
DEBU[0000] nsexec-1[458597]: setns(0x20000) into mnt namespace (with path /proc/458113/ns/mnt)
DEBU[0000] nsexec-1[458597]: setns(0x2000000) into cgroup namespace (with path /proc/458113/ns/cgroup)
DEBU[0000] nsexec-1[458597]: unshare remaining namespaces (except cgroupns)
DEBU[0000] nsexec-1[458597]: spawn stage-2
DEBU[0000] nsexec-1[458597]: request stage-0 to forward stage-2 pid (458598)
DEBU[0000] nsexec-0[458594]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[458594]: forward stage-1 (458597) and stage-2 (458598) pids to runc
DEBU[0000] nsexec-1[458597]: signal completion to stage-0
DEBU[0000] nsexec-1[458597]: <~ nsexec stage-1
DEBU[0000] nsexec-0[458594]: stage-1 complete
DEBU[0000] nsexec-0[458594]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[458594]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[458594]: signalling stage-2 to run
DEBU[0000] nsexec-2[17]: ~> nsexec stage-2
DEBU[0000] nsexec-2[17]: signal completion to stage-0
DEBU[0000] nsexec-2[17]: <= nsexec container setup
DEBU[0000] nsexec-2[17]: booting up go runtime ...
DEBU[0000] nsexec-0[458594]: stage-2 complete
DEBU[0000] nsexec-0[458594]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[458594]: <~ nsexec stage-0
DEBU[0000] child process in init()
DEBU[0000] setns_init: about to exec
DEBU[0000]signals.go:102 main.(*signalHandler).forward() sending signal to process urgent I/O condition
root@runc:/# apt update
Ign:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Ign:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Ign:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Ign:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Ign:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Ign:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Ign:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Ign:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
0% [Working]^C
root@runc:/# exit
```

Running with youki:

```
$ sudo ../youki/target/debug/youki create -b mycontainer youkitest
DEBUG youki: started by user 0 with ArgsOs { inner: ["../youki/target/debug/youki", "create", "-b", "mycontainer", "youkitest"] }
DEBUG libcontainer::user_ns: this container does NOT create a new user namespace
DEBUG libcontainer::container::init_builder: container directory will be "/run/youki/youkitest"
DEBUG libcontainer::container::container: Save container status: Container { state: State { oci_version: "v1.0.2", id: "youkitest", status: Creating, pid: None, bundle: "/home/jerome/src/github.com/superfly/init/mycontainer", annotations: Some({}), created: None, creator: None, use_systemd: false, clean_up_intel_rdt_subdirectory: None }, root: "/run/youki/youkitest" } in "/run/youki/youkitest"
DEBUG libcontainer::user_ns: this container does NOT create a new user namespace
DEBUG libcontainer::notify_socket: create notify listener socket_path="/run/youki/youkitest/notify.sock"
DEBUG libcontainer::notify_socket: the cwd to create the notify socket cwd="/run/youki/youkitest"
INFO libcgroups::common: cgroup manager V2 will be used
WARN libcgroups::v2::util: Controller rdma is not yet implemented.
WARN libcgroups::v2::util: Controller misc is not yet implemented.
DEBUG libcgroups::v2::hugetlb: Apply hugetlb cgroup v2 config
DEBUG libcgroups::v2::io: Apply io cgroup v2 config
DEBUG libcgroups::v2::pids: Apply pids cgroup v2 config
WARN libcgroups::v2::util: Controller rdma is not yet implemented.
WARN libcgroups::v2::util: Controller misc is not yet implemented.
DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Pid, path: None }
DEBUG libcontainer::process::channel: sending init pid (Pid(457403))
DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Uts, path: None }
DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Ipc, path: None }
DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Cgroup, path: None }
DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Mount, path: None }
DEBUG libcontainer::rootfs::rootfs: prepare rootfs rootfs="/home/jerome/src/github.com/superfly/init/mycontainer/rootfs"
DEBUG libcontainer::rootfs::rootfs: mount root fs "/home/jerome/src/github.com/superfly/init/mycontainer/rootfs"
DEBUG libcontainer::rootfs::mount: mounting Mount { destination: "/proc", typ: Some("proc"), source: Some("proc"), options: None }
DEBUG libcontainer::rootfs::mount: mounting with options: MountOptionConfig { flags: MsFlags(0x0), data: "", rec_attr: None }
DEBUG libcontainer::rootfs::mount: mounting Mount { destination: "/dev", typ: Some("tmpfs"), source: Some("tmpfs"), options: Some(["nosuid", "strictatime", "mode=755", "size=65536k"]) }
DEBUG libcontainer::rootfs::mount: mounting with options: MountOptionConfig { flags: MsFlags(MS_NOSUID), data: "mode=755,size=65536k", rec_attr: None }
DEBUG libcontainer::rootfs::mount: mounting Mount { destination: "/dev/pts", typ: Some("devpts"), source: Some("devpts"), options: Some(["nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5"]) }
DEBUG libcontainer::rootfs::mount: mounting with options: MountOptionConfig { flags: MsFlags(MS_NOSUID | MS_NOEXEC), data: "newinstance,ptmxmode=0666,mode=0620,gid=5", rec_attr: None }
DEBUG libcontainer::rootfs::mount: mounting Mount { destination: "/dev/shm", typ: Some("tmpfs"), source: Some("shm"), options: Some(["nosuid", "noexec", "nodev", "mode=1777", "size=65536k"]) }
DEBUG libcontainer::rootfs::mount: mounting with options: MountOptionConfig { flags: MsFlags(MS_NOSUID | MS_NODEV | MS_NOEXEC), data: "mode=1777,size=65536k", rec_attr: None }
DEBUG libcontainer::rootfs::mount: mounting Mount { destination: "/dev/mqueue", typ: Some("mqueue"), source: Some("mqueue"), options: Some(["nosuid", "noexec", "nodev"]) }
DEBUG libcontainer::rootfs::mount: mounting with options: MountOptionConfig { flags: MsFlags(MS_NOSUID | MS_NODEV | MS_NOEXEC), data: "", rec_attr: None }
DEBUG libcontainer::rootfs::mount: mounting Mount { destination: "/sys", typ: None, source: Some("/sys"), options: Some(["rbind", "nosuid", "noexec", "nodev", "ro"]) }
DEBUG libcontainer::rootfs::mount: mounting with options: MountOptionConfig { flags: MsFlags(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC | MS_BIND | MS_REC), data: "", rec_attr: None }
DEBUG libcontainer::rootfs::mount: mounting Mount { destination: "/sys/fs/cgroup", typ: Some("cgroup"), source: Some("cgroup"), options: Some(["nosuid", "noexec", "nodev", "relatime", "ro"]) }
DEBUG libcontainer::rootfs::mount: Mounting cgroup v2 filesystem
DEBUG libcontainer::rootfs::mount: Mount { destination: "/sys/fs/cgroup", typ: Some("cgroup2"), source: Some("cgroup"), options: Some([]) }
DEBUG libcontainer::rootfs::mount: mounting with options: MountOptionConfig { flags: MsFlags(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC), data: "", rec_attr: None }
ERROR libcontainer::rootfs::mount: mount of "/sys/fs/cgroup" failed. EBUSY: Device or resource busy
DEBUG libcontainer::rootfs::mount: Mount { destination: "/sys/fs/cgroup", typ: Some("bind"), source: Some("/sys/fs/cgroup/"), options: Some([]) }
DEBUG libcontainer::rootfs::mount: mounting with options: MountOptionConfig { flags: MsFlags(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC | MS_BIND), data: "", rec_attr: None }
DEBUG libcontainer::process::container_init_process: readonly path "/proc/bus" mounted
DEBUG libcontainer::process::container_init_process: readonly path "/proc/fs" mounted
DEBUG libcontainer::process::container_init_process: readonly path "/proc/irq" mounted
DEBUG libcontainer::process::container_init_process: readonly path "/proc/sys" mounted
DEBUG libcontainer::process::container_init_process: readonly path "/proc/sysrq-trigger" mounted
DEBUG libcontainer::capabilities: reset all caps
DEBUG libcontainer::capabilities: dropping bounding capabilities to Some({Setpcap, Mknod, SysChroot, Setuid, AuditWrite, Setgid, Setfcap, SysAdmin, NetRaw, Chown, Fsetid, DacOverride, NetBindService, Kill, Fowner})
ERROR libcontainer::capabilities: failed to set ambient capabilities: failed to set capabilities: caps error: PR_CAP_AMBIENT_RAISE failure: Operation not permitted (os error 1)
DEBUG libcontainer::workload::default: found executable in executor executable="/usr/bin/sleep"
DEBUG libcontainer::process::container_main_process: init pid is Pid(457403)
DEBUG libcontainer::container::container: Save container status: Container { state: State { oci_version: "v1.0.2", id: "youkitest", status: Created, pid: Some(457403), bundle: "/home/jerome/src/github.com/superfly/init/mycontainer", annotations: None, created: Some(2024-04-11T00:11:46.035245621Z), creator: Some(0), use_systemd: false, clean_up_intel_rdt_subdirectory: Some(false) }, root: "/run/youki/youkitest" } in "/run/youki/youkitest"
$ sudo ../youki/target/debug/youki start youkitest
DEBUG youki: started by user 0 with ArgsOs { inner: ["../youki/target/debug/youki", "start", "youkitest"] }
DEBUG libcontainer::notify_socket: notify container start
DEBUG libcontainer::notify_socket: notify finished
DEBUG libcontainer::container::container: Save container status: Container { state: State { oci_version: "v1.0.2", id: "youkitest", status: Running, pid: Some(457403), bundle: "/home/jerome/src/github.com/superfly/init/mycontainer", annotations: None, created: Some(2024-04-11T00:11:46.035245621Z), creator: Some(0), use_systemd: false, clean_up_intel_rdt_subdirectory: Some(false) }, root: "/run/youki/youkitest" } in "/run/youki/youkitest"
$ sudo ../youki/target/debug/youki exec -t youkitest /bin/bash
DEBUG youki: started by user 0 with ArgsOs { inner: ["../youki/target/debug/youki", "exec", "-t", "youkitest", "/bin/bash"] }
DEBUG libcontainer::user_ns: this container does NOT create a new user namespace
DEBUG libcontainer::user_ns: this container does NOT create a new user namespace
DEBUG libcontainer::notify_socket: create notify listener socket_path="/run/youki/youkitest/tenant-notify-5929bea.sock"
DEBUG libcontainer::notify_socket: the cwd to create the notify socket cwd="/run/youki/youkitest"
INFO libcgroups::common: cgroup manager V2 will be used
WARN libcgroups::v2::util: Controller rdma is not yet implemented.
WARN libcgroups::v2::util: Controller misc is not yet implemented.
DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Pid, path: Some("/proc/457403/ns/pid") }
DEBUG libcontainer::process::channel: sending init pid (Pid(457525))
DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Uts, path: Some("/proc/457403/ns/uts") }
DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Ipc, path: Some("/proc/457403/ns/ipc") }
DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Network, path: Some("/proc/457403/ns/net") }
DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Cgroup, path: Some("/proc/457403/ns/cgroup") }
DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Mount, path: Some("/proc/457403/ns/mnt") }
DEBUG libcontainer::process::container_init_process: readonly path "/proc/bus" mounted
DEBUG libcontainer::process::container_init_process: readonly path "/proc/fs" mounted
DEBUG libcontainer::process::container_init_process: readonly path "/proc/irq" mounted
DEBUG libcontainer::process::container_init_process: readonly path "/proc/sys" mounted
DEBUG libcontainer::process::container_init_process: readonly path "/proc/sysrq-trigger" mounted
DEBUG libcontainer::capabilities: reset all caps
DEBUG libcontainer::capabilities: dropping bounding capabilities to Some({NetBindService, AuditWrite, Kill})
DEBUG libcontainer::process::container_main_process: init pid is Pid(457525)
DEBUG libcontainer::notify_socket: notify container start
DEBUG libcontainer::notify_socket: notify finished
DEBUG libcontainer::notify_socket: received: start container
DEBUG libcontainer::workload::default: executing workload with default handler
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
bash: /root/.bashrc: Permission denied
root@runc:/# apt update
Reading package lists... Done
E: List directory /var/lib/apt/lists/partial is missing. - Acquire (13: Permission denied)
```

This is similar to the issue I was troubleshooting, but different:

root@runc:/# apt update
Reading package lists... Done
E: List directory /var/lib/apt/lists/partial is missing. - Acquire (13: Permission denied)

Any idea what might be causing this? I would expect both runtimes to work more or less the same (at least it shouldn't error). I'm going to dig deeper into the differences.

One thing I noticed from the logs is that youki tries to set up a network namespace even though none is defined in the config.json, but only when using youki exec:

DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Network, path: Some("/proc/457403/ns/net") }

jeromegn commented 2 months ago

One thing I noticed from the logs is that youki tries to set up a network namespace even though none is defined in the config.json, but only when using youki exec:

DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Network, path: Some("/proc/457403/ns/net") }

I've fixed this locally. The TenantBuilder will try to use all namespaces for the process, regardless of what namespaces were configured in the spec.
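The local fix itself is not shown in this thread. As a rough illustration only, the idea of restricting exec to the namespaces declared in the spec could be sketched as below. `NsType` and `namespaces_to_join` are hypothetical names invented for this sketch; they are not youki's or libcontainer's API.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a namespace type; not libcontainer's real enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum NsType {
    Pid,
    Ipc,
    Uts,
    Mount,
    Cgroup,
    Network,
    User,
}

/// Keep only the candidate namespaces that the container spec declares,
/// preserving the order of `candidates`.
fn namespaces_to_join(candidates: &[NsType], declared: &[NsType]) -> Vec<NsType> {
    let declared: HashSet<NsType> = declared.iter().copied().collect();
    candidates
        .iter()
        .copied()
        .filter(|ns| declared.contains(ns))
        .collect()
}

fn main() {
    // The namespaces listed under "linux.namespaces" in the config.json above.
    let declared = [
        NsType::Pid,
        NsType::Ipc,
        NsType::Uts,
        NsType::Mount,
        NsType::Cgroup,
    ];
    // Every namespace type an exec path might otherwise try to join.
    let all = [
        NsType::Pid,
        NsType::Uts,
        NsType::Ipc,
        NsType::Network,
        NsType::Cgroup,
        NsType::Mount,
        NsType::User,
    ];
    // Network and User are filtered out because the spec does not declare them.
    println!("{:?}", namespaces_to_join(&all, &declared));
}
```

Whether this kind of filtering is actually needed is questioned further down in the thread.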

utam0k commented 2 months ago

What you said is here, isn't it? https://github.com/containers/youki/blob/4500c8e6856bcfec2fca20818c23c50bae26d007/crates/libcontainer/src/container/tenant_builder.rs#L328-L329

Would you be able to contribute your fix?

yihuaf commented 3 days ago

DEBUG libcontainer::namespaces: unshare or setns: LinuxNamespace { typ: Network, path: Some("/proc/457403/ns/net") }

This is the correct behavior. The log is a little misleading here, given that it comes from the youki exec path. As the log from youki create indicates, the namespaces are configured correctly based on the config.json. When youki exec is called, we try to join the namespaces of the container init process created by youki create. /proc/<pid>/ns/net points to the network namespace of the container init process. That is the root network namespace, since the spec did not ask for a new one.

In other words, during youki exec, we look directly at which namespaces the container init process from youki create is in. Each of those namespaces may be newly created or inherited from the runtime.
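One way to check this on a running system is to compare the `/proc/<pid>/ns/*` symlinks of two processes: two processes are in the same namespace when the corresponding links resolve to the same target. The following is a minimal, Linux-only sketch using only the standard library. The helper names `ns_link` and `shared_namespaces` are made up for illustration, and reading another process's links usually requires root.

```rust
use std::fs;
use std::path::PathBuf;

// Namespace entries commonly found under /proc/<pid>/ns on Linux.
const NS_KINDS: [&str; 7] = ["cgroup", "ipc", "mnt", "net", "pid", "user", "uts"];

/// Read the namespace symlink for `pid`, e.g. "net:[4026531840]".
/// Returns None if the link cannot be read (missing entry, no permission).
fn ns_link(pid: u32, kind: &str) -> Option<PathBuf> {
    fs::read_link(format!("/proc/{pid}/ns/{kind}")).ok()
}

/// List the namespace kinds for which both processes resolve to the same link.
/// Kinds that cannot be read for either process are left out.
fn shared_namespaces(a: u32, b: u32) -> Vec<&'static str> {
    NS_KINDS
        .iter()
        .copied()
        .filter(|kind| match (ns_link(a, kind), ns_link(b, kind)) {
            (Some(x), Some(y)) => x == y,
            _ => false,
        })
        .collect()
}

fn main() {
    let me = std::process::id();
    // Compare against PID 1 by default; pass a container init pid
    // (457403 in the logs above) as the first argument to compare with it.
    let other: u32 = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(1);
    let shared = shared_namespaces(me, other);
    println!("namespaces shared with pid {other}: {shared:?}");
}
```

Run as root on the host against the container init pid, `net` appearing in the output would be consistent with the explanation above that the container stays in the runtime's network namespace when the spec declares none.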

yihuaf commented 3 days ago

DEBUG libcontainer::workload::default: executing workload with default handler
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
bash: /root/.bashrc: Permission denied
root@runc:/# apt update
Reading package lists... Done
E: List directory /var/lib/apt/lists/partial is missing. - Acquire (13: Permission denied)

These bash errors look like they should be investigated further.