Why is the shim running on it's own network ns?

Mossaka commented 5 months ago

Ref: https://github.com/containerd/runwasi/blob/71f8df9cc576ea9564b3ea692432e80b454da7e5/crates/containerd-shim-wasm/src/sandbox/shim/cli.rs#L56-L57

It is not clear to me the reason why the shim start function sets up it's own network ns. I checked out both the Go and Rust implementation of runc-shim, they aren't setting the same ns at shim start.

@jsturtevant pointed me to this https://github.com/containers/youki/issues/2473 from youki but it looks like youki was setting up all the namespaces for containers but somehow the network ns was ignored. I am raising the issue here for asking more investigation.

FYI @jprendes

cpuguy83 commented 5 months ago

Before youki we did everything ourselves. It may not be necessary to do this anymore.

jprendes commented 5 months ago

From https://github.com/containerd/runwasi/pull/327#issuecomment-1730071090, removing that setup_namespaces call means that the network ns is not set up at all. At the time I thought that maybe youki wasn't setting up the network ns, but it does. It would be great if anyone has further insight.

jsturtevant commented 5 months ago

We removed it awhile back and broke everything 🤦 https://github.com/containerd/runwasi/pull/364

cpuguy83 commented 5 months ago

youki is only setting the Pid and User namespaces

That doesn't seem right :(

0xE282B0 commented 5 months ago

I removed it once and @jsturtevant needed to add it again here

jprendes commented 5 months ago

youki is only setting the Pid and User namespaces

That doesn't seem right :(

That was me misreading the code. Youki sets Pid and User ns at one point, and the rest on a different place. IIRC, Pid and User ns need to be set up in the intermediate process, while the rest are set up in the init process.

Mossaka commented 5 months ago

Youki sets Pid and User ns at one point, and the rest on a different place.

Okay, if this is true, then the shim doesn't need to set up its own network ns. There must be some root causes that we haven't yet discovered.

jprendes commented 5 months ago

I've added some logging to runwasi and youki when setting namespaces, and got:

level=info msg=">>>>> runwasi > unshare(CloneFlags(CLONE_NEWNET))"
level=info msg="server listen started"
level=info msg="server started"
level=info msg="Shim successfully started, waiting for exit signal..."
level=debug msg="Got new client"
...
level=warn msg="Controller rdma is not yet implemented."
level=warn msg="Controller misc is not yet implemented."
level=debug msg="unshare or setns: LinuxNamespace { typ: Pid, path: None }"
level=info msg=">>>>> youki > unshare(CloneFlags(CLONE_NEWPID))"
level=debug msg="sending init pid (Pid(1883648))"
level=debug msg="unshare or setns: LinuxNamespace { typ: Uts, path: None }"
level=info msg=">>>>> youki > unshare(CloneFlags(CLONE_NEWUTS))"
level=debug msg="unshare or setns: LinuxNamespace { typ: Ipc, path: None }"
level=info msg=">>>>> youki > unshare(CloneFlags(CLONE_NEWIPC))"
level=debug msg="unshare or setns: LinuxNamespace { typ: Network, path: None }"
level=info msg=">>>>> youki > unshare(CloneFlags(CLONE_NEWNET))"
level=debug msg="unshare or setns: LinuxNamespace { typ: Cgroup, path: None }"
level=info msg=">>>>> youki > unshare(CloneFlags(CLONE_NEWCGROUP))"
level=debug msg="unshare or setns: LinuxNamespace { typ: Mount, path: None }"
level=info msg=">>>>> youki > unshare(CloneFlags(CLONE_NEWNS))"
level=debug msg="prepare rootfs rootfs="/var/lib/docker/rootfs/overlayfs/7c52b78c3a22d2a00cb67cdab9fd090241ef9cdf8f47bb0d7f20daba465662a2""
level=debug msg="mount root fs "/var/lib/docker/rootfs/overlayfs/7c52b78c3a22d2a00cb67cdab9fd090241ef9cdf8f47bb0d7f20daba465662a2""
level=debug msg="mounting Mount { destination: "/proc", typ: Some("proc"), source: Some("proc"), options: Some(["nosuid", "noexec", "nodev"]) }"
...

The lines with >>>>> is what I've added. The network namespce is indeed being set twice. By runwasi when the shim starts, and by youki on the init process.

Mossaka commented 4 months ago

It looks like after removing the network ns from shim start, the nginx pod had issues to start: https://github.com/containerd/runwasi/actions/runs/8148869974/job/22272757552?pr=503#step:8:794

yihuaf commented 1 month ago

Pid and User ns need to be set up in the intermediate process

Just a drive by comment. The reason for this is that the pid and user namespace are a bit special and required special treatment. For the pid namespace, one can't simply join the pid namespace. Only the child process can join the new pid namespace, so we have to fork. In youki's case, we do this in the intermediate process to fork the init process.

user namespace is also special. In order to setup user namespace correctly, we need a parent process to setup the uid/gid mapping. This can't be set once inside the user namespace. So we join the user namespace when we fork the intermediate process from the main process. The intermediate process will join the user namespace and the main process will setup the uid/gid mapping.

jsturtevant commented 1 month ago

@yihuaf Thanks for giving some context. Is what we are doing now, correct or do we need to make some changes in runwasi?

containerd / runwasi

Why is the shim running on it's own network ns? #476