Running complex code after the process is `clone`d in `libcontainer` can lead to UB

jprendes commented 1 year ago

@utam0k calling setgroups / setresgid / setresuid directly would workaround the issue. I am slightly concerned because in normal youki / runc, we soon call execve, which starts a clean process. In the case of runwasi, we don't, we keep running on the same tainted process. From nix::clone and nix::fork, even memory allocation might be unsafe after the clone call (I assume it's platform / variant dependent).

Originally posted by @jprendes in https://github.com/containerd/runwasi/issues/347#issuecomment-1755433335

jprendes commented 1 year ago

Some context:

The code we run in our impl of the libcontainer's Executor is run in a cloned process.

In POSIX, the kind of code that is guaranteed to run safely in the child process after a clone is restricted to async-signal-safe code. Examples of code that are NOT guaranteed to be AS-safe include malloc, printf and pthread_*.

See for example how the snippet in nix::fork uses write(..) in the child process instead of println!(..) exactly because of this.

It seems that in our case with glibc and musl, it works fine, but this is fragile.

jprendes commented 10 months ago

I'm thinking that a possible approach would be to use the async api of containerd/rust-extensions and have the shim be a single threaded process.

This is assuming that the way rust-extensions handles async is single threaded.

Mossaka commented 7 months ago

Hey @jprendes I know you've worked on the asyncify runwasi, do you have any updates on that one? Do you think we need another issue to track the progress of it?

jprendes commented 5 months ago

I can think of two options to workaround this issue:

calling exec and replacing the whole process

That's what normal for-exec does, and what a normal container does. The disadvantage is that we loose any sharing between the shim and the container (e.g., anything pre-initialized by the shim won't seen by the container). I know @cpuguy83 would like to see all containers in a pod sharing the same runtime, and that would become impossible with this approach (admitedly, that's not the direction we are heading anyway with using youki). The advantage is that for windows we would need to do that anyway (IIUC, in windows there's no fork and exec, but rather fork+exec as one single thing), and having a similar behavior in all platforms might be nice.

Running the shim async, and in only one thread

I would eventually like to see that regardless of this issue. The disadvantage is that it's a bit more difficult to enforce, as even some tokio operations might inadvertently spin new threads. Also the async feature in rust-extensions still spins a few threads, and is not supported on Windows. The advantage, is that I would expect lower resource usage by the shim, better integration of the containerd client code, and would fix this issue without preventing sharing resources between shim and container.

cpuguy83 commented 5 months ago

I think exec is probably the way to go for now. I think we can go forward with that and then look at how we might optimize things. e.g. can we send rpc's into that container to do things like handle (container) execs.

jprendes commented 4 months ago

Thinking further about it, we can't exec, as the target binary would be the shim binary, which is not available inside the container. AFAIK, the single threaded process would be the only solution here.

jprendes commented 4 months ago

Thinking further about it, we can't exec, as the target binary would be the shim binary, which is not available inside the container. AFAIK, the single threaded process would be the only solution here.

@jsturtevant , would this be a problem for the Windows implementation you had in mind (where it would crate a new child process)?

cpuguy83 commented 4 months ago

I'm 👎 on using async as a way to work around this. This is a (potential) fix by coincidence and not a proper one.

I've said this a few times, but I think the abstraction we are using from youki is doing more than what we actually need (or perhaps isn't flexible enough) for wasi containers.

utam0k commented 4 months ago

I've said this a few times, but I think the abstraction we are using from youki is doing more than what we actually need (or perhaps isn't flexible enough) for wasi containers.

Yes, I agree with a part of @cpuguy83 in this option. Sorry, I don't have enough knowledge to speak aync way.

When I incorporated youki into runwasi at that time https://github.com/containerd/runwasi/issues/23 it was still difficult to consider the future what would happen. This is because we did not know what would be needed. Now we know what we need in the wasi container more concrete. So we may be able to list how OCI compliant we need to be, i.e., which features we need and which we don't. Then, we may be able to easily build a wasi container directly using functions in a lower layer of libcontainer instead of using the Builder pattern of libcontainer. For a simple example, we can use cgroup and so on. Also, you can use the rootfs functions from youki. If you want to do that I can consider it once and I'd be happy to help you with it.

If you want to be OCI compliant, I think it would be best to go in the direction of extensibility using the Builder pattern. So there is a tradeoff there.

containerd / runwasi

Running complex code after the process is `clone`d in `libcontainer` can lead to UB #357

calling exec and replacing the whole process

Running the shim async, and in only one thread