Req: make `wasi_socket_working` more configurable instead of hard coding retry times

Mossaka commented 4 months ago

I was looking into an issue regarding the slow startup time of runwasi shims, and by inspecting the traces, I found that the wait_socket_working(&address, 5, 200) call always took a second to finish if the address is there but for some reason not able to establish a connection to the ttrpc client correctly.

code ref: https://github.com/containerd/rust-extensions/blob/main/crates/shim/src/synchronous/mod.rs#L471-L476

I would appreciate if you could clarify the motivation behind call to wait_socket_working. Why coulnd't we proceed to remove the socket if it's already in use and then call start_listener immediately? Also, what's the motivation behind the one second wait time, which has a significant impact on the startup time of the shims?

Here is a modified code that doesn't wait for sockets:

#[allow(clippy::let_unit_value)]
let _listener = match start_listener(&address) {
    Ok(l) => l,
    Err(e) => {
        if e.kind() != std::io::ErrorKind::AddrInUse {
            return Err(Error::IoError {
                context: "".to_string(),
                err: e,
            });
        };
        if socket_exist(&address)? {
            remove_socket(&address)?;
        }
        start_listener(&address).map_err(io_error!(e, ""))?
    }
};

Mossaka commented 4 months ago

Does the shim need to handle the deletion of the shim socket address (e.g. /run/containerd/<hash>)? I am not seeing where in the runc-shim delete the socket address.

mxpv commented 4 months ago

Yeah, it looks like it just exhausts all retries (which is 1 second) and then removes socket. IMO we can remove it.

pinging @abel-von and @Burning1020 . Any chance you have more context here?

abel-von commented 4 months ago

I think this is for support of multi containers in one pod, containerd will call shim start multiple times, We should have to return the address directly rather than removing the socket and listen again. I don't know why can't you establish the connection?

Mossaka commented 4 months ago

@abel-von that makes sense, but I think the retry timeout is unusually long - a whole second! What are the reasons for trying 200 times for 5ms each? Can we reduce the number of retries?

abel-von commented 4 months ago

@Mossaka maybe we can make it a configurable value, but I'd like to know why is the connection not established everytime?

Mossaka commented 3 months ago

maybe we can make it a configurable value

I like that, and changed this issue's title to reflect the request.

but I'd like to know why is the connection not established

There is a bug in runwasi that prevents it from deleting the socket address after ttrpc server closes. Thus it hangs over connection to the socket address.

containerd / rust-extensions

Req: make `wasi_socket_working` more configurable instead of hard coding retry times #263