Open sipsma opened 5 years ago
FC's `Running` state doesn't guarantee that the Agent is initialized and ready to serve requests.
So option 3 adds 2 kinds of retry: wait for the FC API to be available / state=`Running`, and wait for the Agent to be ready and listening on vsock.
@mxpv Right, that's what I was suggesting, 2 separate kinds of retries: one for Firecracker to even start up the VM (which includes copying the rootfs image, I'd presume) and then a separate one for the Agent to become available (apologies if I wasn't clear in the OP).
I would not want to jump to that if the first 2 options are enough as it introduces some complication, but just wanted to make a note of it as a possibility in case it ends up making sense.
What are the benefits of adding one more retry mechanism? I would rather vote for option 1 :)
It would require more investigation and very well may not end up being beneficial, but the potential benefit of 3 I was imagining would be if we want to be more "lenient" with the VM startup time (to accommodate large VM images and/or slower hardware) and less so with waiting for the agent to become available after the VM was started (which should in theory take a lot less time than the VM's startup time in most situations).
I agree we at minimum should do option 1 and then only consider option 2/3 as layers on top of option 1 in case we find they would actually provide additional benefit.
The reason I vote for option 1 is simplicity. It's just dialing with retries. Option 3 adds extra complexity while the benefits are a bit blurry to me. But I agree that option 1 is far from perfect and needs to be improved somehow.
Option 4: make agent responsible for initial dialing to runtime (e.g. Runtime runs a listener, runs a VM and waits for Agent to connect, and after accepting a request Runtime knows that Agent is up and ready). But this raises a lot of questions (including security concerns).
I guess the idea behind Option 3 is that it doesn't make sense to wait for the agent to connect if the VM hasn't even started yet. If the VM fails to start within a certain duration, it would be better to explicitly return an error saying that happened as opposed to saying that the Agent failed to connect within a certain time period.
@sipsma Does your work on https://github.com/firecracker-microvm/firecracker-containerd/pull/266 resolve this?
@samuelkarp Not really. I think it would still be a worthy enhancement to make the various vsock dialing timeouts configurable. There are various tradeoffs between cpu usage and latency that users could in theory want to optimize for, or they may need to use much larger timeout values for certain slower platforms.
Our current vsock dialer implementation does exponential backoff from 100ms to 1.6s before giving up. I encountered a situation in the real world in which this timeout was too short and resulted in `ctr run` failing unnecessarily. The particular situation was attaching strace to firecracker as it started (in order to debug a separate issue), which understandably slowed down the VM startup time significantly. I could see in the strace output that firecracker was still in the midst of copying the VM's rootfs when firecracker-containerd gave up dialing the vsock. When I increased the (currently hardcoded) timeout to try one more time (6 retries instead of 5), `ctr run` completed successfully.

While I encountered this when attaching strace, it seems plausible the timeout could be hit in other real-world situations, such as hardware slower than an i3.metal and/or large VM rootfs images.
There are a few possible fixes here (not mutually exclusive):
Wait for Firecracker to report the VM as `Running`, and then have a separate waiting period for trying to connect to the agent?