Open sipsma opened 5 years ago
FC's `Running` state doesn't guarantee that the Agent is initialized and ready to serve requests.
So option 3 adds 2 kinds of retry: wait for the FC API to be available / state=`Running`, and wait for the Agent to be ready and listening on vsock.
@mxpv Right, that's what I was suggesting, 2 separate kinds of retries: one for Firecracker to even start up the VM (which includes copying the rootfs image, I'd presume) and then a separate one for the Agent to become available (apologies if I wasn't clear in the OP).
I would not want to jump to that if the first 2 options are enough as it introduces some complication, but just wanted to make a note of it as a possibility in case it ends up making sense.
What are the benefits of adding one more retry mechanism? I would rather vote for option 1 :)
It would require more investigation and very well may not end up being beneficial, but the potential benefit of 3 I was imagining would be if we want to be more "lenient" with the VM startup time (to accommodate large VM images and/or slower hardware) and less so with waiting for the agent to become available after the VM was started (which should in theory take a lot less time than the VM's startup time in most situations).
I agree we at minimum should do option 1 and then only consider option 2/3 as layers on top of option 1 in case we find they would actually provide additional benefit.
The reason I vote for option 1 is simplicity. It's just dialing with retries. Option 3 adds extra complexity while the benefits are a bit blurry to me. But I agree that option 1 is far from perfect and needs to be improved somehow.
Option 4: make agent responsible for initial dialing to runtime (e.g. Runtime runs a listener, runs a VM and waits for Agent to connect, and after accepting a request Runtime knows that Agent is up and ready). But this raises a lot of questions (including security concerns).
I guess the idea behind Option 3 is that it doesn't make sense to wait for the agent to connect if the VM hasn't even started yet. If the VM fails to start within a certain duration, it would be better to explicitly return an error saying that happened as opposed to saying that the Agent failed to connect within a certain time period.
@sipsma Does your work on https://github.com/firecracker-microvm/firecracker-containerd/pull/266 resolve this?
@samuelkarp Not really. I think it would still be a worthy enhancement to make the various vsock dialing timeouts configurable. There are various tradeoffs between cpu usage and latency that users could in theory want to optimize for, or they may need to use much larger timeout values for certain slower platforms.
Our current vsock dialer implementation does exponential backoff from 100ms to 1.6s before giving up. I encountered a situation in the real world in which this timeout was too short and resulted in `ctr run` failing unnecessarily. The particular situation was attaching strace to firecracker as it started (in order to debug a separate issue), which understandably slowed down the VM startup time significantly. I could see in the strace output that firecracker was still in the midst of copying the VM's rootfs when firecracker-containerd gave up dialing the vsock. When I increased the (currently hardcoded) timeout to try one more time (6 retries instead of 5), `ctr run` completed successfully.

While I encountered this when attaching strace, it seems plausible the timeout could be hit in other real-world situations, such as hardware slower than an i3.metal and/or large VM rootfs images.
There are a few possible fixes here (not mutually exclusive):
Wait for Firecracker to report the VM as `Running`, and then have a separate waiting period for trying to connect to the agent?