Closed sboeuf closed 6 years ago
@sameo @grahamwhaley @jodh-intel @egernst @amshinde
@sboeuf thanks a lot for the detailed explanation, and for finding that out.
From a CRI-O perspective, sending a `kill(TERM|KILL)` to a container and not getting an error back means the container has been properly stopped. So asking for a container status before unmounting would be redundant.
Now, if you look at the OCI/runc kill operation (it is an OCI-specific operation): CRI-O translates the CRI `StopContainer()` method into an OCI `kill(SIGKILL)` operation. There is no `KillContainer()` CRI method. In both cases we are basically being asked to stop the container the signal is directed at. In other words, there is a bijection between `kill(KILL|TERM)` and `StopContainer()`.
So imho the first option makes more sense:
> [...] we should translate a `cc-runtime kill` into `vc.StopContainer` in case the signal is `SIGTERM` or `SIGKILL`.
@sameo here is the CRI-O fix: https://github.com/kubernetes-incubator/cri-o/pull/1326
I have discovered that we have a bug in the way we work with CRI-O. Specifically, it happens when CRI-O tries to stop a running container:
First you need to know that the CRI-O stop command relates to a `kill` command from a runtime perspective, which translates into `cc-runtime kill` in our case.

Then, you also need to know that CRI-O expects to be able to unmount the storage (rootfs) provided at the `cc-runtime create` stage after `cc-runtime kill` has returned. This is logical if we think about the `runc` case, where we can expect to be able to unmount everything after the container process has been terminated. Here is the call1 which calls into call2. Here is the call expecting to unmount.

Now you understand that we have an issue, since we don't stop our container, which means in our case that we don't unmount what has been previously mounted. Take a look at this to understand that we only send a signal when we receive a `cc-runtime kill`, nothing else.

But why is it working then? That's a good question, and here is the answer:
We have a mechanism that checks a container's status and the state of the container process (basically, it checks the shim process). This mechanism is triggered every time someone asks for the status of the container; in our runtime, all our actions ask for this, but it can also be triggered by calling `cc-runtime state`. If the container is marked as RUNNING but the shim process is not there anymore, we actually update the container status by calling into `container.stop` here. This has two effects: it stops the container by calling into the agent and unmounting everything needed, and it updates the container status to STOPPED.

But you wonder who is calling into this, so that we get the container unmounted by luck. Well, CRI-O has a watcher running in a separate goroutine which monitors the container process. When this container process returns (after the kill has been sent by CRI-O), the watcher triggers a call to `cc-runtime state` here, and this is how we eventually get the container STOPPED and properly unmounted, almost at the same time.

And now you wonder why we don't hit a more obvious race condition, since the call to unmount from CRI-O happens before the update of the status. The reason is that the unmount is pretty flexible: it allows the unmount to take up to 20 seconds. I don't know about the internals of devicemapper here, but I have confirmed this behavior by putting a 30-second sleep between those 2 lines. After 20 seconds it issues an error saying that the device unmount/remove could not be done because it is still busy.
This is a behavior we have sometimes seen on the CI, and we always wondered why we couldn't reproduce it locally. The answer is that the CI is much slower because of nested virtualization, and it sometimes hits those timeouts.
Now, the question is: how do we solve this?
Well, I think we have 2 options:
1. We could translate a `cc-runtime kill` into `vc.StopContainer` in case the signal is `SIGTERM` or `SIGKILL`.
2. We could ask the CRI-O folks to have `UpdateStatus()` called right after they stopped the container. This way we would get `cc-runtime state` called BEFORE they try to unmount the storage. But from their perspective, it would make sense too: they should not try to unmount/remove the storage right after they sent a kill; they should actually make sure the container is properly stopped before going further.