containers / virtcontainers

A Go package for building hardware virtualized container runtimes
Apache License 2.0

Race condition with CRI-O #606

Closed · sboeuf closed this issue 6 years ago

sboeuf commented 6 years ago

I have discovered a bug in the way we work with CRI-O. Specifically, it happens when CRI-O tries to stop a running container:

POD_ID=$(sudo crictl runp test/testdata/sandbox_config.json)
CONTAINER_ID=$(sudo crictl create $POD_ID test/testdata/container_redis.json test/testdata/sandbox_config.json)
sudo crictl start $CONTAINER_ID

sudo crictl stop $CONTAINER_ID

First, you need to know that CRI-O's stop command corresponds to a kill command from the runtime's perspective, which translates into cc-runtime kill in our case.

Then, you also need to know that CRI-O expects to be able to unmount the storage (rootfs) provided at the cc-runtime create stage once cc-runtime kill has returned. This is logical if you think about the runc case, where everything can be expected to be unmountable after the container process has been terminated. Here is call1, which calls into call2. And here is the call expecting to unmount.
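To make the ordering concrete, here is a minimal Go sketch of the sequence CRI-O relies on; runtimeKill and storageUnmount are hypothetical helpers standing in for the linked calls, not CRI-O's actual functions.

package sketch

import "syscall"

// Hypothetical stand-ins for CRI-O's runtime and storage layers;
// the real code lives in CRI-O's server and storage packages.
func runtimeKill(id string, sig syscall.Signal) error { return nil }
func storageUnmount(id string) error                  { return nil }

// stopContainer sketches the ordering CRI-O depends on: first the
// runtime kill (i.e. "cc-runtime kill"), then the rootfs unmount.
func stopContainer(id string) error {
	if err := runtimeKill(id, syscall.SIGTERM); err != nil {
		return err
	}
	// Once kill has returned, CRI-O expects the storage mounted at
	// create time to be unmountable immediately.
	return storageUnmount(id)
}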

Now you can see that we have an issue: we never actually stop our container, which means we never unmount what was previously mounted. Take a look at this to see that we only send a signal when we receive a cc-runtime kill, nothing else.
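For contrast, here is a sketch of what our kill path amounts to today; the agent type and function names are illustrative, not the actual virtcontainers API.

package sketch

import "syscall"

// agent is an illustrative stand-in for the agent running inside
// the VM; virtcontainers' real types and names differ.
type agent struct{}

// signalProcess asks the in-VM agent to deliver sig to the
// container's process.
func (a *agent) signalProcess(containerID string, sig syscall.Signal) error {
	return nil
}

// killContainer shows the point made above: on cc-runtime kill we
// only forward the signal; the mounts set up at create time stay
// in place, and the container status is never updated.
func killContainer(a *agent, containerID string, sig syscall.Signal) error {
	return a.signalProcess(containerID, sig)
}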

But why does it work then? That's a good question, and here is the answer:

We have a mechanism that checks a container's status against the state of the container process (basically, it checks the shim process). This mechanism is triggered every time someone asks for the status of the container; in our runtime every action asks for this, but it can also be reached by calling cc-runtime state directly. If the container is marked as RUNNING but the shim process is not there anymore, we update the container status by calling into container.stop here. This has 2 effects: it stops the container, by calling into the agent and unmounting everything needed, and it updates the container status to STOPPED.
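The mechanism can be sketched as follows; all names here are made up for illustration, the real check lives in the virtcontainers status path mentioned above.

package sketch

import "syscall"

type containerState int

const (
	stateRunning containerState = iota
	stateStopped
)

type container struct {
	state   containerState
	shimPID int
}

// stop stands in for container.stop: it asks the agent to stop the
// container and unmounts everything that create had mounted.
func (c *container) stop() error { return nil }

// shimAlive reports whether the shim process still exists:
// kill(pid, 0) delivers no signal but fails if the pid is gone.
func shimAlive(pid int) bool {
	return syscall.Kill(pid, 0) == nil
}

// updateStatus sketches the check described above: a container
// marked RUNNING whose shim has vanished gets stopped (which also
// unmounts) and its status is flipped to STOPPED.
func updateStatus(c *container) error {
	if c.state == stateRunning && !shimAlive(c.shimPID) {
		if err := c.stop(); err != nil {
			return err
		}
		c.state = stateStopped
	}
	return nil
}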

But then you wonder who is calling into this so that we get the container unmounted, by luck. Well, CRI-O has a watcher running in a separate goroutine which monitors the container process. When this process returns (after the kill has been sent by CRI-O), the watcher triggers a call to cc-runtime state here, and this is how we eventually get the container marked STOPPED and properly unmounted, almost at the same time.
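That watcher pattern looks roughly like this; watchContainer is a hypothetical name, and CRI-O's real monitor differs in detail.

package sketch

import "os/exec"

// watchContainer sketches CRI-O's monitor goroutine: it waits for
// the container process to exit, then asks the runtime for the
// container state, which is what (indirectly) stops and unmounts
// the container as described above.
func watchContainer(proc *exec.Cmd, containerID string) {
	go func() {
		_ = proc.Wait() // returns once the kill has taken effect
		// "cc-runtime state" runs the status check and, through
		// container.stop, marks the container STOPPED and unmounts.
		_ = exec.Command("cc-runtime", "state", containerID).Run()
	}()
}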

And now you wonder why we don't hit a more obvious race condition, since the call to unmount from CRI-O happens before the status update. The reason is that the unmount is pretty flexible: it allows the unmount to take up to 20 seconds. I don't know about the internals of devicemapper here, but I have confirmed this behavior by putting a 30-second sleep between those 2 lines. After 20 seconds, it issues an error saying that the device unmount/remove could not be done because the device is still busy:

Stopping the container failed: rpc error: code = Unknown desc = failed to unmount container bb325756e1c246f0c7b6639f17d84b841c46bc31ea638f0df716ae6e38425c13: Device is Busy
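In other words, the unmount behaves like a bounded retry: it keeps trying until it succeeds or the window (about 20 seconds) elapses, and only then surfaces the busy error quoted above. A sketch of that pattern, without the devicemapper internals:

package sketch

import (
	"fmt"
	"syscall"
	"time"
)

// unmountWithRetry mimics the ~20 second window described above:
// the unmount is retried until it succeeds or the deadline passes,
// at which point a "busy" error like the one quoted surfaces.
func unmountWithRetry(target string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := syscall.Unmount(target, 0)
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("failed to unmount %s: %v", target, err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}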

This is a behavior we have sometimes seen on the CI, and we always wondered why we couldn't reproduce it locally. The answer is that the CI is much slower because of nested virtualization, and it sometimes hits those timeouts.

Now the question is: how do we solve this?

Well, I think we have two options:

sboeuf commented 6 years ago

@sameo @grahamwhaley @jodh-intel @egernst @amshinde

sameo commented 6 years ago

@sboeuf thanks a lot for the detailed explanation, and for finding that out.

From a CRI-O perspective, sending a kill(TERM|KILL) to a container and not getting an error back means the container has been properly stopped. So asking for the container status before unmounting would be redundant.

Now, if you look at the OCI/runc kill operation (it is an OCI-specific operation):

In both cases we are basically being asked to stop the container the signal is directed at. In other words, there is a bijection between kill(KILL|TERM) and StopContainer(). So IMHO the first option makes more sense:

[...] we should translate a cc-runtime kill into vc.StopContainer in case the signal is SIGTERM or SIGKILL.
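Concretely, that first option could look something like the sketch below inside the cc-runtime kill handler. The vcAPI signatures are stand-ins and may not match the real virtcontainers functions; only the vc.StopContainer name comes from the proposal above.

package sketch

import "syscall"

// vcAPI is a stand-in for the virtcontainers surface used here;
// the real signatures may differ from these.
type vcAPI interface {
	StopContainer(podID, containerID string) error
	KillContainer(podID, containerID string, sig syscall.Signal) error
}

// kill sketches option 1: SIGTERM and SIGKILL are treated as "stop
// this container" (agent shutdown, unmounts, status flipped to
// STOPPED), while any other signal is simply forwarded.
func kill(vc vcAPI, podID, containerID string, sig syscall.Signal) error {
	switch sig {
	case syscall.SIGTERM, syscall.SIGKILL:
		return vc.StopContainer(podID, containerID)
	default:
		return vc.KillContainer(podID, containerID, sig)
	}
}

This keeps the kill-to-stop mapping inside the runtime, so CRI-O's assumption that the rootfs is unmountable right after kill returns holds without any extra status query.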

sboeuf commented 6 years ago

@sameo here is the CRI-O fix: https://github.com/kubernetes-incubator/cri-o/pull/1326