Closed sboeuf closed 6 years ago
lgtm
Have we got tests in place to check these changes though? Have you run @grahamwhaley's magic https://github.com/clearcontainers/tests/blob/master/integration/stability/soak_parallel_rm.sh on this?
@grahamwhaley - actually, is that script ready to be run as part of the CI checks now dyt? If so, could you raise an issue for that?
@jodh-intel we need to check and probably rework that test a little. I think it needs two 'updates':
/me notes all the CIs look to have failed ?
Can you clarify why the different in kata-agent (at creation) versus cc-agent (at start)? Any impact performance wise of doing the hot plugs earlier?
Shouldn't we still be able to hot-plug devices later as well, or this just specific to the container workload rootfs?
@egernst the code taking care of the rootfs hotplug (in case of devicemapper) has to be called before we try to create a container from the agent in case of Kata, otherwise the container won't be properly created. In case of hyperstart, we were doing this at the start stage because the call to agent.createContainer()
was actually a no-op in this case. I think that Hyperstart is more a particular case, but the logic is to prepare everything (rootfs, mounts, devices) so that the container can be properly created without missing any resources. That's why I globally moved this code from container.start()
to createContainer()
.
We were doing a lot of things from the start stage, but we actually realized that most of the work has to be performed at creation time. This way, the start stage becomes a simple trigger to actually start the container workload.
Does that make sense ?
Note on this PR. I think I have found why the CRIO tests were not passing for devicemapper case only (ubuntu 17.10). All the failing tests were dealing with created containers that were never started. In this case when the container is stopped by CRIO, we get a kill signal, meaning we kill the shim without passing the signal to the agent (not supported), and we set the container state to STOPPED. For this reason, the container.stop()
is called (when CRIO triggers a cc-runtime state
) but it returns immediately because the container is already stopped.
In the context of this PR moving the rootfs creation from container.start()
to createContainer()
, this is the root cause of the observed issues... Indeed, container.stop()
not being called was not an issue until now, because a CREATED container had not done any extra mount or hotplug.
@sameo please review it, I have managed to get CRI-O issues fixed by allowing a container to be STOPPED from a READY state (which is a valid case). To avoid any confusion, issue #606 is a different issue which needs to be addressed separately.
The goal of this pull request is to enable the support for block devices for Kata Containers. It appeared that our sequence in container.go was tied to the hyperstart agent expecting things to be hotplugged only at the start stage. In case of Kata Containers, we need this to happen at the creation stage. This won't hurt the case for the hyperstart agent, and this will make it work for Kata too. Now this pull request also solves other issues about implicit update of the mount list that has been discovered while trying to solve the initial problem. That's why it also refactor a bit the way bind mounts from each agent implementation actually updates the list of mounts held by the container structure.