containers / virtcontainers

A Go package for building hardware virtualized container runtimes
Apache License 2.0
139 stars 43 forks source link

"docker run" never returns #491

Closed jodh-intel closed 6 years ago

jodh-intel commented 6 years ago

As part of landing #483, I had to create a test runtime branch which included:

Testing locally showed that if the user runs:

$ sudo docker run -ti busybox true

even though the shim and hypervisor shut down, docker run will hang.

The relevant lines from the journal:

time="2017-11-20T16:25:03Z" level=info arguments="\"kill --all 72da543040bc50715782075f750b543d99cabd8dd568351e4421a67bc4e29e2a SIGKILL\"" commit=9c69
time="2017-11-20T16:25:03Z" level=debug msg="Checking nesting environment" inside-vm=true source=virtcontainers subsystem=qemu
time="2017-11-20T16:25:03Z" level=info msg="unplugging block device" container-id=72da543040bc50715782075f750b543d99cabd8dd568351e4421a67bc4e29e2a pod
time="2017-11-20T16:25:03Z" level=info msg="Unplugging block device" container-id=72da543040bc50715782075f750b543d99cabd8dd568351e4421a67bc4e29e2a dev
time="2017-11-20T16:25:03Z" level=info msg="{\"QMP\": {\"version\": {\"qemu\": {\"micro\": 1, \"minor\": 7, \"major\": 2}, \"package\": \"(2.7.1+git.d
time="2017-11-20T16:25:03Z" level=info msg="{\"execute\":\"qmp_capabilities\"}" source=virtcontainers subsystem=qmp
time="2017-11-20T16:25:03Z" level=info msg="{\"return\": {}}" source=virtcontainers subsystem=qmp
time="2017-11-20T16:25:03Z" level=info msg="{\"arguments\":{\"id\":\"virtio-drive-72da543040bc50715782075f7\"},\"execute\":\"device_del\"}" source=vir
time="2017-11-20T16:25:03Z" level=info msg="{\"return\": {}}" source=virtcontainers subsystem=qmp
time="2017-11-20T16:25:04Z" level=info msg="{\"timestamp\": {\"seconds\": 1511195104, \"microseconds\": 39143}, \"event\": \"DEVICE_DELETED\", \"data\
time="2017-11-20T16:25:04Z" level=info msg="{\"timestamp\": {\"seconds\": 1511195104, \"microseconds\": 39414}, \"event\": \"DEVICE_DELETED\", \"data\
time="2017-11-20T16:25:04Z" level=info msg="{\"arguments\":{\"id\":\"drive-72da543040bc50715782075f7\"},\"execute\":\"x-blockdev-del\"}" source=virtco
time="2017-11-20T16:25:04Z" level=info msg="{\"return\": {}}" source=virtcontainers subsystem=qmp
time="2017-11-20T16:25:04Z" level=debug msg="Checking nesting environment" inside-vm=true source=virtcontainers subsystem=qemu
time="2017-11-20T16:25:04Z" level=error msg="Container not running, impossible to signal the container" source=runtime
time="2017-11-20T16:25:04Z" level=debug msg="loaded configuration" format=TOML source=runtime
time="2017-11-20T16:25:04Z" level=info arguments="\"delete 72da543040bc50715782075f750b543d99cabd8dd568351e4421a67bc4e29e2a\"" commit=9c69a8a60f4da9d5
time="2017-11-20T16:25:04Z" level=debug msg="Checking nesting environment" inside-vm=true source=virtcontainers subsystem=qemu
time="2017-11-20T16:25:04Z" level=debug msg="Checking nesting environment" inside-vm=true source=virtcontainers subsystem=qemu
time="2017-11-20T16:25:04Z" level=error msg="ERROR received from VM agent" source=runtime
time="2017-11-20T16:25:04.067015537Z" level=debug msg="containerd: process exited" id=72da543040bc50715782075f750b543d99cabd8dd568351e4421a67bc4e29e2a pi
time="2017-11-20T16:25:04Z" level=debug msg="loaded configuration" format=TOML source=runtime
time="2017-11-20T16:25:04Z" level=info arguments="\"delete 72da543040bc50715782075f750b543d99cabd8dd568351e4421a67bc4e29e2a\"" commit=9c69a8a60f4da9d5
time="2017-11-20T16:25:04Z" level=debug msg="Checking nesting environment" inside-vm=true source=virtcontainers subsystem=qemu
time="2017-11-20T16:25:04Z" level=debug msg="Checking nesting environment" inside-vm=true source=virtcontainers subsystem=qemu
time="2017-11-20T16:25:04Z" level=info msg="Read on CTL channel ended: read unix @->/run/virtcontainers/pods/72da543040bc50715782075f750b543d99cabd8dd56

The problem appears to be commit 711783f547cb679c698c2081a5a110648282cda6 from #482.

jodh-intel commented 6 years ago

@sboeuf - do you want to take a look at this one?

sboeuf commented 6 years ago

@jodh-intel I have some high priority work to do on the agent, but as soon as I am available, I can take a look. But you should try to take a look in the meantime.

sboeuf commented 6 years ago

@jodh-intel I have just gave this a try and I found where the problem is. My last changes on virtcontainers need to rely on the latest agent. Unfortunately, the agent changes have not landed yet on a new clear-containers.img.... If you build the agent yourself from the master of the agent repo, and you replace the agent binary inside your clear-containers.img, everything works perfectly. Please give this a try and let me know how it goes.

cc @mcastelino @sameo

jodh-intel commented 6 years ago

After quite a bit of debugging (confused by the fact that creating a custom agent using osbuilder was behaving differently due to newer code :), I think the minimal fix is the following but I'd like your thoughts on this:

diff --git a/container.go b/container.go
index 3c12f23..981b3cb 100644
--- a/container.go
+++ b/container.go
@@ -677,6 +677,11 @@ func (c *Container) kill(signal syscall.Signal, all bool) error {
                return c.setContainerState(StateStopped)
        }

+       if state.State == StateStopped {
+               // already shutdown
+               return nil
+       }
+
        if state.State != StateRunning {
                return fmt.Errorf("Container not running, impossible to signal the container")
        }

It also looks like I'm also going to have to rework #483 to change the way DeletePod() works due to recent vc + agent changes. I can't wait for the (pod-specific) proxy instance to exit any more due to https://github.com/clearcontainers/agent/commit/25e604732f8ef2b3d8d207681eff06a3b57d4b54 so will either need to tell it to shutdown from vc, or just kill it (but not wait as vc/runtime won't be the parent and cannot therefore handle the scenario as the non-parent wait options won't work reliably in this scenario).

jodh-intel commented 6 years ago

It looks like the above patch is somehow miraculously no longer needed. Re-testing by re-vendoring the latest virtcontainers changes into the latest runtime with my vc changes on top and using the latest agent via osbuilder, it now all works! I can only assume there was some cruft left over somewhere when I was testing the changes at the runtime level. Or fairy dust, cosmic rays, pixies or trolls. Or aliens.

Closing now that https://github.com/containers/virtcontainers/pull/483 is behaving.

jodh-intel commented 6 years ago

:alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien::alien:

mcastelino commented 6 years ago

@jodh-intel do we still need to rebuild the image?