contiv / netplugin

Container networking for various use cases
Apache License 2.0

netplugin panics when etcd is stopped #385

Open mapuri opened 8 years ago

mapuri commented 8 years ago

I noticed this while experimenting with etcd container restarts. Just filing it to discuss whether this needs to be resolved in netplugin in any way. I feel it would be desirable to handle this more gracefully by retrying the connection and failing subsequent client requests.

Right now the system gets into a weird state with this behavior. Basically, once docker is stopped, the etcd container stops and netplugin panics. From this point on, restarting docker doesn't work, as it seems to be waiting on netplugin (which looks like a docker bug, but I couldn't find any known issues). Netplugin won't start as it needs etcd, and etcd won't start as it needs docker :)

Docker version:

[vagrant@host0 ~]$ sudo docker version
Client:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Wed Apr 27 00:34:42 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Wed Apr 27 00:34:42 2016
 OS/Arch:      linux/amd64

Netplugin Version:

[vagrant@host0 ~]$ netplugin --version
Version: v0.1-05-16-2016.08-29-25.UTC
GitCommit: ccd7e35
BuildTime: 05-16-2016.08-29-25.UTC

[vagrant@host0 ~]$ netctl version
Client Version:
Version: v0.1-05-16-2016.08-29-25.UTC
GitCommit: ccd7e35
BuildTime: 05-16-2016.08-29-25.UTC

Server Version:
Version: v0.1-05-16-2016.08-29-25.UTC
GitCommit: ccd7e35
BuildTime: 05-16-2016.08-29-25.UTC

Docker logs showing that it is waiting on netplugin:

May 24 21:45:24 host0 docker[19111]: time="2016-05-24T21:45:24.828073857Z" level=error msg="Post http://%2Frun%2Fdocker%2Fplugins%2Fnetplugin.sock/Plugin.Activate: dial unix /run/docker/plugins/netplugin.sock: connect: connection refused"
May 24 21:45:37 host0 docker[19111]: time="2016-05-24T21:45:37.198167344Z" level=warning msg="Unable to connect to plugin: /run/docker/plugins/netplugin.sock:/Plugin.Activate, retrying in 1s"
May 24 21:45:38 host0 docker[19111]: time="2016-05-24T21:45:38.246612384Z" level=warning msg="Unable to connect to plugin: /run/docker/plugins/netplugin.sock:/Plugin.Activate, retrying in 2s"
May 24 21:45:40 host0 docker[19111]: time="2016-05-24T21:45:40.338515458Z" level=warning msg="Unable to connect to plugin: /run/docker/plugins/netplugin.sock:/Plugin.Activate, retrying in 4s"
May 24 21:45:44 host0 docker[19111]: time="2016-05-24T21:45:44.396971070Z" level=warning msg="Unable to connect to plugin: /run/docker/plugins/netplugin.sock:/Plugin.Activate, retrying in 8s"
May 24 21:45:52 host0 docker[19111]: time="2016-05-24T21:45:52.524959443Z" level=error msg="Post http://%2Frun%2Fdocker%2Fplugins%2Fnetplugin.sock/Plugin.Activate: dial unix /run/docker/plugins/netplugin.sock: connect: connection refused"

Netplugin logs showing panic backtrace:

May 24 21:43:26 host0 netplugin[18589]: time="May 24 21:43:26.601878402" level=error msg="Error client: etcd cluster is unavailable or misconfigured during watch"
May 24 21:43:26 host0 netplugin[18589]: panic: runtime error: invalid memory address or nil pointer dereference
May 24 21:43:26 host0 netplugin[18589]: [signal 0xb code=0x1 addr=0x10 pc=0x75c16b]
May 24 21:43:26 host0 netplugin[18589]: goroutine 182 [running]:
May 24 21:43:26 host0 netplugin[18589]: github.com/contiv/netplugin/state.(*EtcdStateDriver).channelEtcdEvents(0xc82014bf60, 0x7f273650d900, 0xc82047a440, 0xc82048e900)
May 24 21:43:26 host0 netplugin[18589]: /opt/gopath/src/github.com/contiv/netplugin/state/etcdstatedriver.go:136 +0x21b
May 24 21:43:26 host0 netplugin[18589]: created by github.com/contiv/netplugin/state.(*EtcdStateDriver).WatchAll
May 24 21:43:26 host0 netplugin[18589]: /opt/gopath/src/github.com/contiv/netplugin/state/etcdstatedriver.go:162 +0x1e9
May 24 21:43:26 host0 netplugin[18589]: goroutine 1 [chan receive, 2 minutes]:
May 24 21:43:26 host0 netplugin[18589]: main.handleEvents(0xc8200f5f00, 0xc8200f8b80, 0x5, 0x7ffe9877aedd, 0x6, 0x0, 0x0, 0x0, 0x0, 0x7ffe9877af11, ...)
May 24 21:43:26 host0 netplugin[18589]: /opt/gopath/src/github.com/contiv/netplugin/netplugin/netd.go:316 +0x11c
May 24 21:43:26 host0 netplugin[18589]: main.main()
May 24 21:43:26 host0 netplugin[18589]: /opt/gopath/src/github.com/contiv/netplugin/netplugin/netd.go:526 +0x11b0
May 24 21:43:26 host0 netplugin[18589]: goroutine 17 [syscall, 2 minutes, locked to thread]:
May 24 21:43:26 host0 netplugin[18589]: runtime.goexit()
May 24 21:43:26 host0 netplugin[18589]: /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1
May 24 21:43:26 host0 netplugin[18589]: goroutine 12 [runnable]:
May 24 21:43:26 host0 netplugin[18589]: net.runtime_pollWait(0x7f27364fc4f8, 0x72, 0xc82000e120)
May 24 21:43:26 host0 netplugin[18589]: /usr/local/go/src/runtime/netpoll.go:157 +0x60

yekaifeng commented 8 years ago

Since etcd is so important, consider running it as a system service instead of in a container.