cnabio / cnab-go

A Go implementation of CNAB Core 1.0
MIT License

Regression waiting for container to stop, ContainerWait never returns #171

Closed · carolynvs closed this issue 4 years ago

carolynvs commented 4 years ago

There was a regression introduced in #169 that I found when upgrading Porter to v0.8.0-beta1. It doesn't reproduce on every machine, but it reproduces solidly every time on Porter's build server, which unfortunately blocked me for a week of tracking it down while we tried to upgrade to Go modules and v0.8.0-beta1. 😅

When the docker driver calls ContainerWait, the call no longer always returns, so that line of code never exits and we eventually hit our test timeouts.
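
For reference, the pattern involved looks roughly like the sketch below. This is not the actual cnab-go driver code: the function and variable names are illustrative, and the docker client calls assume the github.com/docker/docker API of roughly the vendored version that appears in the trace.

```go
// A minimal sketch of the pattern that hangs; not the actual cnab-go driver
// code. Names are illustrative and the docker client signatures are assumed
// from the client version referenced in the goroutine dump.
package main

import (
	"context"
	"fmt"
	"io"
	"os"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func runContainer(ctx context.Context, cli *client.Client, containerID string) error {
	// Stream the container's output in a goroutine (the "stdout copy"
	// goroutine that also shows up in the hung-goroutine dumps).
	attach, err := cli.ContainerAttach(ctx, containerID, types.ContainerAttachOptions{
		Stream: true, Stdout: true, Stderr: true,
	})
	if err != nil {
		return err
	}
	defer attach.Close()
	go func() { _, _ = io.Copy(os.Stdout, attach.Reader) }()

	// The container is started first...
	if err := cli.ContainerStart(ctx, containerID, types.ContainerStartOptions{}); err != nil {
		return err
	}

	// ...and only then do we wait on it. With WaitConditionNextExit (the
	// condition discussed in the comments below), this wait can hang: if the
	// container exits before the wait is registered with the daemon, neither
	// channel ever receives a value and the select blocks forever.
	statusCh, errCh := cli.ContainerWait(ctx, containerID, container.WaitConditionNextExit)
	select {
	case err := <-errCh:
		return err
	case status := <-statusCh:
		if status.StatusCode != 0 {
			return fmt.Errorf("container exited with status %d", status.StatusCode)
		}
	}
	return nil
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}
	// "mycontainer" is a placeholder container ID for illustration.
	if err := runContainer(context.Background(), cli, "mycontainer"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

The goroutine dump from the build server shows where this ends up stuck: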


```
2020-01-09T16:18:48.0251089Z goroutine 671 [IO wait, 16 minutes]:
2020-01-09T16:18:48.0251320Z internal/poll.runtime_pollWait(0x7f0ff1197870, 0x72, 0xffffffffffffffff)
2020-01-09T16:18:48.0251578Z    /usr/local/go1.13/src/runtime/netpoll.go:184 +0x55
2020-01-09T16:18:48.0251850Z internal/poll.(*pollDesc).wait(0xc000345118, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
2020-01-09T16:18:48.0252070Z    /usr/local/go1.13/src/internal/poll/fd_poll_runtime.go:87 +0x45
2020-01-09T16:18:48.0252327Z internal/poll.(*pollDesc).waitRead(...)
2020-01-09T16:18:48.0252579Z    /usr/local/go1.13/src/internal/poll/fd_poll_runtime.go:92
2020-01-09T16:18:48.0252791Z internal/poll.(*FD).Read(0xc000345100, 0xc000147000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
2020-01-09T16:18:48.0253060Z    /usr/local/go1.13/src/internal/poll/fd_unix.go:169 +0x1cf
2020-01-09T16:18:48.0253336Z net.(*netFD).Read(0xc000345100, 0xc000147000, 0x1000, 0x1000, 0x0, 0xc000000000, 0xc000000001)
2020-01-09T16:18:48.0253599Z    /usr/local/go1.13/src/net/fd_unix.go:202 +0x4f
2020-01-09T16:18:48.0255439Z net.(*conn).Read(0xc0000ba0c8, 0xc000147000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
2020-01-09T16:18:48.0255850Z    /usr/local/go1.13/src/net/net.go:184 +0x68
2020-01-09T16:18:48.0256138Z net/http.(*persistConn).Read(0xc0000e4c60, 0xc000147000, 0x1000, 0x1000, 0xd09907, 0x100c000300ce0, 0x101000100)
2020-01-09T16:18:48.0256430Z    /usr/local/go1.13/src/net/http/transport.go:1744 +0x75
2020-01-09T16:18:48.0256681Z bufio.(*Reader).fill(0xc0000bd080)
2020-01-09T16:18:48.0256917Z    /usr/local/go1.13/src/bufio/bufio.go:100 +0x103
2020-01-09T16:18:48.0257183Z bufio.(*Reader).ReadSlice(0xc0000bd080, 0xa, 0x374ed80, 0x271d300, 0xc0000bfe40, 0xc0001bc960, 0xd01c0b)
2020-01-09T16:18:48.0257383Z    /usr/local/go1.13/src/bufio/bufio.go:359 +0x3d
2020-01-09T16:18:48.0257645Z net/http/internal.readChunkLine(0xc0000bd080, 0x20, 0xc000236401, 0x26442b97e995ccd6, 0xc0002363f0, 0xce4fef)
2020-01-09T16:18:48.0257923Z    /usr/local/go1.13/src/net/http/internal/chunked.go:122 +0x34
2020-01-09T16:18:48.0258145Z net/http/internal.(*chunkedReader).beginChunk(0xc0023d21e0)
2020-01-09T16:18:48.0258393Z    /usr/local/go1.13/src/net/http/internal/chunked.go:48 +0x32
2020-01-09T16:18:48.0258678Z net/http/internal.(*chunkedReader).Read(0xc0023d21e0, 0xc00001c000, 0x200, 0x200, 0xc0002364a0, 0x1a9000cce945, 0x2000)
2020-01-09T16:18:48.0259146Z    /usr/local/go1.13/src/net/http/internal/chunked.go:93 +0x132
2020-01-09T16:18:48.0259437Z net/http.(*body).readLocked(0xc00037a440, 0xc00001c000, 0x200, 0x200, 0xcce2d5, 0xc00001c000, 0x7f0ff95ef200)
2020-01-09T16:18:48.0259800Z    /usr/local/go1.13/src/net/http/transfer.go:847 +0x5f
2020-01-09T16:18:48.0260048Z net/http.(*body).Read(0xc00037a440, 0xc00001c000, 0x200, 0x200, 0x0, 0x0, 0x0)
2020-01-09T16:18:48.0260439Z    /usr/local/go1.13/src/net/http/transfer.go:839 +0x102
2020-01-09T16:18:48.0260755Z net/http.(*bodyEOFSignal).Read(0xc00037a580, 0xc00001c000, 0x200, 0x200, 0x0, 0x0, 0x0)
2020-01-09T16:18:48.0261000Z    /usr/local/go1.13/src/net/http/transport.go:2568 +0xe5
2020-01-09T16:18:48.0261209Z encoding/json.(*Decoder).refill(0xc0003e66e0, 0xc0003e66e0, 0xcb5d6b)
2020-01-09T16:18:48.0261410Z    /usr/local/go1.13/src/encoding/json/stream.go:161 +0xeb
2020-01-09T16:18:48.0261647Z encoding/json.(*Decoder).readValue(0xc0003e66e0, 0x0, 0x0, 0x237d2a0)
2020-01-09T16:18:48.0261901Z    /usr/local/go1.13/src/encoding/json/stream.go:136 +0x1dc
2020-01-09T16:18:48.0262156Z encoding/json.(*Decoder).Decode(0xc0003e66e0, 0x20e2ec0, 0xc0003801d0, 0x26c7100, 0xc00037a580)
2020-01-09T16:18:48.0262348Z    /usr/local/go1.13/src/encoding/json/stream.go:63 +0x79
2020-01-09T16:18:48.0262580Z github.com/docker/docker/client.(*Client).ContainerWait.func1(0x26e2100, 0xc00037a580, 0xc0023d2060, 0xc8, 0xc000344100, 0xc00039f440, 0xc00009cba0)
2020-01-09T16:18:48.0264340Z    /home/vsts/work/1/s/gopath/pkg/mod/github.com/moby/moby@v0.7.3-0.20190826074503-38ab9da00309/client/container_wait.go:48 +0x136
2020-01-09T16:18:48.0265611Z created by github.com/docker/docker/client.(*Client).ContainerWait
2020-01-09T16:18:48.0267334Z    /home/vsts/work/1/s/gopath/pkg/mod/github.com/moby/moby@v0.7.3-0.20190826074503-38ab9da00309/client/container_wait.go:45 +0x480
```

I am not sure of the motivation behind the change in #169. We had never seen the bug it described; long-running bundles were waiting just fine.

I have found two fixes for the regression that unblock Porter from upgrading to v0.8.0-beta1:

silvin-lubecki commented 4 years ago

@carolynvs I confirm we have been seeing recurrent random timeouts in our tests since we started using v0.8.0-beta1.

fenngwd commented 4 years ago

@carolynvs thanks for the fix. I still want to know why the previous condition (WaitConditionNextExit) made ContainerWait hang, since I don't see any blocking code in the ContainerWait method. It has me confused.

carolynvs commented 4 years ago

Both Porter's and Docker's CI experienced this. I don't want to spend any additional time on this issue, but it was quite clear that this line of code was causing goroutines to never complete. I can flip this PR (or the original PR it references) on and off and reproduce the regression on demand. On Porter's build server we were seeing goroutines live for as long as our test timeout would allow, and when they were killed, the ones being killed were the stdout copy from a bit higher up in that function and this container wait line.

It seems WaitConditionNextExit was not the correct condition to use when the container is started before we begin waiting on it. In some cases the condition we were waiting for had already been reached by the time we reached that statement, so the wait never returned.
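
For illustration, the difference between the two conditions comes down to something like the following hypothetical sketch. It is not the change that was actually merged, and the docker client signatures are assumed as in the sketch earlier in the thread.

```go
// Hypothetical illustration of the wait-condition semantics; not the actual
// cnab-go fix. Signatures assume github.com/docker/docker/client.
package sketch

import (
	"context"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func waitForExit(ctx context.Context, cli *client.Client, containerID string) (int64, error) {
	// WaitConditionNextExit only observes an exit that happens after the wait
	// is registered with the daemon. If the container was started earlier and
	// has already stopped, nothing is ever sent on statusCh and the caller
	// blocks forever.
	//
	// WaitConditionNotRunning also resolves for a container that has already
	// stopped, so it is safe to use after ContainerStart.
	statusCh, errCh := cli.ContainerWait(ctx, containerID, container.WaitConditionNotRunning)
	select {
	case err := <-errCh:
		return 0, err
	case status := <-statusCh:
		return status.StatusCode, nil
	}
}
```

In short, with WaitConditionNotRunning the ordering of ContainerStart and ContainerWait no longer matters, whereas WaitConditionNextExit only works if the wait is registered before the container is started.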

If you want to dig in further, feel free to do so, but after a week of troubleshooting I'm beat and just happy to be able to move on. 😅