I was able to patch the signal path and am still looking at the others. The remaining goroutines appear to be jobs not closing when a reload occurs, since I was able to remove 2 more by switching jobs back to using the bus for shutdown. This matches anecdotal experience with the reload endpoint during development of 3.5.0. We could very well not be completing these goroutines on reload.
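For clarity, here's a minimal sketch of the pattern I mean by "using the bus for shutdown" (this is not ContainerPilot's actual events API, just the general shape): each job goroutine selects on a shutdown channel, so a reload can close that channel and the goroutine returns instead of leaking.

package main

import "sync"

// runJob drains work until the shutdown channel is closed, then returns,
// letting the WaitGroup observe that the goroutine actually completed.
func runJob(work <-chan func(), shutdown <-chan struct{}, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-shutdown:
			return // exits cleanly on reload instead of leaking
		case fn := <-work:
			fn()
		}
	}
}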
I also tested previous releases of CP. The fact that reloading was commonly broken before 3.5.0 makes me feel like having it consistently functional now is progress toward cleaning this type of thing up. I saw similar memory stats, but nothing like solid evidence.
Still looking at options.
I'm beginning to push through the fix for the reload handler and get a patch release out this afternoon.
It looks like the remaining increasing goroutines are stemming from logrus and not the actual job goroutines (which are closing, BTW). I could be wrong and maybe these are eventually cleaned up; it would be surprising if this were suddenly an issue.
goroutine 43 [semacquire]:
sync.runtime_notifyListWait(0xc42010e400, 0xc400000001)
/usr/local/Cellar/go/1.9.1/libexec/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc42010e3f0)
/usr/local/Cellar/go/1.9.1/libexec/src/sync/cond.go:56 +0x80
io.(*pipe).read(0xc42010e3c0, 0xc4202f3061, 0xf9f, 0xf9f, 0x0, 0x0, 0x0)
/usr/local/Cellar/go/1.9.1/libexec/src/io/pipe.go:47 +0xc6
io.(*PipeReader).Read(0xc42000e400, 0xc4202f3061, 0xf9f, 0xf9f, 0xc42012d5e0, 0xc420379450, 0x500)
/usr/local/Cellar/go/1.9.1/libexec/src/io/pipe.go:130 +0x4c
bufio.(*Scanner).Scan(0xc4204b4f38, 0x1)
/usr/local/Cellar/go/1.9.1/libexec/src/bufio/scan.go:207 +0xaf
github.com/joyent/containerpilot/vendor/github.com/sirupsen/logrus.(*Entry).writerScanner(0xc420148188, 0xc42000e400, 0xc420299750)
/Users/justinreagor/go/src/github.com/joyent/containerpilot/vendor/github.com/sirupsen/logrus/writer.go:51 +0xab
created by github.com/joyent/containerpilot/vendor/github.com/sirupsen/logrus.(*Entry).WriterLevel
/Users/justinreagor/go/src/github.com/joyent/containerpilot/vendor/github.com/sirupsen/logrus/writer.go:43 +0x174
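For context on why these stack up, here's a minimal sketch (not ContainerPilot code) of how logrus's WriterLevel behaves: it returns an *io.PipeWriter and spawns the writerScanner goroutine seen above to read from the pipe. That goroutine only exits once the writer is closed, so a writer created on every reload and never closed leaks one goroutine each time.

package main

import (
	"io"

	"github.com/sirupsen/logrus"
)

// newJobLogWriter wires a job's output into logrus. Each call spawns a
// writerScanner goroutine inside logrus that reads from an io.Pipe.
func newJobLogWriter(log *logrus.Logger) io.WriteCloser {
	return log.WriterLevel(logrus.DebugLevel)
}

func main() {
	log := logrus.New()
	w := newJobLogWriter(log)
	w.Write([]byte("job output routed through logrus\n"))
	// Without this Close the pipe's reader never sees EOF and the
	// writerScanner goroutine blocks in Scan forever (the semacquire above).
	w.Close()
}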
The only real answer will come out of the testing rig mentioned in #522 and from formalizing @tgross's prior leak investigation, which is something I've been meaning to do for weeks now.
Here's the current data I've gathered.
Given the following config file, we can expect several types of goroutines to be alive at various moments during runtime.
{
  consul: "consul:8500",
  logging: {
    level: "DEBUG",
    format: "text"
  },
  jobs: [
    {
      name: "test-thing",
      exec: "tail -f"
    }
  ],
  control: {
    socket: "/tmp/cp-single.socket"
  }
}
This stack trace was pulled in the middle of CP's event loop. You can clearly see the goroutines you'd expect before reloading ContainerPilot: one for the Job, one for the control server's listen and shutdown, the signal handler, pprof, etc.
However, in this stack trace, which was collected at the end of CP's event loop (after reloading), you can clearly see an increasing number (4, after several reloads) of goroutines involving logrus. This is what I'm investigating.
I'll handle those logging goroutines as a separate matter.
I just noticed this mentioned in a comment on the previous memory leak issue, so in hindsight I feel much better.
The number associated with logrus.(*Entry).writerScanner churns a lot but everything else is steady-state.
Good to know I wasn't the only one seeing logrus churning.
Outputting runtime.NumGoroutine() at the beginning and end of the reload cycle displays an ever-increasing number of goroutines. Grabbing a stack trace of all goroutines shows the following two areas that need work (the other 3 goroutines not shown do NOT increase over reloads). The latter is a bug in the signal handling path and was definitely introduced by me in v3.5.0. This should be easy to fix.
I'm not sure yet why the former involves logrus and increases by 2 goroutines every reload. I'll continue reviewing.
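For reference, this is roughly the instrumentation described above, with reload() as a hypothetical stand-in for CP's actual reload path:

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// instrumentedReload logs the goroutine count before and after a reload and
// dumps full stacks for every goroutine when the count climbs.
func instrumentedReload(reload func()) {
	before := runtime.NumGoroutine()
	reload()
	after := runtime.NumGoroutine()
	log.Printf("goroutines: before=%d after=%d", before, after)
	if after > before {
		// debug=2 prints a full stack trace per goroutine, the same form
		// as the traces pasted above
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
	}
}

func main() {
	instrumentedReload(func() { /* trigger CP's reload here */ })
}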