TritonDataCenter / containerpilot

A service for autodiscovery and configuration of applications running in containers
Mozilla Public License 2.0

Reloads are leaking goroutines #521

Closed · jwreagor closed 7 years ago

jwreagor commented 7 years ago

Outputting runtime.NumGoroutine() at the beginning and end of the reload cycle shows an ever-increasing number of goroutines. A stack trace of all goroutines points at two areas that need work (the other 3 goroutines, not shown, do NOT increase across reloads).
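For reference, the counting itself is nothing fancier than this (a minimal sketch; reload() stands in for ContainerPilot's actual reload cycle):

```go
package main

import (
	"log"
	"runtime"
)

// reload is a placeholder for ContainerPilot's reload cycle: tear down and
// rebuild jobs, watches, the control server, and so on.
func reload() {}

func main() {
	for i := 0; i < 5; i++ {
		log.Printf("goroutines before reload: %d", runtime.NumGoroutine())
		reload()
		log.Printf("goroutines after reload:  %d", runtime.NumGoroutine())
	}
}
```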

goroutine profile: total 12
6 @ 0x102ec5c 0x102ed4e 0x1040f00 0x10610d0 0x1065966 0x106605c 0x11049bf 0x114a87b 0x105aef1
#       0x1040eff       sync.runtime_notifyListWait+0x10f                                                               /usr/local/Cellar/go/1.9.1/libexec/src/runtime/sema.go:507
#       0x10610cf       sync.(*Cond).Wait+0x7f                                                                          /usr/local/Cellar/go/1.9.1/libexec/src/sync/cond.go:56
#       0x1065965       io.(*pipe).read+0xc5                                                                            /usr/local/Cellar/go/1.9.1/libexec/src/io/pipe.go:47
#       0x106605b       io.(*PipeReader).Read+0x4b                                                                      /usr/local/Cellar/go/1.9.1/libexec/src/io/pipe.go:130
#       0x11049be       bufio.(*Scanner).Scan+0xae                                                                      /usr/local/Cellar/go/1.9.1/libexec/src/bufio/scan.go:207
#       0x114a87a       github.com/joyent/containerpilot/vendor/github.com/sirupsen/logrus.(*Entry).writerScanner+0xaa  /Users/justinreagor/go/src/github.com/joyent/containerpilot/vendor/github.com/sirupsen/logrus/writer.go:51

3 @ 0x102ec5c 0x102ed4e 0x1006584 0x100626b 0x13c432b 0x105aef1
#       0x13c432a       github.com/joyent/containerpilot/core.(*App).handleSignals.func1+0x4a   /Users/justinreagor/go/src/github.com/joyent/containerpilot/core/signals.go:15

The latter is a bug in the signal-handling path and was definitely introduced by me in v3.5.0. It should be easy to fix.
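The general shape of that leak is a signal-listening goroutine started on every reload that blocks on its channel forever. A hedged sketch of the pattern and the usual fix (illustrative only, not the actual core/signals.go code):

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

type app struct {
	done chan struct{} // closed when this app instance is torn down on reload
}

func (a *app) handleSignals() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGHUP, syscall.SIGTERM)
	go func() {
		defer signal.Stop(sig) // deregister so the channel stops receiving
		for {
			select {
			case s := <-sig:
				log.Printf("got signal: %v", s)
			case <-a.done:
				return // without an exit path, one goroutine leaks per reload
			}
		}
	}()
}

func main() {
	a := &app{done: make(chan struct{})}
	a.handleSignals()
	// on reload: close(a.done), construct a new app, and call handleSignals again
	close(a.done)
}
```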

I'm not sure yet why the former involves logrus and increases by 2 goroutines on every reload. I'll continue reviewing.

jwreagor commented 7 years ago

I was able to patch the signal path and am still looking at the others. The remaining goroutines appear to be jobs not closing when a reload occurs; I was able to remove 2 more by switching jobs back to using the bus for shutdown. This feels similar to prior experiences with the reload endpoint during development of 3.5.0. We could very well not be completing these goroutines on reload.

I also tested previous releases of CP. The fact that reloading was commonly broken before 3.5.0 makes me feel like having it consistently functional now is progress toward cleaning this kind of thing up. I saw similar memory stats in those releases, but nothing conclusive.

Still looking at options.

jwreagor commented 7 years ago

I'm pushing through the fix for the reload handler and will get a patch release out this afternoon.

It looks like the remaining increasing goroutines stem from logrus and not the actual job goroutines (which are closing, by the way). I could be wrong and maybe these are eventually cleaned up, but it would be surprising if this were suddenly an issue.

goroutine 43 [semacquire]:
sync.runtime_notifyListWait(0xc42010e400, 0xc400000001)
        /usr/local/Cellar/go/1.9.1/libexec/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc42010e3f0)
        /usr/local/Cellar/go/1.9.1/libexec/src/sync/cond.go:56 +0x80
io.(*pipe).read(0xc42010e3c0, 0xc4202f3061, 0xf9f, 0xf9f, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.9.1/libexec/src/io/pipe.go:47 +0xc6
io.(*PipeReader).Read(0xc42000e400, 0xc4202f3061, 0xf9f, 0xf9f, 0xc42012d5e0, 0xc420379450, 0x500)
        /usr/local/Cellar/go/1.9.1/libexec/src/io/pipe.go:130 +0x4c
bufio.(*Scanner).Scan(0xc4204b4f38, 0x1)
        /usr/local/Cellar/go/1.9.1/libexec/src/bufio/scan.go:207 +0xaf
github.com/joyent/containerpilot/vendor/github.com/sirupsen/logrus.(*Entry).writerScanner(0xc420148188, 0xc42000e400, 0xc420299750)
        /Users/justinreagor/go/src/github.com/joyent/containerpilot/vendor/github.com/sirupsen/logrus/writer.go:51 +0xab
created by github.com/joyent/containerpilot/vendor/github.com/sirupsen/logrus.(*Entry).WriterLevel
        /Users/justinreagor/go/src/github.com/joyent/containerpilot/vendor/github.com/sirupsen/logrus/writer.go:43 +0x174
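That WriterLevel call is what spins up the writerScanner goroutine, and it only exits once the *io.PipeWriter it returns is closed. A hedged sketch of the pattern I suspect (illustrative, not the actual ContainerPilot code); two unclosed writers per job would also line up with the +2 per reload:

```go
package main

import (
	"os/exec"

	"github.com/sirupsen/logrus"
)

// runWithLogging pipes a command's output through logrus and closes the pipe
// writers when the command exits, which lets the writerScanner goroutines
// finish instead of blocking on the pipe forever.
func runWithLogging(entry *logrus.Entry, name string, args ...string) error {
	stdout := entry.WriterLevel(logrus.InfoLevel)  // starts one writerScanner goroutine
	stderr := entry.WriterLevel(logrus.ErrorLevel) // starts another
	defer stdout.Close()                           // without these Close calls, both
	defer stderr.Close()                           // goroutines outlive the reload

	cmd := exec.Command(name, args...)
	cmd.Stdout = stdout
	cmd.Stderr = stderr
	return cmd.Run()
}

func main() {
	entry := logrus.WithField("job", "test-thing")
	_ = runWithLogging(entry, "echo", "hello")
}
```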

The only real answer will come out of the testing rig mentioned in #522 and formalizing @tgross's prior leak investigation, which is something I've been meaning to do for weeks now.
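Roughly the kind of regression test I'd want that rig to run (a minimal sketch; reload() is a placeholder for whatever the rig ends up exposing):

```go
package core

import (
	"runtime"
	"testing"
	"time"
)

func TestReloadDoesNotLeakGoroutines(t *testing.T) {
	reload := func() { /* drive a ContainerPilot reload here */ }

	reload() // warm up once so lazily-started goroutines don't skew the baseline
	time.Sleep(100 * time.Millisecond)
	baseline := runtime.NumGoroutine()

	for i := 0; i < 10; i++ {
		reload()
	}
	time.Sleep(100 * time.Millisecond) // give exiting goroutines time to finish

	if n := runtime.NumGoroutine(); n > baseline+2 {
		t.Fatalf("goroutines grew from %d to %d after 10 reloads", baseline, n)
	}
}
```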

jwreagor commented 7 years ago

Here's the current data I've gathered.

Given the following config file, we can expect several types of goroutines to be alive at various moments during runtime.

{
  consul: "consul:8500",
  logging: {
    level: "DEBUG",
    format: "text"
  },
  jobs: [
    {
      name: "test-thing",
      exec: "tail -f"
    }
  ],
  control: {
    socket: "/tmp/cp-single.socket"
  }
}

This stack trace was pulled in the middle of CP's event loop. You can clearly see the goroutines you'd expect before reloading ContainerPilot: one for the job, one each for the control server's listen and shutdown, the signal handler, pprof, and so on.

However, in this stack trace, collected at the end of CP's event loop (after reloading), you can clearly see an increasing number of goroutines involving logrus (4, after several reloads). This is what I'm investigating.
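For anyone reproducing this, the full dumps come from the goroutine profile; something as simple as the following produces the same output as hitting /debug/pprof/goroutine?debug=2 over HTTP (a minimal sketch):

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes a full stack trace for every live goroutine to stderr.
func dumpGoroutines() {
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
	dumpGoroutines()
}
```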

jwreagor commented 7 years ago

I'll handle those logging goroutines as a separate matter.

jwreagor commented 7 years ago

I just noticed this mentioned in a comment on the previous memory leak issue, so, in hindsight, I feel much better:

The number associated with logrus.(*Entry).writerScanner churns a lot but everything else is steady-state.

Good to know I wasn't the only one seeing logrus churning.