jason-o-matic closed 8 years ago
👍
I believe what I'm seeing is the monitor process, but might want to set a modeline(?) on that (87670 in the pic):
Anyway, testing this with kill, everything seems to be cool and restarts appropriately, or in the case of killing the main instrumentald process, everything else goes offline. 👍 other than the process name thing above.
Btw, if you kill the unnamed process, when you terminate the main instrumentald process you get some output like this: telegraf execution failed, 1 total failures
@janxious I thought the process naming thing was weird because https://github.com/Instrumental/instrumentald/commit/35ec47db6aeeddff147faab497ddfbd473c784b6#diff-2fb50e9924aa3f9c12ff26be23d100e0R235 should have put "[Monitor]" at the beginning of the command line.
Once I dug in, though, I discovered you found a separate issue. The (ruby) process was actually a zombie process created when daemonizing telegraf. Turns out I just had to add a Process.detach to fix it.
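For anyone following along, here is a minimal sketch of why Process.detach clears up a zombie like that one. It is not the actual instrumentald change; the helper name spawn_detached and the throwaway child command are made up for illustration.

```ruby
require "rbconfig"

# Hypothetical helper (not from instrumentald): start a child process and
# detach it so that its exit status is collected automatically.
def spawn_detached(*cmd)
  pid = Process.spawn(*cmd)
  # Process.detach starts a thread that waits on pid. Without some form of
  # wait, the exited child lingers as a zombie entry in the process table.
  Process.detach(pid)
end

# A short-lived child, standing in for the intermediate process that
# exits while daemonizing.
waiter = spawn_detached(RbConfig.ruby, "-e", "exit 0")
waiter.join # optional; shown only to demonstrate that the child was reaped
```

Process.detach returns the waiting thread, so a caller that cares about the exit status can still read it from that thread's value.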
Wanna give 'er the ol' re-review?
Using signal 9 or 3, this orphans the telegraf process and the starting ruby process does nothing to restore the monitor when it's gone.
Using signal 9 or 3, the telegraf process dies, and the monitor process restarts.
Using signal 9 or 3, everything dies within 15s.
Here's some fun output:
joel@hel [cleanup_telegraf_process *+] instrumentald $ SIGQUIT: quit
PC=0x4064ceb m=0
goroutine 0 [idle]:
runtime.mach_semaphore_wait(0x1103, 0x6659360, 0x56db320, 0x4026184, 0x56e0760, 0x56db320, 0x4058309, 0xffffffffffffffff, 0x56db320, 0x7fff5fbff244, ...)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/sys_darwin_amd64.s:411 +0xb
runtime.semasleep1(0xffffffffffffffff, 0x56db320)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/os1_darwin.go:423 +0xdf
runtime.semasleep.func1()
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/os1_darwin.go:439 +0x29
runtime.systemstack(0x7fff5fbff248)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/asm_amd64.s:307 +0xab
runtime.semasleep(0xffffffffffffffff, 0x0)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/os1_darwin.go:440 +0x36
runtime.notesleep(0x56dc0e8)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/lock_sema.go:166 +0xed
runtime.stopm()
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/proc.go:1538 +0x10b
runtime.findrunnable(0xc820025500, 0x0)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/proc.go:1976 +0x739
runtime.schedule()
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/proc.go:2075 +0x24f
runtime.goexit0(0xc820622180)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/proc.go:2210 +0x1f9
runtime.mcall(0x7fff5fbff3d0)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/asm_amd64.s:233 +0x5b
goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc8200135ac)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/sema.go:47 +0x26
sync.(*WaitGroup).Wait(0xc8200135a0)
/usr/local/Cellar/go/1.6.2/libexec/src/sync/waitgroup.go:127 +0xb4
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc8204ae030, 0xc8200102a0, 0x0, 0x0)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:339 +0xa0b
main.main()
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:274 +0x2624
goroutine 17 [syscall, locked to thread]:
runtime.goexit()
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/asm_amd64.s:1998 +0x1
goroutine 5 [syscall]:
os/signal.signal_recv(0x0)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/sigqueue.go:116 +0x132
os/signal.loop()
/usr/local/Cellar/go/1.6.2/libexec/src/os/signal/signal_unix.go:22 +0x18
created by os/signal.init.1
/usr/local/Cellar/go/1.6.2/libexec/src/os/signal/signal_unix.go:28 +0x37
goroutine 50 [select, locked to thread]:
runtime.gopark(0x5100148, 0xc820408728, 0x4e8f5f0, 0x6, 0x18, 0x2)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/proc.go:262 +0x163
runtime.selectgoImpl(0xc820408728, 0x0, 0x18)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/select.go:392 +0xa67
runtime.selectgo(0xc820408728)
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/select.go:215 +0x12
runtime.ensureSigM.func1()
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/signal1_unix.go:279 +0x32c
runtime.goexit()
/usr/local/Cellar/go/1.6.2/libexec/src/runtime/asm_amd64.s:1998 +0x1
goroutine 51 [chan receive]:
main.main.func2(0xc820010300, 0xc8200102a0, 0xc82020e540)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:246 +0x47
created by main.main
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:256 +0x1ef1
goroutine 19 [select]:
github.com/influxdata/telegraf/agent.(*Agent).flusher(0xc8204ae030, 0xc8200102a0, 0xc820010660, 0x0, 0x0)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:262 +0x36a
github.com/influxdata/telegraf/agent.(*Agent).Run.func1(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:318 +0x7b
created by github.com/influxdata/telegraf/agent.(*Agent).Run
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:322 +0x8d9
goroutine 52 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc8201906c0, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc8201906c0, 0x6fc23ac00)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0
goroutine 53 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190720, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190720, 0x6fc23ac00)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0
goroutine 54 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc8201907b0, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc8201907b0, 0x6fc23ac00)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0
goroutine 55 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190810, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190810, 0x6fc23ac00)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0
goroutine 56 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190870, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190870, 0x6fc23ac00)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0
goroutine 57 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc8201908d0, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc8201908d0, 0x6fc23ac00)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0
goroutine 58 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190930, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190930, 0x6fc23ac00)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0
goroutine 59 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190990, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190990, 0x6fc23ac00)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0
goroutine 60 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190a20, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190a20, 0x6fc23ac00)
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
/Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0
rax 0xe
rbx 0x56dbfe0
rcx 0x7fff5fbff1d0
rdx 0x7fff5fbff248
rdi 0x1103
rsi 0x56db320
rbp 0x1103
rsp 0x7fff5fbff1d0
r8 0x56dbfe0
r9 0xc820026a00
r10 0xc820025ac0
r11 0x286
r12 0xe12928b4edd5
r13 0xe642c2bbe1f8
r14 0x1468e127b5690600
r15 0x56dac60
rip 0x4064ceb
rflags 0x286
cs 0x7
fs 0x0
gs 0x0
This does the thing it says. This as it exists is 👍 from me.
I'm not sure what we should do with the Monitor process, so we should probably improve the way the monitor monitoring is set up? Thoughts @jason-o-matic?
@janxious you're talking about Case 4, yeah? I'm not sure what you mean by "what we should do with the Monitor process". What happened that wasn't expected?
Yeah, case 4. It seems like killing the monitor should take out telegraf and/or the starting process
I suppose we could handle non-KILL signals, but there's nothing we can do with KILLs. I'm not too worried about it since you're already off the rails if you're killing the monitor process.
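Roughly what handling the non-KILL signals could look like, as a sketch only. Nothing like this is in the PR, and install_cleanup_traps and TRAPPABLE_SIGNALS are hypothetical names. It assumes the monitor knows the pid of the telegraf child it started.

```ruby
# Signals a monitor could intercept. KILL (9) is deliberately absent
# because it cannot be trapped.
TRAPPABLE_SIGNALS = %w[TERM INT QUIT].freeze

# Hypothetical helper: when the current process receives one of the
# trappable signals, stop the child first and then exit.
def install_cleanup_traps(child_pid)
  TRAPPABLE_SIGNALS.each do |sig|
    Signal.trap(sig) do
      begin
        Process.kill("TERM", child_pid)
        Process.wait(child_pid) # reap the child so it does not linger
      rescue Errno::ESRCH, Errno::ECHILD
        # The child was already gone; nothing to clean up.
      end
      exit 1
    end
  end
end
```

With this in place a QUIT (signal 3) sent to the monitor would take the child down with it, while a KILL would still leave the child orphaned.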
If we wanted to be really fancy we could have the starting process monitor the monitor and vice versa, but that seems too complicated for where we're at right now.
The watcher-watchers, clearly. Just thought you should know the results. I am okay with "nothing" as the answer to what we do now.
If we want to do something about cross-supervision, that is for another PR.
This should prevent the issue we were seeing on Linux where instrumentald dies and leaves telegraf running. It sets up a monitor process that runs telegraf and watches the instrumentald process, so that when instrumentald exits the monitor can kill telegraf and then exit itself.
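The idea can be sketched in a few lines of Ruby. This is an illustration under assumptions, not the code in this PR: the names process_alive? and monitor, the polling interval, and the restart-on-exit behavior are all made up for the example.

```ruby
# Hypothetical helper: true while a process with the given pid exists.
def process_alive?(pid)
  Process.kill(0, pid) # signal 0 only checks for existence
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

# Hypothetical monitor loop: run child_cmd, restart it if it dies, and
# shut it down once the watched process (e.g. instrumentald) is gone.
def monitor(watched_pid, child_cmd, poll_interval: 1.0)
  child_pid = Process.spawn(*child_cmd)
  loop do
    unless process_alive?(watched_pid)
      Process.kill("TERM", child_pid)
      Process.wait(child_pid)
      return :watched_exited
    end
    # A non-blocking wait returns the pid once the child has exited.
    if Process.wait(child_pid, Process::WNOHANG)
      child_pid = Process.spawn(*child_cmd)
    end
    sleep poll_interval
  end
end
```

One caveat with the existence check: an exited process that has not been reaped by its parent still counts as alive, so the watched process has to be fully gone before the monitor reacts.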