Instrumental / instrumentald

Instrumental System and Service Daemon
MIT License

Updated instrumentald to run telegraf in a way that's easier to clean up. #13

Closed jason-o-matic closed 8 years ago

jason-o-matic commented 8 years ago

This should prevent the issues we were seeing on Linux where instrumentald died and left telegraf running. It sets up a monitor process that runs telegraf and watches the instrumentald process, so that when instrumentald exits, the monitor process can kill telegraf and then exit itself.
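
For readers following along, here is a minimal sketch of that monitor idea. It is not the code from this PR: the names `process_alive?` and `monitor`, the polling approach, and the use of TERM to stop the child are all assumptions made for illustration.

```ruby
# Returns true while a process with this pid exists. Signal 0 performs the
# existence and permission checks without delivering a signal.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

# Hypothetical monitor loop: start `command`, poll `watched_pid`, and stop
# the command once the watched process is gone. Returns the command's
# Process::Status.
def monitor(watched_pid, command, poll_interval: 1.0)
  child_pid = Process.spawn(*command)
  loop do
    # Non-blocking check: did the child already exit on its own?
    reaped, status = Process.wait2(child_pid, Process::WNOHANG)
    return status if reaped

    unless process_alive?(watched_pid)
      Process.kill("TERM", child_pid)
      _, status = Process.wait2(child_pid)
      return status
    end
    sleep poll_interval
  end
end
```

In a setup like the one described, a forked monitor could call something like `monitor(Process.ppid, ["telegraf"])`, with the real telegraf arguments filled in. Polling keeps the sketch simple; the interval bounds how long the child can outlive the watched process.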

esquivalient commented 8 years ago

👍

janxious commented 8 years ago

I believe what I'm seeing is the monitor process, but might want to set a modeline(?) on that (87670 in the pic):

[Image: screen shot 2016-07-18 at 3 23 02 pm]

janxious commented 8 years ago

Anyway, testing this with kill, everything seems to be cool: things restart appropriately or, in the case of killing the main instrumentald process, everything else goes offline. 👍 other than the process name thing above.

janxious commented 8 years ago

Btw, if you kill the unnamed process and then terminate the main instrumentald process, you get some output like this: telegraf execution failed, 1 total failures

jason-o-matic commented 8 years ago

@janxious I thought the process naming thing was weird because https://github.com/Instrumental/instrumentald/commit/35ec47db6aeeddff147faab497ddfbd473c784b6#diff-2fb50e9924aa3f9c12ff26be23d100e0R235 should have put "[Monitor]" at the beginning of the command line.
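
As an illustration of the naming idea only (the linked commit may do it differently), a Ruby process can change the command line that ps shows by assigning to `$0`. The helper `monitor_title` and the pipe used to read the title back are hypothetical and exist just for this demo.

```ruby
MONITOR_PREFIX = "[Monitor]".freeze

# Hypothetical helper: build the title a monitor process would show.
def monitor_title(command_line)
  "#{MONITOR_PREFIX} #{command_line}"
end

reader, writer = IO.pipe
pid = fork do
  reader.close
  # Assigning $0 also updates the process title reported by ps.
  $0 = monitor_title($0)
  writer.puts $0
  writer.close
  exit!(0) # skip at_exit handlers inherited from the parent
end
writer.close
title_in_child = reader.read.chomp
Process.wait(pid)
puts title_in_child
```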

Once I dug in, though, I discovered you found a separate issue. The (ruby) process was actually a zombie process created when daemonizing telegraf. Turns out I just had to add a Process.detach to fix it.
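
A small sketch of what `Process.detach` does, independent of the actual change in this PR: a child that has exited stays in the process table as a zombie until its parent waits on it, and `Process.detach` starts a thread that does that waiting. Joining the thread here is only so the demo can show the result.

```ruby
# Start a short-lived child, as an intermediate daemonizing step might.
pid = fork { exit!(0) }

# Without a wait, the exited child would linger as a zombie for as long as
# this process lives. Process.detach returns a thread that waits on the pid.
reaper = Process.detach(pid)

# Only for the demo: block until the child has been reaped.
status = reaper.value
puts "child #{pid} reaped, success=#{status.success?}"
```

In a long-running daemon you would normally call `Process.detach(pid)` and carry on without joining the thread.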

Wanna give 'er the ol' re-review?

janxious commented 8 years ago

Case 1: Kill Monitor Process

Using signal 9 (SIGKILL) or 3 (SIGQUIT), this orphans the telegraf process, and the starting ruby process does nothing to restore the monitor once it's gone.

Case 2: Kill telegraf process

Using signal 9 or 3, the telegraf process dies, and the monitor process restarts.

Case 3: Kill starting ruby process

Using signal 9 or 3, everything dies within 15s.

Case 4: Kill Monitor Process, then Kill starting ruby process

Here's some fun output:

joel@hel [cleanup_telegraf_process *+] instrumentald $ SIGQUIT: quit
PC=0x4064ceb m=0

goroutine 0 [idle]:
runtime.mach_semaphore_wait(0x1103, 0x6659360, 0x56db320, 0x4026184, 0x56e0760, 0x56db320, 0x4058309, 0xffffffffffffffff, 0x56db320, 0x7fff5fbff244, ...)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/sys_darwin_amd64.s:411 +0xb
runtime.semasleep1(0xffffffffffffffff, 0x56db320)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/os1_darwin.go:423 +0xdf
runtime.semasleep.func1()
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/os1_darwin.go:439 +0x29
runtime.systemstack(0x7fff5fbff248)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/asm_amd64.s:307 +0xab
runtime.semasleep(0xffffffffffffffff, 0x0)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/os1_darwin.go:440 +0x36
runtime.notesleep(0x56dc0e8)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/lock_sema.go:166 +0xed
runtime.stopm()
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/proc.go:1538 +0x10b
runtime.findrunnable(0xc820025500, 0x0)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/proc.go:1976 +0x739
runtime.schedule()
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/proc.go:2075 +0x24f
runtime.goexit0(0xc820622180)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/proc.go:2210 +0x1f9
runtime.mcall(0x7fff5fbff3d0)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/asm_amd64.s:233 +0x5b

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc8200135ac)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/sema.go:47 +0x26
sync.(*WaitGroup).Wait(0xc8200135a0)
    /usr/local/Cellar/go/1.6.2/libexec/src/sync/waitgroup.go:127 +0xb4
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc8204ae030, 0xc8200102a0, 0x0, 0x0)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:339 +0xa0b
main.main()
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:274 +0x2624

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/asm_amd64.s:1998 +0x1

goroutine 5 [syscall]:
os/signal.signal_recv(0x0)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/sigqueue.go:116 +0x132
os/signal.loop()
    /usr/local/Cellar/go/1.6.2/libexec/src/os/signal/signal_unix.go:22 +0x18
created by os/signal.init.1
    /usr/local/Cellar/go/1.6.2/libexec/src/os/signal/signal_unix.go:28 +0x37

goroutine 50 [select, locked to thread]:
runtime.gopark(0x5100148, 0xc820408728, 0x4e8f5f0, 0x6, 0x18, 0x2)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/proc.go:262 +0x163
runtime.selectgoImpl(0xc820408728, 0x0, 0x18)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/select.go:392 +0xa67
runtime.selectgo(0xc820408728)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/select.go:215 +0x12
runtime.ensureSigM.func1()
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/signal1_unix.go:279 +0x32c
runtime.goexit()
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/asm_amd64.s:1998 +0x1

goroutine 51 [chan receive]:
main.main.func2(0xc820010300, 0xc8200102a0, 0xc82020e540)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:246 +0x47
created by main.main
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:256 +0x1ef1

goroutine 19 [select]:
github.com/influxdata/telegraf/agent.(*Agent).flusher(0xc8204ae030, 0xc8200102a0, 0xc820010660, 0x0, 0x0)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:262 +0x36a
github.com/influxdata/telegraf/agent.(*Agent).Run.func1(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:318 +0x7b
created by github.com/influxdata/telegraf/agent.(*Agent).Run
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:322 +0x8d9

goroutine 52 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc8201906c0, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc8201906c0, 0x6fc23ac00)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0

goroutine 53 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190720, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190720, 0x6fc23ac00)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0

goroutine 54 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc8201907b0, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc8201907b0, 0x6fc23ac00)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0

goroutine 55 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190810, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190810, 0x6fc23ac00)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0

goroutine 56 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190870, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190870, 0x6fc23ac00)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0

goroutine 57 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc8201908d0, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc8201908d0, 0x6fc23ac00)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0

goroutine 58 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190930, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190930, 0x6fc23ac00)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0

goroutine 59 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190990, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190990, 0x6fc23ac00)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0

goroutine 60 [select]:
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0xc8204ae030, 0xc8200102a0, 0xc820190a20, 0x6fc23ac00, 0xc820010660, 0x0, 0x0)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:139 +0x57e
github.com/influxdata/telegraf/agent.(*Agent).Run.func2(0xc8200135a0, 0xc8204ae030, 0xc8200102a0, 0xc820010660, 0xc820190a20, 0x6fc23ac00)
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:333 +0x7e
created by github.com/influxdata/telegraf/agent.(*Agent).Run
    /Users/jqr/code/instrumental/daemon/src/github.com/influxdata/telegraf/agent/agent.go:336 +0x9d0

rax    0xe
rbx    0x56dbfe0
rcx    0x7fff5fbff1d0
rdx    0x7fff5fbff248
rdi    0x1103
rsi    0x56db320
rbp    0x1103
rsp    0x7fff5fbff1d0
r8     0x56dbfe0
r9     0xc820026a00
r10    0xc820025ac0
r11    0x286
r12    0xe12928b4edd5
r13    0xe642c2bbe1f8
r14    0x1468e127b5690600
r15    0x56dac60
rip    0x4064ceb
rflags 0x286
cs     0x7
fs     0x0
gs     0x0

janxious commented 8 years ago

This does the thing it says. This as it exists is 👍 from me.

I'm not sure what we should do with the Monitor process. Should we improve the way monitoring of the monitor is set up? Thoughts, @jason-o-matic?

jason-o-matic commented 8 years ago

@janxious you're talking about Case 4, yeah? I'm not sure what you mean by "what we should do with the Monitor process". What happened that wasn't expected?

janxious commented 8 years ago

Yeah, case 4. It seems like killing the monitor should take out telegraf and/or the starting process.

jason-o-matic commented 8 years ago

I suppose we could handle non-KILL signals, but there's nothing we can do with KILLs. I'm not too worried about it since you're already off the rails if you're killing the monitor process.
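
To make the non-KILL option concrete, here is a rough sketch of a monitor that cleans up its child when it receives a catchable signal. Nothing here is from the PR: `HANDLED_SIGNALS` and `run_with_cleanup` are hypothetical names, and forwarding TERM to the child is an assumed policy.

```ruby
# Signals the monitor could intercept. KILL (9) is left out on purpose:
# it is never delivered to user code, so no handler can run for it.
HANDLED_SIGNALS = %w[TERM INT QUIT].freeze

# Hypothetical helper: run `command` and wait for it. If one of
# HANDLED_SIGNALS arrives first, send TERM to the command before returning.
# Returns [name of the signal received (or nil), Process::Status].
def run_with_cleanup(command)
  child_pid = Process.spawn(*command)
  received = nil
  # Keep handlers tiny: just record which signal arrived first.
  previous = HANDLED_SIGNALS.map do |name|
    [name, Signal.trap(name) { received ||= name }]
  end
  loop do
    reaped, status = Process.wait2(child_pid, Process::WNOHANG)
    return [received, status] if reaped

    if received
      Process.kill("TERM", child_pid)
      _, status = Process.wait2(child_pid)
      return [received, status]
    end
    sleep 0.1
  end
ensure
  # Put back whatever handlers were installed before this call.
  (previous || []).each { |name, handler| Signal.trap(name, handler || "DEFAULT") }
end
```

With this shape, a QUIT or TERM sent to the monitor would take the child down with it, which covers the signal 3 half of Case 1. A signal 9 still leaves the child orphaned.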

If we wanted to be really fancy we could have the starting process monitor the monitor and vice versa, but that seems too complicated for where we're at right now.

janxious commented 8 years ago

The watcher-watchers, clearly. Just thought you should know the results. I am okay with "nothing" as the answer to "what do we do now".

janxious commented 8 years ago

If we want to do something about cross-supervision, that is for another PR.