DarthSim / overmind

Process manager for Procfile-based applications and tmux
MIT License
2.82k stars 79 forks source link

Overmind is crashing tmux on ^C #115

Closed bensherman closed 2 years ago

bensherman commented 3 years ago

If I fire up a bunch of processes, and hit ^C, overmind generally stops the underlying processes and everything is fine.

Beginning with a recent update to tmux (v3.2a), it has started causing tmux to crash, and overmind exits with the error "overmind: Tmux client unexpectidly exited". (the mispelling here was recently fixed, thanks!)

When I strace the processes, I see tmux is segfaulting

[pid 1708098] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x110} ---
[pid 1708098] +++ killed by SIGSEGV +++

This happens on both macOS and Linux machines with the latest versions of tmux. I am unable to reproduce it with versions prior to tmux 3.

I recompiled tmux to force logging, and it looks like the tmux client dies, a new one is spawned, but it can't connect, and then everything dies.

first client:

==> tmux-client-1725271.log <==
1629137113.489688 cmd_pack_argv: argv[271]=-P
1629137113.489690 cmd_pack_argv: argv[272]=-F
1629137113.489693 cmd_pack_argv: argv[273]=%overmind-process #{pane_id} #{window_name} #{pane_pid}
1629137113.489696 cmd_pack_argv: argv[274]=/tmp/overmind-github-x8_n2_bZ-ZxCnk6mLszW9J/aqueduct-bridge
1629137113.489698 cmd_pack_argv: argv[275]=;
1629137113.489708 sending message 200 to peer 0x55f3c8c0b8d0 (4536 bytes)
1629137113.489720 client loop enter
1629137113.489748 client_signal: Child exited
1629137113.971429 client_signal: Interrupt
1629137113.993676 client loop exit

second client:

==> tmux-client-1725969.log <==
1629137113.996576 client started (1725969): version next-3.3, socket /tmp/tmux-1001/overmind-github-x8_n2_bZ-ZxCnk6mLszW9J, protocol 8
1629137113.996595 on Linux 4.9.0-16-amd64 #1 SMP Debian 4.9.272-2 (2021-07-19)
1629137113.996600 using libevent 2.0.21-stable (poll); ncurses 6.0
1629137113.996618 flags are 0x8010000
1629137113.996622 socket is /tmp/tmux-1001/overmind-github-x8_n2_bZ-ZxCnk6mLszW9J
1629137113.996629 trying connect
1629137113.996639 connect failed: Connection refused

server:

==> tmux-server-1725277.log <==
1629137113.993069 server_client_check_pane_buffer: pane %9 is off
1629137113.993072 server_client_check_pane_buffer: client-1725271 has 168 bytes used and 0 left for %10
1629137113.993075 server_client_check_pane_buffer: %10 has 168 minimum (of 168) bytes used
1629137113.993078 server_client_check_pane_buffer: pane %10 is off
1629137113.993082 server_client_check_pane_buffer: client-1725271 has 168 bytes used and 0 left for %11
1629137113.993085 server_client_check_pane_buffer: %11 has 168 minimum (of 168) bytes used
1629137113.993087 server_client_check_pane_buffer: pane %11 is off
1629137113.993138 control_write_callback: client-1725271: 8192 bytes available, 10 panes
1629137113.993142 control_check_age: client-1725271: %0 is 21 behind
1629137113.993146 control_write_pending: client-1725271: output block 487 (age 21) for %0 (used 0/273)

I think there is a race condition somewhere, and I am having trouble nailing it down - any help would be appreciated. I'll open an issue over at tmux/tmux if I can reproduce this without overmind (which I have not yet been able to do).

bensherman commented 3 years ago

More testing - I created a Procfile with the following contents, but the scripts didn't exist yet, and I got the same behavior:

sleepy: ./snoozy.sh
snoozy: ./snoozy.sh

And more strace data around the segfault:

[pid 1819623] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 1819637] write(5, "1629152057.071116 control_write_"..., 103) = 103
strace: Process 1819664 attached
[pid 1819663] <... clone resumed> child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4a3d224750) = 1819664
[pid 1819637] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x120} ---
[pid 1819623] <... nanosleep resumed> NULL) = 0
[pid 1819663] wait4(-1,  <unfinished ...>
[pid 1819664] close(10)                 = 0
[pid 1819623] futex(0x7f1178, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=3615844} <unfinished ...>
[pid 1819664] execve("./snoozy.sh", ["./snoozy.sh"], [/* 42 vars */] <unfinished ...>
[pid 1819663] <... wait4 resumed> 0x7ffd72bfef4c, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid 1819664] <... execve resumed> )    = -1 ENOENT (No such file or directory)
[pid 1819637] +++ killed by SIGSEGV +++
bensherman commented 3 years ago

I think I have isolated this to a tmux problem with the remain-on-exit flag: https://github.com/tmux/tmux/issues/2828

DarthSim commented 3 years ago

Thanks for digging into it! Awesome job!

bensherman commented 3 years ago

Looks like there is a fix in the works for tmux!

bensherman commented 2 years ago

Still waiting for the 3.3 release of tmux - https://github.com/tmux/tmux/issues/2880 any minute now!

bensherman commented 2 years ago

tmux 3.3a is out and the issue is gone! People running 3.2 may still hit this - hopefully you are able to upgrade or downgrade tmux from 3.2