DarthSim / overmind

Process manager for Procfile-based applications and tmux
MIT License

Overmind does not detect crashed process #182

Open magJ opened 4 months ago

magJ commented 4 months ago

I'm using overmind to run three processes. One of them, "api" (a nodejs process), ran out of memory and crashed. However, overmind still thinks that it's running.

app-user@machine:/app$ overmind ps
PROCESS   PID       STATUS
nginx     341       running
worker    343       running
api       346       running
app-user@machine:/app$ ps aux|grep 346
app-user   346  0.0  0.0      0     0 ?        Zs   May09   0:00 [sh] <defunct>
app-user  1092  0.0  0.1   3328  1608 pts/2    S+   01:26   0:00 grep 346

It looks like the api process (PID 346) has become a zombie, but overmind has not detected it.

Overmind version: 2.4.0
Operating system: Debian bookworm, based off the docker image node:20.11.1-bookworm-slim, and running on fly.io

This issue happened on two different machines, but I'm really struggling to reproduce it. It might be a tmux issue, since it sounds similar to https://github.com/tmux/tmux/issues/311, but I really don't know.

zhangcheng commented 4 months ago

I run into the same issue from time to time. It happened on an earlier version of overmind, and after upgrading to the latest 2.5.1 recently it is still happening. I think the zombie process is the shell process, which in turn runs the app process.

magJ commented 4 months ago

I spent a day trying to debug this issue without much success. I suspect that it's actually a tmux bug, but I haven't been able to figure out a reliable way to reproduce it.

DarthSim commented 4 months ago

Hey there,

This is definitely a bug in tmux: it is not handling SIGCHLD properly.

From Overmind's point of view, the process is still running, since Overmind can still send signals to it. The only ways to check whether a process is in the zombie state are to read its state file or to use the ps command. Neither is a good fit for polling at short intervals. And I believe that it's not Overmind's duty to kill zombies.
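
To illustrate the difference between the two checks, here is a minimal sketch for Linux. It assumes the "state file" is /proc/<pid>/status, and `proc_state` and `is_zombie` are hypothetical helper names, not part of Overmind. `kill -0` only tests whether the PID can be signalled, which is also true for a zombie; the `State:` line is what actually shows `Z`.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Overmind. Linux only (needs /proc).

# Print the one-letter kernel state of a PID (R, S, Z, ...).
# Prints nothing if the process does not exist.
proc_state() {
    sed -n 's/^State:[[:space:]]*\(.\).*/\1/p' "/proc/$1/status" 2>/dev/null
}

# Succeed only if the PID is currently a zombie.
is_zombie() {
    [ "$(proc_state "$1")" = "Z" ]
}

# kill -0 sends no signal; it only checks that the PID exists and that
# we are allowed to signal it. It succeeds for zombies too, so the two
# checks can disagree.
pid=$$
if kill -0 "$pid" 2>/dev/null && is_zombie "$pid"; then
    echo "$pid can be signalled but is a zombie"
else
    echo "$pid is not a zombie"
fi
```

Running `is_zombie 346` on the machine from the report would be the equivalent of the `ps aux | grep 346` check above.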

The workaround proposed in https://github.com/tmux/tmux/issues/311 should theoretically work: prepend your commands with trap 'pkill -CHLD tmux' 0; or trap 'pkill -CHLD tmux' EXIT;.
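
In a Procfile this would look something like `api: trap 'pkill -CHLD tmux' EXIT; node server.js`, where `node server.js` is only a placeholder command. The sketch below shows why the pattern is plausible: the trap belongs to the wrapper shell, so it still runs when the app itself is killed with SIGKILL, as happens in an out-of-memory kill. The trap body here is a harmless `echo` instead of `pkill -CHLD tmux`, and `run_with_trap` is a hypothetical name. Whether the extra SIGCHLD actually unsticks tmux is not something this demonstrates.

```shell
#!/bin/sh
# Stand-in for a Procfile command line. In a real Procfile the trap body
# would be: pkill -CHLD tmux   (as proposed in the tmux issue).
run_with_trap() {
    sh -c '
        trap "echo trap-fired" EXIT   # runs when this wrapper shell exits
        sh -c "kill -KILL \$\$"       # hypothetical app that gets SIGKILLed
        echo "app exit status: $?"
    '
}

run_with_trap
```

The wrapper shell reports the app's exit status and then fires the EXIT trap on its own way out.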

To be honest, Overmind was never meant to run in production. It was developed mostly as a dev tool.