Supervisor / supervisor

Supervisor process control system for Unix (supervisord)
http://supervisord.org

KillMode in systemd unit of Supervisor #1650

Closed NielsH closed 4 months ago

NielsH commented 4 months ago

Hello,

The default KillMode of Supervisor within its Debian package is KillMode=process. The man page of Systemd says:

       KillMode=
           Specifies how processes of this unit shall be killed. One of control-group, mixed, process, none.

           If set to control-group, all remaining processes in the control group of this unit will be killed on unit stop (for services: after the stop command is executed, as configured with ExecStop=). If set to mixed, the SIGTERM signal (see below) is sent to the main process while the subsequent SIGKILL signal (see below) is sent to all remaining
           processes of the unit's control group. If set to process, only the main process itself is killed (not recommended!). If set to none, no process is killed (strongly recommended against!). In this case, only the stop command will be executed on unit stop, but no process will be killed otherwise. Processes remaining alive after stop are left
           in their control group and the control group continues to exist after stop unless empty.

           Note that it is not recommended to set KillMode= to process or even none, as this allows processes to escape the service manager's lifecycle and resource management, and to remain running even while their service is considered stopped and is assumed to not consume any resources.

           Processes will first be terminated via SIGTERM (unless the signal to send is changed via KillSignal= or RestartKillSignal=). Optionally, this is immediately followed by a SIGHUP (if enabled with SendSIGHUP=). If processes still remain after the main process of a unit has exited or the delay configured via the TimeoutStopSec= has passed,
           the termination request is repeated with the SIGKILL signal or the signal specified via FinalKillSignal= (unless this is disabled via the SendSIGKILL= option). See kill(2) for more information.

           Defaults to control-group

Because of this, we are seeing that Supervisor-managed processes that do not respond to SIGTERM and whose stopwaitsecs exceeds the systemd TimeoutStopSec value (90 seconds by default) linger indefinitely. For example, the configuration Laravel recommends for its workers sets stopwaitsecs to 3600.

This is because systemd sends SIGKILL to the supervisor process after 90 seconds. Due to KillMode=process, its child processes keep running, and because Supervisor is no longer running, it cannot kill them once the stopwaitsecs timeout is exceeded.

This is also logged by systemd:

Jul  4 17:30:05 <snip> systemd[1]: supervisor-snip.service: Unit process 1171111 (php8.2) remains running after unit stopped.
Jul  4 20:28:06 <snip> systemd[1]: supervisor-snip.service: Unit process 1033640 (php8.2) remains running after unit stopped.
Jul  4 20:28:06 <snip> systemd[1]: supervisor-snip.service: Unit process 1033646 (php8.2) remains running after unit stopped.
Jul  4 20:28:07 <snip> systemd[1]: supervisor-snip.service: Unit process 1030328 (php8.2) remains running after unit stopped.

We are considering changing the systemd KillMode to the default of control-group, which does resolve the issue. However, because the packaged default is process, we would like to ask the Supervisor developers for the rationale behind setting this value to process, despite systemd not recommending it.

Perhaps there is a reason we did not think of, in which case we'd like to know before changing it on our side.

I am aware that this post may be better suited to a Debian-specific mailing list. However, since that list did not seem very active, I chose to post my question here, hoping that someone can offer some insight.

Thank you!

mnaberez commented 4 months ago

> This is because Systemd does a SIGKILL of the supervisor process after 90 seconds.

Sending SIGKILL to supervisord is undesirable because supervisord will not be able to finish writing the log files, will not be able to clean up any temporary files it has created, and any child processes supervisord spawned will be orphaned.

> Because of this, we are seeing that Supervisor-managed processes that do not respond to a SIGTERM and that have a stopwaitsecs exceeding the systemd value TimeoutStopSec (by default 90 seconds) will remain lingering indefinitely. I.e. the Laravel recommended configuration for their workers has a stopwaitsecs value of 3600.

When supervisord receives a request to exit (by signal or supervisorctl command), it will send the stopsignal to each of its child processes. It will wait for all of its child processes to exit, then it will exit. During this time, supervisord will log messages like waiting for <processname> to stop. If stopwaitsecs is set to 3600 seconds for a process (one hour), then supervisord will wait up to one hour for the process to exit on its own. After stopwaitsecs has elapsed, supervisord will terminate the process with SIGKILL and then it will finally exit.
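The stop sequence described above (send the stop signal, wait up to stopwaitsecs, then fall back to SIGKILL) can be illustrated with a small standalone sketch. This is not Supervisor's own code: the function name `stop_process` and its parameters are hypothetical, chosen only to mirror the `stopsignal` and `stopwaitsecs` options, and it assumes a POSIX system.

```python
import signal
import subprocess


def stop_process(proc: subprocess.Popen,
                 stopsignal: int = signal.SIGTERM,
                 stopwaitsecs: float = 10.0) -> int:
    """Ask a child to stop, escalating to SIGKILL after stopwaitsecs.

    Hypothetical helper that mimics the behaviour described in this thread;
    it is not taken from the Supervisor source.
    Returns the child's exit code (negative signal number if killed by a signal).
    """
    # Step 1: deliver the configured stop signal.
    proc.send_signal(stopsignal)
    try:
        # Step 2: give the child up to stopwaitsecs to exit on its own.
        return proc.wait(timeout=stopwaitsecs)
    except subprocess.TimeoutExpired:
        # Step 3: the child did not exit in time, so force it with SIGKILL.
        proc.kill()
        return proc.wait()
```

If the parent itself is killed during step 2, step 3 never runs, which is the situation reported in the original post.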

Consider configuring the system such that supervisord has the opportunity to exit cleanly: increase TimeoutStopSec to be larger than the largest stopwaitsecs or decrease all stopwaitsecs to be less than TimeoutStopSec.
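One way to check a configuration against this advice is to compare the largest stopwaitsecs with the systemd stop timeout. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the configuration is passed in as text, 10 seconds is assumed as the stopwaitsecs default when the option is absent, and 90 seconds is the TimeoutStopSec default quoted in this thread. It uses Python's standard `configparser` rather than Supervisor's own parser, so include files and other Supervisor-specific features are not handled.

```python
import configparser

DEFAULT_STOPWAITSECS = 10      # assumed default when stopwaitsecs is not set
DEFAULT_TIMEOUT_STOP_SEC = 90  # systemd default cited in this thread


def max_stopwaitsecs(config_text: str) -> int:
    """Return the largest stopwaitsecs across all [program:x] sections."""
    # RawConfigParser avoids interpolation errors on %(...)s expressions.
    parser = configparser.RawConfigParser(inline_comment_prefixes=(";",))
    parser.read_string(config_text)
    waits = [
        parser.getint(section, "stopwaitsecs", fallback=DEFAULT_STOPWAITSECS)
        for section in parser.sections()
        if section.startswith("program:")
    ]
    return max(waits, default=0)


def needs_longer_timeout(config_text: str,
                         timeout_stop_sec: int = DEFAULT_TIMEOUT_STOP_SEC) -> bool:
    """True if some program may still be waiting when systemd's timeout fires."""
    return max_stopwaitsecs(config_text) >= timeout_stop_sec
```

For a worker section with `stopwaitsecs=3600` and the 90 second timeout, `needs_longer_timeout` returns `True`, which signals that either TimeoutStopSec should be raised or stopwaitsecs lowered.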

> We are considering changing the Systemd KillMode to the default of control-group, which does resolve the issue. However because the default is process, we would like to ask the Supervisor developers for the rationale of having this value on process, despite it not being recommended by Systemd.

The Supervisor project only publishes Python packages to PyPI and these do not contain integrations with any operating system (init scripts, unit files, etc). The integration with Systemd you are using was created by others who are not part of the Supervisor project itself.