airbnb / hypernova

A service for server-side rendering your JavaScript views
MIT License

Ensure worker processes shut down during the shutdown sequence #114

Closed schleyfox closed 6 years ago

schleyfox commented 6 years ago

This uses signals to kill any worker processes that are still alive when shutting down the coordinator times out. We observed that the usual mechanisms for ensuring child processes go down (the kill message and the internal cluster disconnect event failsafe) are both handled on the worker's event loop, so a worker stuck in an infinite loop may never terminate. This can leave lingering processes behind, leading to memory and CPU exhaustion.
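As a rough illustration of the escalation described above, the sketch below sends SIGTERM to every worker that is still alive and then, after a grace period, sends SIGKILL to whatever remains. The helper names `killLiveWorkers` and `escalateShutdown` and the `graceMs` parameter are hypothetical and are not taken from this PR; the only real APIs assumed are `worker.isDead()` and `worker.process.kill(signal)` from Node's `cluster` module.

```javascript
// Hypothetical helper (not hypernova's actual code): send `signal` to every
// worker that has not exited yet and report how many were signalled.
// `workers` is assumed to be an array of cluster.Worker-like objects.
function killLiveWorkers(workers, signal) {
  const live = workers.filter((worker) => !worker.isDead());
  // Signal the underlying child process directly. Unlike an IPC "kill"
  // message, this does not depend on the worker's event loop being responsive.
  live.forEach((worker) => worker.process.kill(signal));
  return live.length;
}

// Hypothetical escalation: try SIGTERM first, then SIGKILL after `graceMs`.
// Resolves with the number of workers that needed SIGKILL.
function escalateShutdown(workers, graceMs) {
  killLiveWorkers(workers, 'SIGTERM');
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve(killLiveWorkers(workers, 'SIGKILL'));
    }, graceMs);
  });
}
```

In a real coordinator the worker list would presumably come from `Object.values(cluster.workers)`. SIGKILL cannot be caught or ignored, which is why it serves as the last resort.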

This also changes the respawn logic: when a worker exits with a non-zero code while the coordinator is already closing, we no longer try to respawn it.
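The respawn guard could look roughly like the sketch below. The function `onWorkerExit`, the `state.closing` flag, and the returned strings are hypothetical names chosen for illustration; the actual change in this PR may be structured differently.

```javascript
// Hypothetical exit handler (not hypernova's actual code).
// `code` is the worker's exit code, or null if it was killed by a signal;
// `signal` is the signal name in that case. `fork` starts a replacement.
function onWorkerExit(state, code, signal, fork) {
  const abnormal = code !== 0; // also true when code is null (killed by signal)
  if (!abnormal) {
    return 'exited';
  }
  if (state.closing) {
    // The coordinator is shutting down, so a replacement would only get in
    // the way of the shutdown.
    return 'not-restarting';
  }
  fork();
  return 'restarted';
}
```

With Node's `cluster` module this would be wired up along the lines of `cluster.on('exit', (worker, code, signal) => onWorkerExit(state, code, signal, () => cluster.fork()))`.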

I tested this with workers that were put into an infinite loop and that also blocked the SIGTERM signal by installing a handler for it. You can see the output from one of these runs below:

2018-04-05T21:50:27.010Z - info: Worker #7 shutting down.
2018-04-05T21:50:31.997Z - info: Closing the coordinator took too long. 
{ timeout: 5000 }
2018-04-05T21:50:31.999Z - info: Coordinator killing 1 live workers with SIGTERM
2018-04-05T21:50:34.002Z - info: Killing workers with SIGTERM took too long 
{ timeout: 2000 }
2018-04-05T21:50:34.003Z - info: Coordinator killing 1 live workers with SIGKILL
2018-04-05T21:50:34.007Z - info: Worker #3 has disconnected
2018-04-05T21:50:34.008Z - error: Worker #3 died with code SIGKILL during close. Not restarting.
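For anyone wanting to reproduce this kind of stuck worker outside of hypernova, a minimal sketch might look like the following. It is not the test setup used for this PR: `STUCK_WORKER_SOURCE` and `runStuckWorker` are hypothetical names, and it uses a plain child process rather than a cluster worker. It assumes a POSIX system, where writes to a stdout pipe are synchronous and signals behave as described.

```javascript
const { spawn } = require('child_process');

// Source for a deliberately stuck child. Installing a SIGTERM handler
// suppresses the default "terminate" action, and the infinite loop blocks the
// event loop, so the handler itself never gets a chance to run.
const STUCK_WORKER_SOURCE = `
  process.on('SIGTERM', () => {});
  process.stdout.write('ready\\n');
  while (true) {}
`;

// Hypothetical harness: sends SIGTERM once the child reports it is ready,
// waits `graceMs`, then sends SIGKILL. Resolves with what happened.
function runStuckWorker(graceMs) {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ['-e', STUCK_WORKER_SOURCE], {
      stdio: ['ignore', 'pipe', 'inherit'],
    });
    let survivedSigterm = false;
    child.on('error', reject);
    child.stdout.once('data', () => {
      child.kill('SIGTERM');
      setTimeout(() => {
        // Still running if neither an exit code nor a signal was recorded.
        survivedSigterm = child.exitCode === null && child.signalCode === null;
        child.kill('SIGKILL');
      }, graceMs);
    });
    child.on('exit', (code, signal) => {
      resolve({ survivedSigterm, code, signal });
    });
  });
}
```

Calling `runStuckWorker(2000)` mirrors the 2000 ms SIGTERM timeout shown in the log above; the child is expected to outlive SIGTERM and only exit once SIGKILL arrives.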

/cc @goatslacker @ljharb @martinwin