Circusd unresponsive when cmd is wrong and max_retry = -1

jeroenhe commented 3 years ago

Problem description

In short: circusd version 0.17.1 becomes unresponsive when a cmd is wrongly defined in an application's configuration and max_retry is also set to -1.

What does happen

When circusd is started in the above situation:

circusd starts outputting warnings at an alarmingly high rate: circus[1] [WARNING] error in 'app3': [Errno 2] No such file or directory: '/srv/mine/app3-broken': '/srv/mine/app3-broken'
circusd becomes unresponsive to queries from circusctl
other configured applications that were not yet started won't get started

What I'd like to happen instead

circusd stays responsive to queries from circusctl
circusd keeps trying to start my app with the watcher, but with a small configurable delay, so it doesn't get overloaded.
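
The desired behaviour could look roughly like the sketch below. This is not how circus is implemented; `start_with_delay` and `retry_delay` are hypothetical names used only for illustration. The idea is that a failed spawn is followed by a non-blocking sleep, so the event loop stays free to answer other requests even when max_retry is -1.

```python
import asyncio


async def start_with_delay(cmd, retry_delay=1.0, max_retry=-1):
    """Try to spawn cmd; after a failure, wait retry_delay seconds and retry.

    max_retry = -1 means retry forever. Returns the started process, or
    None when max_retry attempts have all failed.
    (Hypothetical helper, not part of circus.)
    """
    attempts = 0
    while max_retry < 0 or attempts < max_retry:
        attempts += 1
        try:
            return await asyncio.create_subprocess_exec(*cmd)
        except OSError as exc:  # e.g. [Errno 2] No such file or directory
            print(f"error starting {cmd[0]!r}: {exc}")
            # asyncio.sleep yields to the event loop, so other tasks
            # (such as answering control queries) keep running meanwhile.
            await asyncio.sleep(retry_delay)
    return None
```

With a broken cmd and a finite max_retry, `asyncio.run(start_with_delay(["/srv/mine/app3-broken"], retry_delay=0.5, max_retry=3))` logs three errors about half a second apart and then gives up, instead of spinning in a tight loop.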

What about setting max_retry to something other than -1?

A possible workaround could be to set max_retry to (say) 5. This stops circusd from becoming unresponsive, but it also means that a process with a correct cmd will no longer be restarted once it has stopped 5 times in a row for other reasons.

Reproducing the issue

I have created a proof of concept for the issue so it can be easily reproduced. Instructions on running it are in the README. I hope this helps.

MFlossmann commented 3 years ago

Apart from the general question of “Should circus be ‘taken hostage’ by faulty commands?”: Would the Flapping Plugin solve your issue?

jeroenhe commented 3 years ago

Apart from the general question of “Should circus be ‘taken hostage’ by faulty commands?”: Would the Flapping Plugin solve your issue?

Thank you for your reply.

In my proof of concept I have already made use of the flapping plugin, but this doesn't solve the issue of circusd becoming unavailable. The config related to the flapping plugin looks like this:

[plugin:flapping]
use = circus.plugins.flapping.Flapping
# the number of times a process can restart, within window seconds, before we consider it flapping (default: 2)
attempts = 2
# the time window in seconds to test for flapping. If the process restarts more than attempts times within this time window, we consider it a flapping process. (default: 1)
window = 60
# the number of times we attempt to start a process that has been flapping, before we abandon and stop the whole watcher. (default: 5) Set to -1 to disable max_retry and retry indefinitely.
max_retry = -1
# time in seconds to wait until we try to start again a process that has been flapping. (default: 7)
retry_in = 7

If there is something I can change in this configuration to prevent circusd from becoming unresponsive, please tell me :)
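
To make the meaning of attempts and window concrete, here is a minimal sketch of the rule described in the config comments above ("restarts more than attempts times within window seconds"). It is an illustration only, not the actual code of the Flapping plugin; `FlapDetector` and `record_restart` are hypothetical names.

```python
import time
from collections import deque


class FlapDetector:
    """Sliding-window restart counter (hypothetical, not circus code)."""

    def __init__(self, attempts=2, window=60.0):
        self.attempts = attempts      # allowed restarts inside the window
        self.window = window          # window length in seconds
        self._restarts = deque()      # timestamps of recent restarts

    def record_restart(self, now=None):
        """Record one restart; return True if the process is now flapping."""
        now = time.monotonic() if now is None else now
        self._restarts.append(now)
        # Forget restarts that have fallen out of the time window.
        while self._restarts and now - self._restarts[0] > self.window:
            self._restarts.popleft()
        # Flapping means MORE than `attempts` restarts within the window.
        return len(self._restarts) > self.attempts
```

With attempts = 2 and window = 60, restarts at 0 s, 10 s and 20 s mark the process as flapping on the third restart, while restarts spaced 100 s apart never do.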

I understand your "should not be taken hostage" feeling, but for us it does happen. For example, we have lots of Java applications, and sometimes we accidentally deploy an application to a server that doesn't yet have the Java runtime it references, which then causes mayhem for all circusd-managed processes on that server. Such mistakes are hard to prevent, and they are usually easy and quick to repair, but the bigger problem is that none of the other circusd-managed services are managed any more (and circusd requires a restart, not a reload).

circus-tent / circus

Circusd unresponsive when cmd is wrong and max_retry = -1 #1157