circus-tent / circus

A Process & Socket Manager built with zmq
http://circus.readthedocs.org/
Other
1.55k stars 257 forks source link

Flapping plugin breaks circusctl stop with numprocesses > 1 #965

Open mwicat opened 8 years ago

mwicat commented 8 years ago

For the following config (v0.13.0):

[circus]
check_delay = 5
endpoint = tcp://127.0.0.1:5555
pubsub_endpoint = tcp://127.0.0.1:5556
stats_endpoint = tcp://127.0.0.1:5557
httpd = False
;include_dir = /etc/circus/conf.d
;debug = True

[plugin:flapping]
use = circus.plugins.flapping.Flapping
retry_in = 3
max_retry = 2

[watcher:socat]
numprocesses = 4
cmd = socat
args = open:/dev/zero open:/dev/null

Looks like when numprocesses > 1 and flapping is enabled, we cannot successfully stop the service because flapping keep starting it again. When we disable flapping or set numprocesses = 1 circusctl stop behaves ok.

trying to stop service, it's started again:

[~] 2016-03-08 11:53:43# circusctl status
circusd-stats: active
plugin:flapping: active
socat: active
[~] 2016-03-08 11:53:48# circusctl stop socat
ok
[~] 2016-03-08 11:53:57# circusctl status
circusd-stats: active
plugin:flapping: active
socat: stopped
[~] 2016-03-08 11:53:58# circusctl status
circusd-stats: active
plugin:flapping: active
socat: active

disabling flapping:

[~] 2016-03-08 11:54:01# circusctl stop plugin:flapping
ok
[~] 2016-03-08 11:54:15# circusctl status
circusd-stats: active
plugin:flapping: stopped
socat: active
[~] 2016-03-08 11:54:19# circusctl stop socat
ok
[~] 2016-03-08 11:54:23# circusctl status
circusd-stats: active
plugin:flapping: stopped
socat: stopped
[~] 2016-03-08 11:54:31# circusctl status
circusd-stats: active
plugin:flapping: stopped
socat: stopped
mwicat commented 8 years ago

@k4nar Can you reproduce it?

mwicat commented 8 years ago

I found the cause of this bug:

These are logs in circusd produced by "circusctl stop socat" when numprocesses = 1:

2016-03-30 12:10:11 circus[31949] [DEBUG] Received an event from circusd: ['watcher.socat.kill', '{"process_pid":31948,"time":1459332611.954768}']
2016-03-30 12:10:11 circus[31949] [DEBUG] Removing 31948 from socat
2016-03-30 12:10:11 circus[31949] [DEBUG] Stopping the periodic callback for socat
2016-03-30 12:10:12 circus[31940] [INFO] socat stopped
2016-03-30 12:10:12 circus[31949] [DEBUG] Received an event from circusd: ['watcher.socat.reap', '{"exit_code":143,"process_pid":31948,"time":1459332612.055073}']
2016-03-30 12:10:12 circus[31949] [DEBUG] Received an event from circusd: ['watcher.socat.stop', '{"time":1459332612.055173}']

These are logs in circusd produced by "circusctl stop socat" when numprocesses = 2:

2016-03-30 12:09:30 circus[31898] [DEBUG] Received an event from circusd: ['watcher.socat.kill', '{"process_pid":31896,"time":1459332570.500174}']
2016-03-30 12:09:30 circus[31898] [DEBUG] Removing 31896 from socat
2016-03-30 12:09:30 circus[31898] [DEBUG] Received an event from circusd: ['watcher.socat.kill', '{"process_pid":31897,"time":1459332570.500499}']
2016-03-30 12:09:30 circus[31898] [DEBUG] Removing 31897 from socat
2016-03-30 12:09:30 circus[31898] [DEBUG] Stopping the periodic callback for socat
2016-03-30 12:09:30 circus[31888] [INFO] socat stopped
2016-03-30 12:09:30 circus[31895] [INFO] socat: flapping detected: retry in  3s (attempt number 1)
2016-03-30 12:09:30 circus[31898] [DEBUG] Received an event from circusd: ['watcher.socat.reap', '{"exit_code":143,"process_pid":31896,"time":1459332570.600711}']
2016-03-30 12:09:30 circus[31898] [DEBUG] Received an event from circusd: ['watcher.socat.reap', '{"exit_code":143,"process_pid":31897,"time":1459332570.600817}']
2016-03-30 12:09:30 circus[31898] [DEBUG] Received an event from circusd: ['watcher.socat.stop', '{"time":1459332570.600865}']
2016-03-30 12:09:33 circus[31898] [DEBUG] Received an event from circusd: ['watcher.socat.spawn', '{"process_pid":31918,"time":1459332573.60254}']
2016-03-30 12:09:33 circus[31898] [DEBUG] Starting the periodic callback for socat
2016-03-30 12:09:33 circus[31898] [DEBUG] Adding 31918 in socat
2016-03-30 12:09:33 circus[31898] [DEBUG] Received an event from circusd: ['watcher.socat.spawn', '{"process_pid":31919,"time":1459332573.605919}']
2016-03-30 12:09:33 circus[31898] [DEBUG] Adding 31919 in socat
2016-03-30 12:09:33 circus[31888] [INFO] socat started
2016-03-30 12:09:33 circus[31898] [DEBUG] Received an event from circusd: ['watcher.socat.start', '{"time":1459332573.608459}']
2016-03-30 12:09:33 circus[31898] [DEBUG] Unknown action: 'start'
2016-03-30 12:09:33 circus[31898] [DEBUG] {'time': 1459332573.608459}

The problem lies in https://github.com/circus-tent/circus/blob/master/circus/plugins/flapping.py#L108 :

By default self._get_conf(conf, 'attempts') == 2 because config variable attempts is 2 (http://circus.readthedocs.org/en/latest/for-ops/using-plugins/#flapping)

Now, for every watched process that is killed (no matter if on purpose with circusctl or by accident) circus generates "reap" event.

Flapping plugin gathers every "reap" event that happened in time window in timeline array and when len(timeline) == attempts then it marks watcher as flapping and schedules it to start after _retryin seconds.

So the case of stopping watcher with numprocesses == 1 works only because there is only one reap event generated in time window, but for numprocesses == n there are n reap events generated and this causes flapping.py to mark watcher as flapping because in most cases n > attempts

The workaround is to set attempts so that its bigger than numprocesses but this behaviour is clearly a bug because flapping.py shouldn't take processes stopped by user under consideration when computing if watcher is in flapping state or not.

k4nar commented 8 years ago

Thank you very much for this very thorough explanation. I didn't have much time to search the cause myself.

flapping.py shouldn't take processes stopped by user under consideration when computing if watcher is in flapping state or not.

Yes, I think this would be the best solution. Would you be up for trying to tackle this one in a PR ? :)