circus-tent / circus

A Process & Socket Manager built with zmq
http://circus.readthedocs.org/

isolate circusd from watcher/env issues #446

Open davidbirdsong opened 11 years ago

davidbirdsong commented 11 years ago

In experimenting w/ hooks for a watcher, I've noticed that a hook that references an un-importable module will cause all of circusd to crash upon restart.

Shouldn't circusd be able to start up based only on what's found in the [circus] config section? If I push a bad watcher config (I use includes heavily), I want to know that everything else hanging off of circusd will be protected from it.
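
For context, a hook is just a dotted path to a Python callable that circusd imports and runs in-process; a minimal sketch, assuming the `(watcher, arbiter, hook_name)` callable signature from the docs (the `mymodule.hooks` path is hypothetical):

```python
# mymodule/hooks.py -- hypothetical hook module, referenced from a watcher
# section as e.g.  hooks.before_start = mymodule.hooks.before_start
# Any top-level import that fails here (a typo, a missing package) currently
# takes down all of circusd when it loads the watcher config on restart.
import socket


def before_start(watcher, arbiter, hook_name):
    """Runs inside the circusd process before the watcher's processes start."""
    print("starting %s on %s" % (watcher.name, socket.gethostname()))
    return True  # False tells circus not to start this watcher
```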

tarekziade commented 11 years ago

that makes total sense. imho the best thing we can do is to raise an error on start up and abort. what do you think?

davidbirdsong commented 11 years ago

yeah, possibly add the watcher to the list of watchers, but record its state as parse_failed or some other expression of the fact that its config caused a non-recoverable error, so that "circusctl watcher list" shows it in an error state.

tarekziade commented 11 years ago

should we really run at all if a watcher config is busted ?

davidbirdsong commented 11 years ago

I think so. Why penalize all the other watchers that are perfectly not-busted?

Also, why penalize circusd itself? Shouldn't the master circus process always be able to reach a consistent running state?

tarekziade commented 11 years ago

> Why penalize all the other watchers that are perfectly not-busted?

I am thinking: a stack that's running using Circus probably wants to have all its processes running. How can you know if your stack is valid if some of the workers are not working?

> Shouldn't the master circus process always be able to reach a consistent running state?

The failure you are describing happens when you start circusd only, right? So if circusd never starts because the config is busted, raising an error pointing at the faulty hook, you know you have to fix it, no?

davidbirdsong commented 11 years ago

> I am thinking: a stack that's running using Circus probably wants to have all its processes running. How can you know if your stack is valid if some of the workers are not working?

My preferred method of discovering this is via circusctl, not an error in a log. I expect to leverage the pub/sub parts of circus later once I'm more familiar. It's been my observation that circusctl will time out when circus is inside of a long-running hook.

I agree that there are plenty of use-cases where the entire stack should be running, but there are other uses where that's not as true.

Imagine an HDFS cluster with many datanodes and one namenode. It's not unreasonable to run datanodes on all nodes in a set of node1-5 and then run the namenode on node1. Let's say that I'd like the datanode watcher on node1 to pick up a config change. In doing a full circus restart on node1, a bad datanode config could block the (re)starting of the namenode which means an entire HDFS cluster is down due to a bad, but unrelated config file.

With the addition of reloadconfig, full circus restarts are not necessary as often, since a change to the aforementioned datanode config can be picked up without a full circus restart. But now I'm curious how reloadconfig handles bad watcher configs, and how other watchers and circusd are affected in that case.

> The failure you are describing happens when you start circusd only, right? So if circusd never starts because the config is busted, raising an error pointing at the faulty hook, you know you have to fix it, no?

I'd rather discover programmatically, via circusctl or the API, that a watcher is in a failed state. When I launch or restart circus through runit, I'm only aware of a bad config by hitting a timeout when interacting via circusctl. Then I check the log to see whether it indicates which watcher left circusd unresponsive. I'd much rather be able to pinpoint it with circusctl status.
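
As a rough sketch of what "discover programmatically" could mean, using circus's Python client; the dedicated parse_failed-style state is still hypothetical, so today this only distinguishes active from stopped watchers:

```python
# Rough sketch: ask circusd for its watchers and flag anything not "active".
# A "parse_failed"-style state does not exist yet; today a busted hook tends
# to make circusd unresponsive, so these calls would just time out.
from circus.client import CircusClient

client = CircusClient(timeout=5)  # talks to circusd's control endpoint over zmq
watchers = client.call({"command": "list", "properties": {}})["watchers"]

for name in watchers:
    reply = client.call({"command": "status", "properties": {"name": name}})
    if reply["status"] != "active":
        print("watcher %r is unhealthy: %s" % (name, reply["status"]))
```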

I'll admit that I'm still new to circus and I keep looking through the lens of someone who relied on and leveraged the well-known states of supervisord, the daemon, and the child processes that it managed. The features in circus are very cool and have allowed me to envision some seriously cool service deployment plumbing.

Perhaps sharing what I'm intending to do would be a better illustration of my concerns.

I like that hooks can provide a wrapping around dumb programs. I'd like to leverage them to provide a poor man's service discovery or registry for programs that would otherwise be prohibitively cumbersome to hook into something like zookeeper. It appears that circus directly imports and runs the python code that I supply as a hook and it concerns me that there's little protection from me writing a bad hook or not considering all edge cases and messing up the core circusd runtime. I can wrap everything in a big try/except block, but I'd much rather count on some sort of isolation native to circus if for no other reason than the fact that circus committers adhere to testing and review and therefore emit higher quality code than I.
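
The "big try/except block" mentioned above is about the only isolation a hook author has today; a minimal sketch of that defensive pattern (the decorator and hook names are hypothetical, and note it does nothing for the original report, since an un-importable module kills circusd before any of this runs):

```python
# Hypothetical defensive wrapper for hooks: swallow anything the hook body
# raises so buggy user code cannot crash circusd. It cannot help with a
# failed top-level import, which happens before this code ever runs.
import functools
import logging

logger = logging.getLogger("myhooks")  # hypothetical logger name


def never_raise(hook):
    @functools.wraps(hook)
    def wrapper(watcher, arbiter, hook_name):
        try:
            return hook(watcher, arbiter, hook_name)
        except Exception:
            logger.exception("hook %s failed for watcher %s", hook_name, watcher.name)
            return True  # report success so the watcher is not blocked
    return wrapper


@never_raise
def after_start(watcher, arbiter, hook_name):
    # register the service with a discovery system, ping an endpoint, etc.
    return True
```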

Natim commented 11 years ago

I got your point.

In that case I would not use hooks for that, but create another daemon plugged into the zmq endpoints that does the zookeeper work, without any risk of breaking anything. Most of the code you will need for that already exists.

We are doing a sprint about clustering management with circus on July 8th/9th. I look forward to hearing your needs about that.

tarekziade commented 11 years ago

> It's been my observation that circusctl will time out when circus is inside of a long-running hook.

Yeah, hooks need to be fast. Notice that plugins, OTOH, are separate processes; they're better suited for long-running tasks.
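
For reference, a plugin is a class circus runs in a separate process and feeds with the event stream; a minimal sketch, assuming the `circus.plugins.CircusPlugin` base class and its `handle_recv` callback (the class name and what it does with events are made up):

```python
# Hypothetical plugin sketch: plugins live in their own process and subscribe
# to circusd's pub/sub event stream, so a bug here should not crash circusd
# the way an in-process hook can.
from circus.plugins import CircusPlugin


class ServiceRegistry(CircusPlugin):
    """Could forward watcher start/stop events to an external registry."""

    name = "service_registry"  # hypothetical plugin name

    def handle_recv(self, data):
        # data is the raw pub/sub payload published by circusd
        # (a topic such as 'watcher.<name>.<action>' plus a json body)
        print("circus event: %r" % (data,))
```

Plugins are declared in the config with a `[plugin:...]` section and a `use = dotted.path.to.ServiceRegistry` line.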

> With the addition of reloadconfig, full circus restarts are not necessary as often, since a change to the aforementioned datanode config can be picked up without a full circus restart. But now I'm curious how reloadconfig handles bad watcher configs, and how other watchers and circusd are affected in that case.

It currently does not, since we're not loading the plugin code until it's called.

Here's my proposal: let's add the feature you're mentioning, but with a "strict" flag we can add in [circus].

When this flag is present, circusd will refuse to start on import errors. On reloads, if hooks are changed, circusd will refuse to apply the change if one plugin is busted.

When this flag is not present, the hook is marked as busted, a warning is emitted, and it is ignored going forward.

I propose that strict mode be the default, because I expect that people adding plugins will not programmatically check for their status.
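
A sketch of how that strict/lenient split could behave when circusd resolves a watcher's hook dotted names at config (re)load time; purely illustrative, none of these names are existing circus internals:

```python
# Purely illustrative: strict mode aborts the start/reload on a broken hook,
# lenient mode logs it, marks it busted, and skips it.
import importlib
import logging

logger = logging.getLogger("circusd")


def load_hook(dotted_name, strict=True):
    """Resolve 'pkg.module.func' into a callable, or handle the failure."""
    module_name, _, func_name = dotted_name.rpartition(".")
    try:
        module = importlib.import_module(module_name)
        return getattr(module, func_name)
    except (ImportError, AttributeError):
        if strict:
            raise  # refuse to start, or refuse to apply the reload
        logger.warning("hook %r is busted, ignoring it", dotted_name)
        return None  # caller treats None as "hook disabled"
```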

tarekziade commented 11 years ago

I did not have time to do this one but we'll definitely tackle it for 1.0

Natim commented 10 years ago

I had a similar issue today, and I think we should be able to configure a hook path from which hooks are loaded, so that circus can do:

```python
import site
site.addsitedir(circusd_hook_path)  # circusd_hook_path: the proposed configurable hook directory
```

SEJeff commented 10 years ago

I've been hit by this exact same bug when using celery + raven. All of our django apps use the most wonderful sentry for logging all errors, and in each application's circus.ini (we keep them in the same repo as the actual app and do a reloadconfig with each deploy) we have a section such as:

```ini
[watcher:appname_watcher]
...
hooks.after_start = appname.hooks.run_raven
```

When appname/__init__.py imported celery, all was well, except when we restarted circusd. Then it failed and all hell broke loose. This was the exact same issue.
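
To make the failure mode concrete, here is roughly what such a hook module can look like (a hypothetical reconstruction, not SEJeff's actual code, using the same `(watcher, arbiter, hook_name)` signature as above); the top-level imports, plus whatever appname/__init__.py drags in, are what kill circusd if they fail while the config is being loaded:

```python
# appname/hooks.py -- hypothetical reconstruction of an after_start hook that
# wires raven/sentry signals up for celery. If these imports (or anything
# appname/__init__.py imports, e.g. celery) blow up, circusd itself dies on
# restart instead of just that one watcher failing.
from raven import Client
from raven.contrib.celery import register_signal


def run_raven(watcher, arbiter, hook_name):
    client = Client()  # DSN taken from the environment in this sketch
    register_signal(client)
    return True
```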

We have many different apps hosted on each appserver, and jenkins deploys them all and runs them all under one circusd. We do not want the failure of a single app to take down the entire circusd; doing so kind of defeats the purpose of circusd as a process supervisor.

TL;DR: I 100% agree with @davidbirdsong that this is very bad behavior that should be fixed.