Issues with reload - Githubissues

georgexsh commented 12 years ago

config errors if something is wrong in config file, master would exit immediately, leveling workers out there then workers have to be killed manually it would be more robust if master could always kill all workers while exiting this module may be worth referring to: https://github.com/douban/CaoE
kill old workers by their age hmm, since master knows which group of workers is old, why don't kill them explicitly? IMHO depending on workers' age sounds like kind of hacking ...
killing workers not safe during reload, Aribtter.reload() would try to kill workers by calling manage_workers() hoverver when manage_workers() is called in main loop, still try to kill same worker pid even if that worker already exited but that pid is pretty possible has been assigned to another process, which not belongs to gunicorn at all as a result, an innocent process may receive QUIT signal. full log pasted here https://gist.github.com/2971050

benoitc commented 12 years ago

On error the master exit and try to kill the worker if it can. Also if the master died, the worker will check it most of the time and stop themselves. The module who refers won't help at all since it won't catch brutal exit anyway.
not sure what you mean here. On reload , we are launching new workers with the new code while waiting that the other died. We are killing them in order. This is a graceful reload.
If the worker died the master will detect it most of the time and relaunch one if the maximum number of workers is OK. If on reload (I guess you mean HUP) the worker exited then it's totally possible to kill an innocent PID but I doubt it's really possible here. Between the time you launch a reload and new workers are launched I really doubt it is possible that the system launch a new worker with the same PID. the OS is supposed to prevent that.

Anyway have you any reproducible case in mind for the first 2 points? Any specific issue in mind?

For the third one I guess that instead of managing immediately the workers we could try to detect murdered workers. which would make the solution slightly better.

georgexsh commented 12 years ago

if config has error, sending HUP to master will make master exit immediately worker would hold socket for a while, which could block restart gunicorn I didn't see workers exit themselves in this scenario.
if my understanding is correct, old workers are killed by calling self.manage_workers() at the end of Aribtter.reload() in manage_workers, it just check age of worker to see whether to kill it I think, pids of old worker can be recored before spawn, and kill them explicitly. this one should not consider as bug, but a have to have improvement :)
kill an innocent pid is possible making assumption that OS will not recycle dead pid very quickly sounds like not very robust I once saw a error msg says bad perm while kill worker, and I think this issue is the reason.

benoitc commented 12 years ago

The worker doesn't hold the socket. If the master is exiting the worker will detect it after a while. This is expected. The master is the one that manage workers afterall and it is only doing this job.
not sure to follow: we know in which order the worker are launched due to the age. (which is done as spawn). Sound like what you're describing.
This is not an assumption, under unix (except openbsd) PIDs are allocated on a sequenntial basic beginning at 0 until the limit is reached. Then it restart from 0. Killing a recycled will barely happen then. How much often do you think a worker you own is killed before the manager try to manage them? Anyway we could eventually kill them by only checking if they are the master children. Will have a look on it.

georgexsh commented 12 years ago

The worker doesn't hold the socket

oh, then something strange happened here --- 5462 1 0 91814 92272 2 19:48 ? 00:00:02 gunicorn: worker then lsof -p 5462 | grep LISTEN outputs gunicorn: 5462 --- 7u IPv4 114207280 0t0 TCP *:9460 (LISTEN)

error log:

2012-06-22 19:53:17 [23787] [ERROR] Connection in use: ('0.0.0.0', 9460)
2012-06-22 19:53:17 [23787] [ERROR] Retrying in 1 second.

benoitc commented 12 years ago

On Fri, Jun 22, 2012 at 1:58 PM, georgexsh reply@reply.github.com wrote:

The worker doesn't hold the socket

oh, then something strange happened here --- 5462 1 0 91814 92272 2 19:48 ? 00:00:02 gunicorn: worker then lsof -p 5462 | grep LISTEN outputs gunicorn: 5462 --- 7u IPv4 114207280 0t0 TCP *:9460 (LISTEN)

error log:
2012-06-22 19:53:17 [23787] [ERROR] Connection in use: ('0.0.0.0', 9460)
2012-06-22 19:53:17 [23787] [ERROR] Retrying in 1 second.

Ah here it is still accepting on that which stop after the timeout. it should be closed on exit. But the one that was listening on was the master.

benoit

georgexsh commented 12 years ago

@benoitc on my understanding, master was listening, then woker inherit the socket though workers could exit after a while, I still think this is not very friendly

benoitc commented 12 years ago

If the master is killed using SIGQUIT everything will be allright. The only situation where that could happen is when the master is brutally killed either someone send a SIGKILL signal or when there are uncatched errors. If this still happen the workers will detetect after max timeout / 2 seconds if the master died and then quit. There are no way to manage SIGKILL.

georgexsh commented 12 years ago

Yeah here is not about SIGKILL but config file error which coule be better hanled, What about to add a sys.atexit hook to kill workers?

On Friday, June 22, 2012, Benoit Chesneau wrote:

If the master is killed using SIGQUIT everything will be allright. The only situation where that could happen is when the master is brutally killed either someone send a SIGKILL signal or when there are uncatched errors. If this still happen the zorkers will detetect after max timeout / 2 seconds if the master died and then quit. There are no way to manage SIGKILL<

Reply to this email directly or view it on GitHub: https://github.com/benoitc/gunicorn/issues/371#issuecomment-6506395

Xie Shi http://xerr.net/ http://www.oiegg.com/ http://www.douban.com/people/temp/

benoitc commented 12 years ago

using sys.atexit won't change anything I think. Since the vm crashed it won't be able to gracefully kill all workers.

georgexsh commented 12 years ago

@benoitc as aforementioned, not VM crashing or SIGKILL, but config file error, which could be handled by kill workers.

benoitc commented 12 years ago

I don't follow.. if you kill the vm using SIGKILL, atexit won't be executed at all since the vm won't even noticed it has been killed... Or do you mean something other?

georgexsh commented 12 years ago

@benoitc config file error won't trigger SIGKILL, right?

tilgovi commented 10 years ago

We could prevent the possibility of killing a process that isn't a worker by handling sigchld and cleaning up the worker before waking the main thread.

tilgovi commented 10 years ago

About config errors: the last thing the arbiter does is spawn and kill workers. All we have to do is try/except in the reload method so we leave the old workers alive.

benoitc commented 10 years ago

@tilgovi that could work for the config errors by wrapping https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L387 in a try...except block.

The next step is to make sure to not kill a worker if the updated app is bugged. Actually we are launching new workers and right after it we manage the killed workers:

https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L417-L421

The only problem here is when the app is crashing after a delay. Not sure what to do in that case. Thoughts?

tilgovi commented 10 years ago

Might be not this issue at that point. We have the other issue about flapping due to application errors. On Feb 24, 2014 12:49 AM, "Benoit Chesneau" notifications@github.com wrote:

@tilgovi https://github.com/tilgovi that could work for the config errors by wrapping https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L387in a try...except block.

The next step is to make sure to not kill a worker if the updated app is bugged. Actually we are launching new workers and right after it we manage the killed workers:

https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L417-L421

The only problem here is when the app is crashing after a delay. Not sure what to do in that case. Thoughts?

— Reply to this email directly or view it on GitHubhttps://github.com/benoitc/gunicorn/issues/371#issuecomment-35867020 .

tilgovi commented 10 years ago

For the config errors, we have another issue: #568 The way we decide which workers are old is fine. Age == oldness. But now we are safer about reaping.

benoitc / gunicorn

Issues with reload #371