Closed georgexsh closed 10 years ago
Anyway, do you have any reproducible case in mind for the first two points? Any specific issue in mind?
For the third one, I guess that instead of managing the workers immediately we could try to detect murdered workers, which would make the solution slightly better.
self.manage_workers() is called at the end of Arbiter.reload(). In manage_workers(), it only checks the age of a worker to decide whether to kill it. I think the pids of the old workers can be recorded before the spawn, and then those pids killed explicitly. This should not be considered a bug, but a have-to-have improvement :)

The worker doesn't hold the socket
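The record-the-old-pids idea suggested above (note the pids before spawning replacements, then signal exactly those pids instead of relying on worker age) might look roughly like this sketch. The `Arbiter` shape, `spawn_workers`, and `kill_worker` here are simplified stand-ins for illustration, not gunicorn's actual code:

```python
import signal

class Arbiter:
    """Minimal sketch of the suggested reload: record the old worker
    pids before spawning replacements, then signal exactly those."""

    def __init__(self):
        self.WORKERS = {}   # pid -> worker object, as in gunicorn's arbiter
        self.killed = []    # recorded here so the behavior is observable

    def spawn_workers(self, n):
        # stand-in for gunicorn's fork-based spawn
        for _ in range(n):
            pid = max(self.WORKERS, default=0) + 1
            self.WORKERS[pid] = object()

    def kill_worker(self, pid, sig):
        # gunicorn would call os.kill(pid, sig) here
        self.killed.append((pid, sig))
        self.WORKERS.pop(pid, None)

    def reload(self, num_workers):
        # record the pids that exist *before* spawning replacements...
        old_pids = list(self.WORKERS)
        self.spawn_workers(num_workers)
        # ...and retire exactly those pids, instead of checking age later
        for pid in old_pids:
            self.kill_worker(pid, signal.SIGQUIT)
```

With this shape, a pid that was never in the recorded list can never be signaled during a reload, whatever the age bookkeeping says.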
Oh, then something strange happened here. ps shows:
--- 5462 1 0 91814 92272 2 19:48 ? 00:00:02 gunicorn: worker
Then lsof -p 5462 | grep LISTEN outputs:
gunicorn: 5462 --- 7u IPv4 114207280 0t0 TCP *:9460 (LISTEN)
error log:
2012-06-22 19:53:17 [23787] [ERROR] Connection in use: ('0.0.0.0', 9460)
2012-06-22 19:53:17 [23787] [ERROR] Retrying in 1 second.
Ah, here it is still accepting on that socket, which stops after the timeout; it should be closed on exit. But the one that was listening was the master.
@benoitc In my understanding the master was listening, then the workers inherited the socket. Even though the workers exit after a while, I still think this is not very friendly.
If the master is killed using SIGQUIT, everything will be all right. The only situation where that could happen is when the master is brutally killed, either because someone sent a SIGKILL signal or because there were uncaught errors. If this still happens, the workers will detect after at most timeout / 2 seconds that the master died and then quit. There is no way to manage SIGKILL.
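The detection described above works because a worker is re-parented (typically to pid 1) when its master dies, so it only has to compare its current parent pid with the one it recorded at fork time. A minimal sketch of that check, assuming the worker polls it in its run loop (gunicorn does this inside the worker, though the exact code differs):

```python
import os

def master_is_alive(master_pid):
    """True while our parent is still the master we were forked from.
    When the master dies, getppid() changes (usually to 1)."""
    return os.getppid() == master_pid

# At fork time a worker would record its parent:
recorded_master = os.getppid()

# ...and in its run loop, roughly every timeout / 2 seconds, check:
if not master_is_alive(recorded_master):
    # a real worker would log "parent changed" and shut itself down
    pass
```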
Yeah, here it is not about SIGKILL but about a config file error, which could be better handled. What about adding a sys.atexit hook to kill the workers?
Using sys.atexit won't change anything, I think. Since the VM crashed, it won't be able to gracefully kill all the workers.
@benoitc As mentioned above, it's not the VM crashing or SIGKILL, but a config file error, which could be handled by killing the workers.
I don't follow... if you kill the VM using SIGKILL, atexit won't be executed at all since the VM won't even notice it has been killed... Or do you mean something else?
@benoitc A config file error won't trigger SIGKILL, right?
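The distinction being argued here is easy to demonstrate: an atexit hook does run when the process dies from an uncaught error (such as a bad config), but never when it is SIGKILLed, because the interpreter gets no chance to run anything. A small POSIX-only experiment (the child script and marker string are made up for the demonstration):

```python
import subprocess
import sys
import textwrap

# A child script that registers a cleanup hook and then dies in one of
# two ways; the hook prints a marker line if it actually runs.
CHILD = textwrap.dedent("""
    import atexit, os, signal, sys
    atexit.register(lambda: print("cleanup ran", flush=True))
    mode = sys.argv[1]
    if mode == "error":
        raise RuntimeError("bad config")      # uncaught error: hook runs
    elif mode == "sigkill":
        os.kill(os.getpid(), signal.SIGKILL)  # brutal kill: hook never runs
""")

def cleanup_ran(mode):
    """Run the child in the given mode; report whether its hook fired."""
    out = subprocess.run([sys.executable, "-c", CHILD, mode],
                         capture_output=True, text=True)
    return "cleanup ran" in out.stdout
```

So an atexit hook would cover the config-error case georgexsh describes, while leaving the SIGKILL case exactly as benoitc says: unmanageable.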
We could prevent the possibility of killing a process that isn't a worker by handling SIGCHLD and cleaning up the worker before waking the main thread.
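The cleanup step described above can be sketched as a reap loop that pops every exited child out of the worker table before the main loop can signal anyone. This is a simplified, hypothetical shape modeled on what an arbiter's SIGCHLD handling could do, not gunicorn's actual implementation:

```python
import errno
import os

WORKERS = {}  # pid -> worker, as in gunicorn's arbiter

def reap_workers():
    """Reap every exited child and drop it from WORKERS, so a pid the
    OS may recycle is never signaled again by the main loop."""
    try:
        while True:
            pid, status = os.waitpid(-1, os.WNOHANG)
            if pid == 0:            # children exist, but none exited yet
                break
            WORKERS.pop(pid, None)  # forget the dead worker immediately
    except OSError as e:
        if e.errno != errno.ECHILD:  # ECHILD: no children left at all
            raise
```

Wired up as `signal.signal(signal.SIGCHLD, lambda s, f: reap_workers())`, the table would already be clean by the time the main thread wakes.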
About config errors: the last thing the arbiter does is spawn and kill workers. All we have to do is try/except in the reload method so we leave the old workers alive.
@tilgovi That could work for the config errors by wrapping https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L387 in a try...except block.
The next step is to make sure not to kill a worker if the updated app is bugged. Currently we launch the new workers and right after that we manage the killed workers:
https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L417-L421
The only problem here is when the app is crashing after a delay. Not sure what to do in that case. Thoughts?
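The try/except proposed above amounts to treating a failed config load as a no-op: keep the old config and the old workers rather than leaving gunicorn half-reloaded. A sketch under assumed names (`load_config` and this `Arbiter` shape are illustrative; the real method lives in gunicorn/arbiter.py):

```python
class Arbiter:
    """Sketch of a reload that survives config errors."""

    def __init__(self, load_config):
        self.load_config = load_config
        self.cfg = load_config()

    def reload(self):
        # If loading the new config blows up, keep the old config and
        # the old workers instead of dying mid-reload.
        try:
            new_cfg = self.load_config()
        except Exception:
            # a real arbiter would log a warning here and carry on
            return
        self.cfg = new_cfg
        # ...spawn new workers, then retire the old ones...
```

This covers the immediate-failure case; as noted above, an app that only crashes after a delay is the part this pattern cannot catch.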
This might not be this issue at that point. We have the other issue about flapping due to application errors.
For the config errors, we have another issue: #568. The way we decide which workers are old is fine: age == oldness. But now we are safer about reaping.
Arbiter.reload() would try to kill workers by calling manage_workers(). However, when manage_workers() is later called in the main loop, it still tries to kill the same worker pid even if that worker has already exited. That pid may very well have been reassigned to another process that does not belong to gunicorn at all; as a result, an innocent process may receive a QUIT signal. Full log pasted here: https://gist.github.com/2971050
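The hazard described above can be narrowed with a defensive kill helper (hypothetical, for illustration): only signal pids the arbiter still tracks, and treat an already-gone pid as reaped rather than signaling blindly. Note this shrinks but does not fully close the pid-reuse window, since the OS could still recycle the pid between the check and `os.kill()`; prompt reaping on SIGCHLD is the other half of the fix.

```python
import os
import signal

def kill_worker(workers, pid, sig=signal.SIGQUIT):
    """Signal a worker defensively.

    workers: dict of pid -> worker, the arbiter's bookkeeping.
    Returns False for untracked pids (never signal those); swallows
    ESRCH for a worker that exited between bookkeeping and kill.
    """
    if pid not in workers:
        return False                 # never signal a pid we don't own
    try:
        os.kill(pid, sig)
    except ProcessLookupError:       # ESRCH: worker already gone
        pass
    workers.pop(pid, None)           # forget it either way
    return True
```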