Yelp / zygote

A Python HTTP process management utility.
http://opensource.yelp.com/
Apache License 2.0
40 stars 15 forks source link

zygote can deadlock in logging #16

Closed Roguelazer closed 11 years ago

Roguelazer commented 12 years ago

I've seen a few times where a zygote will get "stuck" waiting on the logging semaphore. I thing it might be a signal issue, but I'm not positive. The top of the stack trace looks like:

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1  0x00000000004d4498 in PyThread_acquire_lock (lock=0xf20480, waitflag=128) at ../Python/thread_pthread.h:349
#2  0x00000000004d7932 in lock_PyThread_acquire_lock (self=0x7f21f4850168, args=<value optimized out>) at ../Modules/threadmodule.c:46
#3  0x00000000004a7c5e in call_function (f=
    Frame 0xf3c370, for file /usr/lib/python2.6/threading.py, line 123, in acquire (self=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850168>, _RLock__count=0) at remote 0x7f21f478b590>, blocking=1, me=139783813953280), throwflag=<value optimized out>) at ../Python/ceval.c:3750
#4  PyEval_EvalFrameEx (f=
    Frame 0xf3c370, for file /usr/lib/python2.6/threading.py, line 123, in acquire (self=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850168>, _RLock__count=0) at remote 0x7f21f478b590>, blocking=1, me=139783813953280), throwflag=<value optimized out>) at ../Python/ceval.c:2412
#5  0x00000000004a9671 in PyEval_EvalCodeEx (co=0x7f21f47b2eb8, globals=<value optimized out>, locals=<value optimized out>, args=0x2, argcount=<value optimized out>, kws=<value optimized out>, kwcount=0, defs=0x7f21f30b95a8, defcount=1, closure=0x0)
    at ../Python/ceval.c:3000
#6  0x00000000004a7809 in fast_function (f=
    Frame 0xf3c1b0, for file /usr/lib/python2.6/logging/__init__.py, line 622, in acquire (self=<StreamHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850168>, _RLock__count=0) at remote 0x7f21f478b590>, formatter=<Formatter(datefmt=None, _fmt='[%(process)d] %(asctime)s :: %(levelname)-7s :: %(name)s - %(message)s') at remote 0xdc4098>, stream=<file at remote 0x7f21f49101e0>, filters=[], level=20) at remote 0xd1ba70>), throwflag=<value optimized out>)
    at ../Python/ceval.c:3846

and the next interesting part is at

#20 PyEval_EvalFrameEx (f=
    Frame 0xf3b2a0, for file /usr/lib/python2.6/logging/__init__.py, line 1165, in _log (self=<Logger(name='zygote.worker.worker_process', parent=<Logger(name='zygote', parent=<RootLogger(name='root', parent=None, handlers=[<NullHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850180>, _RLock__count=0) at remote 0x7f21f478b8d0>, formatter=None, filters=[], level=0) at remote 0xdc4128>, <StreamHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f48501c8>, _RLock__count=0) at remote 0xdc3710>, formatter=None, stream=<file at remote 0x7f21f49101e0>, filters=[], level=10) at remote 0xdc7050>], level=10, disabled=0, propagate=1, filters=[]) at remote 0x7f21f47b35a8>, handlers=[<StreamHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850168>, _RLock__count=0) at remote 0x7f21f478b590>, formatter=<Formatter(datefmt=None, _fmt='[%(process)d] %(...(truncated), throwflag=<value optimized out>)
    at ../Python/ceval.c:2412
#21 0x00000000004a9671 in PyEval_EvalCodeEx (co=0x7f21f47a46c0, globals=<value optimized out>, locals=<value optimized out>, args=0x6, argcount=<value optimized out>, kws=<value optimized out>, kwcount=0, defs=0x7f21f30be0b0, defcount=2, closure=0x0)
    at ../Python/ceval.c:3000
#22 0x000000000053771d in function_call (func=<function at remote 0x7f21f30c4410>, arg=
    (<Logger(name='zygote.worker.worker_process', parent=<Logger(name='zygote', parent=<RootLogger(name='root', parent=None, handlers=[<NullHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850180>, _RLock__count=0) at remote 0x7f21f478b8d0>, formatter=None, filters=[], level=0) at remote 0xdc4128>, <StreamHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f48501c8>, _RLock__count=0) at remote 0xdc3710>, formatter=None, stream=<file at remote 0x7f21f49101e0>, filters=[], level=10) at remote 0xdc7050>], level=10, disabled=0, propagate=1, filters=[]) at remote 0x7f21f47b35a8>, handlers=[<StreamHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850168>, _RLock__count=0) at remote 0x7f21f478b590>, formatter=<Formatter(datefmt=None, _fmt='[%(process)d] %(asctime)s :: %(levelname)-7s :: %(name)s - %(message)s') at remote 0xdc4098>, stream=<fil...(truncated), kw={})
    at ../Objects/funcobject.c:524
#23 0x000000000041f0c7 in PyObject_Call (func=<function at remote 0x7f21f30c4410>, arg=<unknown at remote 0x80>, kw=0x0) at ../Objects/abstract.c:2492
#24 0x00000000004a6351 in ext_do_call (f=
    Frame 0xf3b0b0, for file /usr/lib/python2.6/logging/__init__.py, line 1048, in info (self=<Logger(name='zygote.worker.worker_process', parent=<Logger(name='zygote', parent=<RootLogger(name='root', parent=None, handlers=[<NullHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850180>, _RLock__count=0) at remote 0x7f21f478b8d0>, formatter=None, filters=[], level=0) at remote 0xdc4128>, <StreamHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f48501c8>, _RLock__count=0) at remote 0xdc3710>, formatter=None, stream=<file at remote 0x7f21f49101e0>, filters=[], level=10) at remote 0xdc7050>], level=10, disabled=0, propagate=1, filters=[]) at remote 0x7f21f47b35a8>, handlers=[<StreamHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850168>, _RLock__count=0) at remote 0x7f21f478b590>, formatter=<Formatter(datefmt=None, _fmt='[%(process)d] %(...(truncated), throwflag=<value optimized out>)
    at ../Python/ceval.c:4063
#25 PyEval_EvalFrameEx (f=
    Frame 0xf3b0b0, for file /usr/lib/python2.6/logging/__init__.py, line 1048, in info (self=<Logger(name='zygote.worker.worker_process', parent=<Logger(name='zygote', parent=<RootLogger(name='root', parent=None, handlers=[<NullHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850180>, _RLock__count=0) at remote 0x7f21f478b8d0>, formatter=None, filters=[], level=0) at remote 0xdc4128>, <StreamHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f48501c8>, _RLock__count=0) at remote 0xdc3710>, formatter=None, stream=<file at remote 0x7f21f49101e0>, filters=[], level=10) at remote 0xdc7050>], level=10, disabled=0, propagate=1, filters=[]) at remote 0x7f21f47b35a8>, handlers=[<StreamHandler(lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850168>, _RLock__count=0) at remote 0x7f21f478b590>, formatter=<Formatter(datefmt=None, _fmt='[%(process)d] %(...(truncated), throwflag=<value optimized out>)
    at ../Python/ceval.c:2452
#26 0x00000000004a9671 in PyEval_EvalCodeEx (co=0x7f21f47a4300, globals=<value optimized out>, locals=<value optimized out>, args=0x2, argcount=<value optimized out>, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0) at ../Python/ceval.c:3000
#27 0x00000000004a7809 in fast_function (f=
    Frame 0xf3aed0, for file /nail/home/jbrown/repos/zygote_test/zygote/zygote/worker.py, line 30, in zygote_exit (signum=15, frame=Frame 0xed9410, for file /usr/lib/python2.6/threading.py, line 123, in acquire (self=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f21f4850168>, _RLock__count=0) at remote 0x7f21f478b590>, blocking=1, me=139783813953280)), throwflag=<value optimized out>) at ../Python/ceval.c:3846

This should be treated as P0, since it can prevent zygote from switching over on version changes.

Pretty easy to repro. Just keep starting up a simple zygote-using application and ctrl-cing it until you get an error binding to port.

I have only observed the zygote itself getting blocked (not the worker processes or the master). That is only anecdotal, though.

baris commented 12 years ago

I just got the same traceback. Python doc says this: http://docs.python.org/library/logging.html#thread-safety

baris commented 12 years ago

I'll take this issue and try a fix with a custom logging handler.

baris commented 11 years ago

This should be fixed, with merge #46. Closing.