Intermittent threading hangs with process pools

erikrose commented 8 years ago

I've been chasing this for a year or so (across various versions of Python and futures—this time 2.7.6 and 3.0.3) and finally went through the rigamarole of settings up the Python gdb tools to get some decent tracebacks out of it. In short, during large jobs with thousands of tasks, execution sometimes hangs. It runs for about an hour, getting somewhere between 11-17% done in the current reproduction; conveniently, I have a progress bar. The variation makes me think it's some kind of timing bug. The CPU use slowly falls down to 0 as the worker processes complete and no new ones are scheduled to replace them. I end up with a process table like this:

  585 ?        00:01:58 dxr
  605 ?        01:03:30 dxr <defunct>
  606 ?        00:22:48 dxr <defunct>
  607 ?        00:01:50 dxr <defunct>
  609 ?        00:17:43 dxr <defunct>

The defunct processes are the workers. Adding -L, we can see the threads futures spins up to coordinate the work distribution:

  585   585 ?        00:00:39 dxr
  585   603 ?        00:00:00 dxr
  585   604 ?        00:00:00 dxr
  585   608 ?        00:01:16 dxr
  605   605 ?        01:03:30 dxr <defunct>
  606   606 ?        00:22:48 dxr <defunct>
  607   607 ?        00:01:50 dxr <defunct>
  609   609 ?        00:17:43 dxr <defunct>

I don't know why there are only 3 of them, when my process pool is of size 4. Maybe that's a clue?

The Python traceback, from attaching with gdb and using its Python tools, looks like this:

(gdb) py-bt
#5 Frame 0x23c2e90, for file /usr/lib/python2.7/threading.py, line 339, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f2716b74ea0>, acquire=<built-in method acquire of thread.lock object at remote 0x7f2716b74ea0>, _Condition__waiters=[<thread.lock at remote 0x7f2716b74e00>], release=<built-in method release of thread.lock object at remote 0x7f2716b74ea0>) at remote 0x7f2716b5c920>, timeout=None, waiter=<thread.lock at remote 0x7f2716b74e00>, saved_state=None)
    waiter.acquire()
#9 Frame 0x7f272708f460, for file /usr/lib/python2.7/threading.py, line 620, in wait (self=<_Event(_Verbose__verbose=False, _Event__flag=False, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f2716b74ea0>, acquire=<built-in method acquire of thread.lock object at remote 0x7f2716b74ea0>, _Condition__waiters=[<thread.lock at remote 0x7f2716b74e00>], release=<built-in method release of thread.lock object at remote 0x7f2716b74ea0>) at remote 0x7f2716b5c920>) at remote 0x7f2716b5c840>, timeout=None)
    self.__cond.wait(timeout)
#13 Frame 0x24189e0, for file /code/dbg-venv/local/lib/python2.7/site-packages/concurrent/futures/_base.py, line 217, in as_completed (fs=set([<...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <.....(truncated)
    waiter.event.wait(wait_timeout)
#19 Frame 0x2424df0, for file /code/dbg-venv/local/lib/python2.7/site-packages/click/_termui_impl.py, line 240, in next (self=<ProgressBar(color=None, pos=47, length_known=True, max_width=62, file=<_NonClosingTextIOWrapper(_stream=<_FixupStream(_stream=<file at remote 0x7f272f2751c0>) at remote 0x7f27270a6fb0>) at remote 0x7f272b1edc60>, is_hidden=False, avg=[<float at remote 0x25d2558>, <float at remote 0x25d2530>, <float at remote 0x25d2508>, <float at remote 0x25d24e0>, <float at remote 0x25d24b8>, <float at remote 0x25d25a8>, <float at remote 0x25d25d0>], last_eta=<float at remote 0x25d2468>, width=36, info_sep='  ', bar_template='%(label)-18s [%(bar)s] %(info)s', label='Indexing files', empty_char='-', start=<float at remote 0x2360e88>, entered=True, item_show_func=None, autowidth=False, show_percent=None, show_pos=False, finished=False, fill_char='#', eta_known=True, show_eta=False, iter=<generator at remote 0x7f2716b4cb60>, length=273, current_item=<Future(_exception=None, _result=None, _condition=<_Condit...(truncated)
    rv = next(self.iter)
#27 Frame 0x24187a0, for file /home/dxr/dxr/dxr/build.py, line 325, in show_progress (futures=[<Future(_exception=None, _result=None, _condition=<_Condition(_Condition__lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f272bb3bc20>, _RLock__count=0) at remote 0x7f27270a6f40>, acquire=<instancemethod at remote 0x7f27270ee9e0>, _is_owned=<instancemethod at remote 0x7f27270ee4e0>, _release_save=<instancemethod at remote 0x7f27270ee8e0>, release=<instancemethod at remote 0x7f27270ee5e0>, _acquire_restore=<instancemethod at remote 0x7f27270eeae0>, _Verbose__verbose=False, _Condition__waiters=[]) at remote 0x7f272f13d4c0>, _state='RUNNING', _traceback=None, _waiters=[<_AsCompletedWaiter(lock=<thread.lock at remote 0x7f2716b74f40>, event=<_Event(_Verbose__verbose=False, _Event__flag=False, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f2716b74ea0>, acquire=<built-in method acquire of thread.lock object at remote 0x7f2716b74ea0...(truncated)
    for future in bar:
#30 Frame 0x2411980, for file /home/dxr/dxr/dxr/build.py, line 638, in index_files (tree=<TreeConfig(_section={'ignore_filenames': ['.hg', '.git', 'CVS', '.svn', '.bzr', '.deps', '.libs', '.DS_Store', '.nfs*', '*~', '._*'], 'build_command': 'cd $source_folder && ./mach clobber && make -f client.mk build MOZ_OBJDIR=$object_folder MOZ_MAKE_FLAGS="-s -j$jobs"', 'description': '', 'core': {}, 'enabled_plugins': [<Plugin(tree_to_index=<type at remote 0x2325170>, file_to_skim=None, filters=[<type at remote 0x232e220>, <type at remote 0x2324270>, <type at remote 0x2323e80>, <type at remote 0x2324d80>, <type at remote 0x233b890>, <type at remote 0x2324990>], analyzers={'tokenizer': {'trigram_tokenizer': {'max_gram': 3, 'type': 'nGram', 'min_gram': 3}}, 'analyzer': {'trigramalyzer': {'type': 'custom', 'tokenizer': 'trigram_tokenizer'}, 'lowercase': {'filter': ['lowercase'], 'type': 'custom', 'tokenizer': 'keyword'}, 'trigramalyzer_lower': {'filter': ['lowercase'], 'type': 'custom', 'tokenizer': 'trigram_tokenizer'}}}, ref...(truncated)
    for future in show_progress(futures, 'Indexing files'):
#33 Frame 0x231ea40, for file /home/dxr/dxr/dxr/build.py, line 275, in index_tree (tree=<TreeConfig(_section={'ignore_filenames': ['.hg', '.git', 'CVS', '.svn', '.bzr', '.deps', '.libs', '.DS_Store', '.nfs*', '*~', '._*'], 'build_command': 'cd $source_folder && ./mach clobber && make -f client.mk build MOZ_OBJDIR=$object_folder MOZ_MAKE_FLAGS="-s -j$jobs"', 'description': '', 'core': {}, 'enabled_plugins': [<Plugin(tree_to_index=<type at remote 0x2325170>, file_to_skim=None, filters=[<type at remote 0x232e220>, <type at remote 0x2324270>, <type at remote 0x2323e80>, <type at remote 0x2324d80>, <type at remote 0x233b890>, <type at remote 0x2324990>], analyzers={'tokenizer': {'trigram_tokenizer': {'max_gram': 3, 'type': 'nGram', 'min_gram': 3}}, 'analyzer': {'trigramalyzer': {'type': 'custom', 'tokenizer': 'trigram_tokenizer'}, 'lowercase': {'filter': ['lowercase'], 'type': 'custom', 'tokenizer': 'keyword'}, 'trigramalyzer_lower': {'filter': ['lowercase'], 'type': 'custom', 'tokenizer': 'trigram_tokenizer'}}}, refs...(truncated)
    index_files(tree, tree_indexers, index, pool, es)
#37 Frame 0x23ea910, for file /home/dxr/dxr/dxr/build.py, line 64, in index_and_deploy_tree (tree=<TreeConfig(_section={'ignore_filenames': ['.hg', '.git', 'CVS', '.svn', '.bzr', '.deps', '.libs', '.DS_Store', '.nfs*', '*~', '._*'], 'build_command': 'cd $source_folder && ./mach clobber && make -f client.mk build MOZ_OBJDIR=$object_folder MOZ_MAKE_FLAGS="-s -j$jobs"', 'description': '', 'core': {}, 'enabled_plugins': [<Plugin(tree_to_index=<type at remote 0x2325170>, file_to_skim=None, filters=[<type at remote 0x232e220>, <type at remote 0x2324270>, <type at remote 0x2323e80>, <type at remote 0x2324d80>, <type at remote 0x233b890>, <type at remote 0x2324990>], analyzers={'tokenizer': {'trigram_tokenizer': {'max_gram': 3, 'type': 'nGram', 'min_gram': 3}}, 'analyzer': {'trigramalyzer': {'type': 'custom', 'tokenizer': 'trigram_tokenizer'}, 'lowercase': {'filter': ['lowercase'], 'type': 'custom', 'tokenizer': 'keyword'}, 'trigramalyzer_lower': {'filter': ['lowercase'], 'type': 'custom', 'tokenizer': 'trigram_tokenizer...(truncated)
    index_name = index_tree(tree, es, verbose=verbose)
#41 Frame 0x231f140, for file /home/dxr/dxr/dxr/cli/index.py, line 26, in index (config=<Config(_section={'es_hosts': 'http://12---Type <return> to continue, or q <return> to quit---
7.0.0.1:9200/', 'default_tree': 'mozilla-central', 'max_thumbnail_size': 20000, 'es_alias': 'dxr_{format}_{tree}', 'es_catalog_index': 'dxr_catalog', 'workers': 4, 'log_folder': '/code/dxr-logs-{tree}', 'es_indexing_timeout': 60, 'temp_folder': '/code/dxr-temp-{tree}', 'google_analytics_key': '', 'es_refresh_interval': 60, 'es_catalog_replicas': 1, 'www_root': '', 'es_index': 'dxr_{format}_{tree}_{unique}', 'generated_date': 'Fri, 05 Feb 2016 18:27:38 +0000', 'skip_stages': ['build']}, trees=<OrderedDict(_OrderedDict__map={'mozilla-central': ['mozilla-central', [None, [...], [...]], [...]]}, _OrderedDict__end=[...]) at remote 0x7f272b1ae5c0>) at remote 0x7f272b5e9300>, verbose=False, tree_names=(), tree=<TreeConfig(_section={'ignore_filenames': ['.hg', '.git', 'CVS', '.svn', '.bzr', '.deps', '.libs', '.DS_Store', '.nfs*', '*~', '._*'], 'build_command': 'cd $source_folder && ./mach clobbe...(truncated)
...

Here's the calling code.

Here's the C traceback as well, in case it's helpful:

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x000000000056e2a4 in PyThread_acquire_lock (lock=0x22b12e0, waitflag=1) at ../Python/thread_pthread.h:324
#2  0x000000000060fba9 in lock_PyThread_acquire_lock (self=0x7f2716b74e00, args=()) at ../Modules/threadmodule.c:52
#3  0x00000000004877aa in PyCFunction_Call (func=<built-in method acquire of thread.lock object at remote 0x7f2716b74e00>,
    arg=(), kw=0x0) at ../Objects/methodobject.c:81
#4  0x00000000005273b4 in call_function (pp_stack=0x7ffcb97d7450, oparg=0) at ../Python/ceval.c:4020
#5  0x00000000005222e1 in PyEval_EvalFrameEx (
    f=Frame 0x23c2e90, for file /usr/lib/python2.7/threading.py, line 339, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f2716b74ea0>, acquire=<built-in method acquire of thread.lock object at remote 0x7f2716b74ea0>, _Condition__waiters=[<thread.lock at remote 0x7f2716b74e00>], release=<built-in method release of thread.lock object at remote 0x7f2716b74ea0>) at remote 0x7f2716b5c920>, timeout=None, waiter=<thread.lock at remote 0x7f2716b74e00>, saved_state=None), throwflag=0) at ../Python/ceval.c:2666
#6  0x0000000000524b9a in PyEval_EvalCodeEx (co=0x7f272ce09720,
    globals={'current_thread': <function at remote 0x7f272ce27060>, '_BoundedSemaphore': <type at remote 0x1f88210>, 'currentThread': <function at remote 0x7f272ce27060>, '_Timer': <type at remote 0x1f8a6a0>, '_format_exc': <function at remote 0x7f272ce16c30>, 'Semaphore': <function at remote 0x7f272ce22a38>, '_deque': <type at remote 0x8c7c60>, 'activeCount': <function at remote 0x7f272ce27300>, '_profile_hook': None, '_sleep': <built-in function sleep>, '_trace_hook': None, 'ThreadError': <type at remote 0x1df77d0>, '_enumerate': <function at remote 0x7f272ce273a8>, '_start_new_thread': <built-in function start_new_thread>, 'BoundedSemaphore': <function at remote 0x7f272ce241b0>, '_shutdown': <instancemethod at remote 0x7f272d2048e0>, '__all__': ['activeCount', 'active_count', 'Condition', 'currentThread', 'current_thread', 'enumerate', 'Event', 'Lock', 'RLock', 'Semaphore', 'BoundedSemaphore', 'Thread', 'Timer', 'setprofile', 'settrace', 'local', 'stack_size'], '_Event': <type at remote 0x1f88930>, 'active_count': <fu...(truncated), locals=0x0, args=0x7f272708f5f8, argcount=2, kws=0x7f272708f608, kwcount=0, defs=0x7f272ce23168,
    defcount=1, closure=0x0) at ../Python/ceval.c:3252
#7  0x000000000052799e in fast_function (func=<function at remote 0x7f272ce22f78>, pp_stack=0x7ffcb97d78f0, n=2, na=2, nk=0)
    at ../Python/ceval.c:4116
#8  0x0000000000527588 in call_function (pp_stack=0x7ffcb97d78f0, oparg=1) at ../Python/ceval.c:4041
#9  0x00000000005222e1 in PyEval_EvalFrameEx (
    f=Frame 0x7f272708f460, for file /usr/lib/python2.7/threading.py, line 620, in wait (self=<_Event(_Verbose__verbose=False, _Event__flag=False, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f2716b74ea0>, acquire=<built-in method acquire of thread.lock object at remote 0x7f2716b74ea0>, _Condition__waiters=[<thread.lock at remote 0x7f2716b74e00>], release=<built-in method release of thread.lock object at remote 0x7f2716b74ea0>) at remote 0x7f2716b5c920>) at remote 0x7f2716b5c840>, timeout=None), throwflag=0) at ../Python/ceval.c:2666
#10 0x0000000000524b9a in PyEval_EvalCodeEx (co=0x7f272ce0e5c0,
    globals={'current_thread': <function at remote 0x7f272ce27060>, '_BoundedSemaphore': <type at remote 0x1f88210>, 'currentThread': <function at remote 0x7f272ce27060>, '_Timer': <type at remote 0x1f8a6a0>, '_format_exc': <function at remote 0x7f272ce16c30>, 'Semaphore': <function at remote 0x7f272ce22a38>, '_deque': <type at remote 0x8c7c60>, 'activeCount': <function at remote 0x7f272ce27300>, '_profile_hook': None, '_sleep': <built-in function sleep>, '_trace_hook': None, 'ThreadError': <type at remote 0x1df77d0>, '_enumerate': <function at remote 0x7f272ce273a8>, '_start_new_thread': <built-in function start_new_thread>, 'BoundedSemaphore': <function at remote 0x7f272ce241b0>, '_shutdown': <instancemethod at remote 0x7f272d2048e0>, '__all__': ['activeCount', 'active_count', 'Condition', 'currentThread', 'current_thread', 'enumerate', 'Event', 'Lock', 'RLock', 'Semaphore', 'BoundedSemaphore', 'Thread', 'Timer', 'setprofile', 'settrace', 'local', 'stack_size'], '_Event': <type at remote 0x1f88930>, 'active_count': <fu...(truncated), locals=0x0, args=0x2418bb0, argcount=2, kws=0x2418bc0, kwcount=0, defs=0x7f272ce23478, defcount=1,
    closure=0x0) at ../Python/ceval.c:3252
#11 0x000000000052799e in fast_function (func=<function at remote 0x7f272ce24ae0>, pp_stack=0x7ffcb97d7d90, n=2, na=2, nk=0)
    at ../Python/ceval.c:4116
#12 0x0000000000527588 in call_function (pp_stack=0x7ffcb97d7d90, oparg=1) at ../Python/ceval.c:4041
#13 0x00000000005222e1 in PyEval_EvalFrameEx (
    f=Frame 0x24189e0, for file /code/dbg-venv/local/lib/python2.7/site-packages/concurrent/futures/_base.py, line 217, in as_completed (fs=set([<...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <.....(truncated), throwflag=0) at ../Python/ceval.c:2666
#14 0x0000000000452c14 in gen_send_ex (gen=0x7f2716b4cb60, arg=0x0, exc=0) at ../Objects/genobject.c:85
#15 0x0000000000453581 in gen_iternext (gen=0x7f2716b4cb60) at ../Objects/genobject.c:283
#16 0x0000000000513dce in builtin_next (self=0x0, args=(<generator at remote 0x7f2716b4cb60>,)) at ../Python/bltinmodule.c:1107
#17 0x00000000004877aa in PyCFunction_Call (func=<built-in function next>, arg=(<generator at remote 0x7f2716b4cb60>,), kw=0x0)
    at ../Objects/methodobject.c:81
#18 0x00000000005273b4 in call_function (pp_stack=0x7ffcb97d8120, oparg=1) at ../Python/ceval.c:4020
#19 0x00000000005222e1 in PyEval_EvalFrameEx (
    f=Frame 0x2424df0, for file /code/dbg-venv/local/lib/python2.7/site-packages/click/_termui_impl.py, line 240, in next (self=<ProgressBar(color=None, pos=47, length_known=True, max_width=62, file=<_NonClosingTextIOWrapper(_stream=<_FixupStream(_stream=<file at remote 0x7f272f2751c0>) at remote 0x7f27270a6fb0>) at remote 0x7f272b1edc60>, is_hidden=False, avg=[<float at remote 0x25d2558>, <float at remote 0x25d2530>, <float at remote 0x25d2508>, <float at remote 0x25d24e0>, <float at remote 0x25d24b8>, <float at remote 0x25d25a8>, <float at remote 0x25d25d0>], last_eta=<float at remote 0x25d2468>, width=36, info_sep='  ', bar_template='%(label)-18s [%(bar)s] %(info)s', label='Indexing files', empty_char='-', start=<float at remote 0x2360e88>, entered=True, item_show_func=None, autowidth=False, show_percent=None, show_pos=False, finished=False, fill_char='#', eta_known=True, show_eta=False, iter=<generator at remote 0x7f2716b4cb60>, length=273, current_item=<Future(_exception=None, _result=None, _condition=<_Condit...(truncated), throwflag=0) at ../Python/ceval.c:2666
...

Let me know if I can supply any more information. I'm also not sure if this is more properly filed with upstream, as my codebase isn't Python 3 clean. Thank you!

erikrose commented 8 years ago

As I turned off concurrency in my project to debug further, I unconvered a segfault in the work that one of the workers would have been doing. Would a segfaulting worker cause the master to hang?

agronholm commented 8 years ago

Seems unlikely but without a way to test it myself, I can't say anything conclusive. Subprocess handling is really flaky on Python 2 and that is unfortunately unfixable to my best knowledge.

gromgull commented 7 years ago

I have a similar issue - my best theory is something around this: http://bugs.python.org/issue6721

If you have multiple threads (which you will do if you use multiprocessing, since there are queue feeder threads etc.) and one of these threads holds a lock when the process forks, the lock is never released on the child. In particular, this affects logging (but the multiprocessing has it's own logging which apparently is not affected), and also flushing stdout.

gromgull commented 7 years ago

I've made a small test case that displays this problem: https://gist.github.com/gromgull/3a2e343d50184a853fcf1dca5e690a6b

This breaks on python 2.7.12 for me, maybe 25-50% of the time.

gromgull commented 7 years ago

One last comment: if I tweak that example slightly and make it use multiprocessing.Pool and async_apply, it runs fine in 2.7

agronholm commented 7 years ago

I'm not sure what you expect here. Forking with threads is a Really Really Bad Idea, and to the best of my knowledge, subprocess handling is permanently broken on (C)Python 2.

gromgull commented 7 years ago

I know threads and forks don't mix, and my test program above has no threads of its own. The link above just meant as a possible explanation - since the multiprocessing module uses threads internally (you can see this if you add a print threading.enumerate() while the pool is active), maybe it is somehow related?

If the bottom line is that a robust ProcessPoolExecutor is impossible on 2.7 - maybe it should log a warning to that effect when instantiated?

agronholm commented 7 years ago

Yes, that is certainly worth considering.

agronholm commented 7 years ago

I've mentioned this problem in the README. Good enough?

erikrose commented 7 years ago

It's an improvement, but I think @gromgull's suggestion of a warning would be ideal. If you want to stick with just the readme mention, I'd add a few more keywords like "hang" so search engines find it. Cheers!

agronholm / pythonfutures

Intermittent threading hangs with process pools #52