erikrose opened 8 years ago
When I turned off concurrency in my project to debug further, I uncovered a segfault in the work that one of the workers would have been doing. Would a segfaulting worker cause the master to hang?
Seems unlikely, but without a way to test it myself I can't say anything conclusive. Subprocess handling is really flaky on Python 2, and that is unfortunately unfixable to the best of my knowledge.
I have a similar issue - my best theory is something around this: http://bugs.python.org/issue6721
If you have multiple threads (which you will have if you use multiprocessing, since there are queue feeder threads etc.) and one of those threads holds a lock when the process forks, the lock is never released in the child. In particular, this affects logging (though the multiprocessing module has its own logging, which apparently is not affected), and also flushing stdout.
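To make that concrete, here is a minimal illustrative sketch (not the test case linked below) of the lock-held-at-fork failure described in issue 6721: a helper thread holds a lock at the moment of fork, so the child inherits a lock that nobody will ever release.

```python
# Illustrative sketch of the fork-while-a-thread-holds-a-lock problem.
# POSIX-only because it uses os.fork(); the hang is intermittent by nature.
import os
import threading
import time

lock = threading.Lock()

def background_worker():
    # Simulates an internal helper thread (e.g. a queue feeder thread)
    # that repeatedly holds a lock for short periods.
    while True:
        with lock:
            time.sleep(0.001)

t = threading.Thread(target=background_worker)
t.daemon = True
t.start()

time.sleep(0.01)   # give the helper thread a chance to grab the lock
pid = os.fork()    # fork while the lock may be held

if pid == 0:
    # Child: the copied lock can be stuck in the "locked" state with no
    # thread alive to release it, so this acquire may block forever.
    lock.acquire()
    print("child acquired the lock")
    os._exit(0)
else:
    os.waitpid(pid, 0)
    print("parent saw the child finish")
```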
I've made a small test case that displays this problem: https://gist.github.com/gromgull/3a2e343d50184a853fcf1dca5e690a6b
This breaks on Python 2.7.12 for me, maybe 25-50% of the time.
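The gist itself isn't reproduced here, but a reproduction of this general shape, assuming many small tasks submitted to a ProcessPoolExecutor that write to stdout, would look roughly like this:

```python
# Rough, illustrative approximation of the kind of test case in the gist
# above (the exact gist contents are not reproduced here).
from concurrent.futures import ProcessPoolExecutor
import sys

def work(i):
    # Trivial work plus a stdout write, which exercises the
    # stdout-flushing path mentioned above.
    sys.stdout.write("task %d\n" % i)
    sys.stdout.flush()
    return i

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(work, range(200)))
    print("done: %d" % len(results))
```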
One last comment: if I tweak that example slightly and make it use multiprocessing.Pool and apply_async, it runs fine on 2.7.
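Again purely as an illustrative sketch (not the literal gist code), that tweak amounts to routing the same work through multiprocessing.Pool and apply_async:

```python
# The same kind of workload submitted via multiprocessing.Pool instead of
# ProcessPoolExecutor (illustrative sketch, not the literal gist code).
import multiprocessing
import sys

def work(i):
    sys.stdout.write("task %d\n" % i)
    sys.stdout.flush()
    return i

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)
    async_results = [pool.apply_async(work, (i,)) for i in range(200)]
    pool.close()
    results = [r.get() for r in async_results]
    pool.join()
    print("done: %d" % len(results))
```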
I'm not sure what you expect here. Forking with threads is a Really Really Bad Idea, and to the best of my knowledge, subprocess handling is permanently broken on (C)Python 2.
I know threads and forks don't mix, and my test program above has no threads of its own. The link above was just meant as a possible explanation: since the multiprocessing module uses threads internally (you can see this if you print threading.enumerate() while the pool is active, as in the sketch below), maybe it is somehow related?
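For example, something along these lines (illustrative, not the gist code) shows the extra internal threads while the pool is alive:

```python
# Illustrative: list the threads that exist while a process pool is active.
from concurrent.futures import ProcessPoolExecutor
import threading
import time

def work(i):
    time.sleep(0.1)
    return i

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(work, i) for i in range(4)]
        # Besides MainThread, this typically shows the executor's internal
        # management thread and multiprocessing's queue feeder threads.
        print(threading.enumerate())
        for f in futures:
            f.result()
```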
If the bottom line is that a robust ProcessPoolExecutor is impossible on 2.7, maybe it should log a warning to that effect when instantiated?
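Purely as an illustration of that suggestion (the library does not do this today, and the helper name here is made up), an application-side wrapper could emit such a warning:

```python
# Illustrative only: an application-side wrapper that warns about the known
# fork+threads hazards before handing back a ProcessPoolExecutor on Python 2.
import sys
import warnings
from concurrent.futures import ProcessPoolExecutor

def make_process_pool(max_workers=None):  # hypothetical helper, not library API
    if sys.version_info[0] == 2:
        warnings.warn(
            "ProcessPoolExecutor on Python 2 can hang or deadlock when the "
            "parent process has other threads (see CPython issue 6721).",
            RuntimeWarning,
        )
    return ProcessPoolExecutor(max_workers=max_workers)
```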
Yes, that is certainly worth considering.
I've mentioned this problem in the README. Good enough?
It's an improvement, but I think @gromgull's suggestion of a warning would be ideal. If you want to stick with just the readme mention, I'd add a few more keywords like "hang" so search engines find it. Cheers!
I've been chasing this for a year or so (across various versions of Python and futures; this time Python 2.7.6 and futures 3.0.3) and finally went through the rigamarole of setting up the Python gdb tools to get some decent tracebacks out of it. In short, during large jobs with thousands of tasks, execution sometimes hangs. It runs for about an hour, getting somewhere between 11% and 17% done in the current reproduction; conveniently, I have a progress bar. The variation makes me think it's some kind of timing bug. CPU use slowly falls to 0 as the worker processes complete and no new ones are scheduled to replace them. I end up with a process table like this:
The defunct processes are the workers. Adding -L, we can see the threads futures spins up to coordinate the work distribution:

I don't know why there are only 3 of them, when my process pool is of size 4. Maybe that's a clue?
The Python traceback, from attaching with gdb and using its Python tools, looks like this:
Here's the calling code.
Here's the C traceback as well, in case it's helpful:
Let me know if I can supply any more information. I'm also not sure whether this is more properly filed upstream, as my codebase isn't Python 3 clean. Thank you!