We got the issue 8 out of 10 times on our CI pipeline. By adding -s 1 to disable parallel runs for this directory we passed 5 out of 5 times, so we have a temporary workaround.
With the help of this post: https://stackoverflow.com/questions/15349997/assertionerror-when-threading-in-python
we found out that the library under test has a subclass of threading.Thread, but with a different signature. We suspect that this code needs to be refactored to behave better. It looks like the bug is not in green, but the fact that green gets stuck is concerning.
I suppose it got stuck because the code under test got stuck.
Glad to hear the problem was somewhere else. 😉
@CleanCut Unfortunately, I combed over the library that is causing green to freeze, and the only thing I could find is a small bit of code where we subclass threading.Thread so we can collect the return value of run(). We do not use threads anywhere else in that library, and we are not using multiprocessing either.
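For context, the general shape of that pattern is something like this (a simplified sketch, not our actual code, which also changes the signature):

```python
import threading

# Simplified sketch of a Thread subclass that hands back run()'s return value.
# It relies on Thread's private _target/_args/_kwargs attributes, just like
# Thread.run() itself does.
class ReturningThread(threading.Thread):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._result = None

    def run(self):
        if self._target is not None:
            self._result = self._target(*self._args, **self._kwargs)

    def join(self, timeout=None):
        super().join(timeout)
        return self._result


t = ReturningThread(target=lambda: 42)
t.start()
print(t.join())  # 42
```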
I was not able to reproduce locally on macOS either natively or under the same docker container we use in our CI. It is very consistent on our CI system with python 3.9.6.
I see that you wrote green.process.LoggingDaemonlessPool._repopulate_pool() back in 2015, and that the official Pool class now has a _repopulate_pool_static() that was added two years ago, bypassing your _repopulate_pool(). I believe that LoggingDaemonlessPool needs to be updated with its own _repopulate_pool_static().

That's my 1 AM analysis and might be wrong, but so far I really think the bug is in green.
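To illustrate the bypass I am talking about, here is a toy model (not the real multiprocessing or green code; the method names just mirror the ones above):

```python
# Toy model only: the base class calls the new static helper through its own
# name, so a subclass that only overrides the old hook never gets a say.
class BasePool:
    @staticmethod
    def _repopulate_pool_static():
        print('stock repopulation logic')

    @staticmethod
    def _maintain_pool():
        BasePool._repopulate_pool_static()  # spelled out against the base class


class LoggingDaemonlessPool(BasePool):
    def _repopulate_pool(self):  # pre-3.8-style override, now dead code
        print('green-specific repopulation logic')


LoggingDaemonlessPool._maintain_pool()  # prints 'stock repopulation logic'
```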
I ran some experiments and tried patching green/process.py, with no success. I'll investigate further and will let you know.
👍🏻 I'll wait to hear what you find.
I've patched multiprocessing.process.BaseProcess to get more context information:
```python
# Patched into multiprocessing/process.py, where current_process() is defined.
def __init__(self, group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None):
    # Debug block added before the original assert to dump context when group is set.
    if group is not None:
        import inspect
        import sys
        import traceback
        print()
        print('current process:', current_process())
        print(f'sys.argv: {sys.argv}')
        print(
            f'group: {group!r}, target: {target}, name: {name}, args: {args}, kwargs: {kwargs}'
        )
        print('target signature: ', inspect.signature(target))
        traceback.print_stack()
        initializer_or_finalizer = args[2]
        print(
            'InitializerOrFinalizer module:',
            repr(initializer_or_finalizer.module_part),
            'function:',
            repr(initializer_or_finalizer.function_part),
        )
        print()
    # Original first line of BaseProcess.__init__: the assertion we are hitting.
    assert group is None, 'group argument must be None for now'
```
This is what I get:
```
current process: <_MainProcess name='MainProcess' parent=None started>
sys.argv: ['/root/.pyenv/versions/3.9.6/lib/python3.9/site-packages/green/__main__.py']
group: <multiprocessing.context.ForkContext object at 0x7ffb5da34250>, target: <function worker at 0x7ffb5d7543a0>, name: None, args: (<multiprocessing.queues.SimpleQueue object at 0x7ffb23f6a1f0>, <multiprocessing.queues.SimpleQueue object at 0x7ffb254337c0>, <green.runner.InitializerOrFinalizer object at 0x7ffb585d75e0>, (), None, True), kwargs: {}
target signature:  (inqueue, outqueue, initializer=None, initargs=(), maxtasks=None, wrap_exception=False, finalizer=None, finalargs=())
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 513, in _handle_workers
    cls._maintain_pool(ctx, Process, processes, pool, inqueue,
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 337, in _maintain_pool
    Pool._repopulate_pool_static(ctx, Process, processes, pool,
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 319, in _repopulate_pool_static
    w = Process(ctx, target=worker,
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/process.py", line 100, in __init__
    traceback.print_stack()
InitializerOrFinalizer module: '' function: ''
```
Somehow we end up in the __init__ of BaseProcess with the group argument set to a ForkContext, which causes the assertion to fail. I do not have deep experience with the multiprocessing library, so my understanding is limited.
The other issue is that this has been very difficult to reproduce. It fails very consistently under our CI pipeline conditions, but changing certain elements of the setup avoids the issue.
Let me know if the data I dumped helps you and if there are other things you would like me to dump. Unfortunately I cannot share a reproducible case, since the Docker image and our code are private.
I'll pause my investigation for a few days. I tried many different combinations but I am still unable to pinpoint any specific test or combination of tests that trigger the issue. It seems that the number of tests might be a factor since we have a few hundred tests in that directory.
The next thing I will try is to revert the two threading patches added in python 3.9.6 and see if they are the ones triggering the issue.
Good news! I was able to reproduce the problem! 🎉
That diff you linked to was super helpful. You were correct that the problem was because of the addition of _repopulate_pool_static, which broke repopulating pool subprocesses after they have died. Since green sets the processes to never die, this usually wouldn't be a problem, but somehow your workload is managing to kill all of the subprocesses -- and then, since the pool has no workers pulling jobs from the queue, everything just hangs forever.

I was able to reproduce the problem in aa572e1a1e40e091882a565a9200e3242453cf22 by adding a -X, --maxtasksperchild option, setting it to 1, and then trying to run more suites than the number of subprocesses that get launched. I hit a hang in Python 3.8 and 3.9, but Python 3.7 works fine.
Now to fix the problem...
Fixed in Green 3.3.0, which I just released. 🥳 🎈 🎉
Thanks a lot for the quick fix!!! I'll try 3.3.0 in our CI pipeline tomorrow and will let you know how it goes.
🤞🏻
I've tested green 3.3.0 and I can confirm that the AssertionError stack trace is gone; however, the tests are still stuck in parallel mode and never return. Passing the -s 1 option to green bypasses the problem, so we are not blocked, but there is still a problem somewhere.
I'll try to run some more experiments over the weekend. I'm wondering if adding a timeout per test might help me pinpoint which test triggers the problem.
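Something like this SIGALRM-based mixin is what I have in mind (a rough, Unix-only sketch; the 30-second limit is an arbitrary guess):

```python
import signal
import unittest

class TimeoutMixin:
    test_timeout = 30  # seconds; arbitrary for this sketch

    def setUp(self):
        super().setUp()

        def _handler(signum, frame):
            raise TimeoutError(f'{self.id()} exceeded {self.test_timeout}s')

        self._old_handler = signal.signal(signal.SIGALRM, _handler)
        signal.alarm(self.test_timeout)

    def tearDown(self):
        signal.alarm(0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, self._old_handler)
        super().tearDown()


class ExampleTest(TimeoutMixin, unittest.TestCase):
    def test_does_not_hang(self):
        self.assertTrue(True)
```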
The good news is that, thanks to your new CLI option, I was able to reproduce on my laptop under macOS with green -r -v -s 32 --maxtasksperchild 1, which will help a lot. This will allow me to cycle through experiments much faster.
Here is what I suspect might be happening. We run our CI under Kubernetes, where the container does not get the full CPU capacity of the host, but many processes in the container still believe they have access to all of the host's CPU/RAM resources. Without -s being set, green probably uses the number of cores the host claims (16), while the container is only allocated 4 k8s CPUs. Setting -s 4 seems to fix the issue as well, and honestly anything more than 4 is overkill, so I'll make that the default our wrapper passes to green.
I did confirm that os.cpu_count(), which Pool() calls when processes is None, returns 16 on our CI pipelines.
As far as I'm concerned, green is now even better today than it was last week :-) Thanks a lot.
Oh, good. I'm glad you found a workaround. If we can find both a) a way to determine that we are running in k8s (an environment variable that k8s always sets, maybe?), and b) a way for Python to determine a better CPU count in k8s, then we could have green adjust its automatic CPU count under k8s.
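For example, something along these lines (an untested sketch; KUBERNETES_SERVICE_HOST and the cgroup v2 cpu.max path are assumptions about a typical cluster setup):

```python
import math
import os

def running_in_kubernetes() -> bool:
    # Assumption: k8s injects this variable into every pod container by default.
    return 'KUBERNETES_SERVICE_HOST' in os.environ

def effective_cpu_count() -> int:
    # cgroup v2 exposes the CPU quota as "<quota> <period>" (or "max <period>").
    try:
        with open('/sys/fs/cgroup/cpu.max') as f:
            quota, period = f.read().split()
        if quota != 'max':
            return max(1, math.ceil(int(quota) / int(period)))
    except (OSError, ValueError):
        pass
    return os.cpu_count() or 1

if running_in_kubernetes():
    print('suggested worker count:', effective_cpu_count())
```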
If we do go down this path, let's open a new issue to have the discussion in.
You are most welcome!
We are encountering a high rate of failure with some of our tests after upgrading to Python 3.9.6.
This is failing most of the time under Docker, not so much under macOS, with the same versions of green (3.2.6) and Python (3.9.6).
I'm not sure if the issue is with green itself, but the stack trace seems to come from green.