We got the issue 8 out of 10 times on our CI pipeline. By adding -s 1 to disable parallel runs for this directory we passed 5 out of 5 times, so we have a temporary workaround.
With the help of this post: https://stackoverflow.com/questions/15349997/assertionerror-when-threading-in-python
we found out that the library under test has a subclass of threading.Thread, but with a different signature. We suspect that this code needs to be refactored to behave better. It looks like the bug is not in green, but the fact that green gets stuck is concerning.
I suppose it got stuck because the code under test got stuck.
Glad to hear the problem was somewhere else. 😉
@CleanCut Unfortunately, I combed over the library that is causing green to freeze, and the only thing I could find is a small bit of code where we subclass threading.Thread so we can collect the return value of run(). We do not use threads anywhere else in that library, and we are not using multiprocessing either.
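For context, the general shape of that pattern is something like this (a simplified sketch, not our actual code, which also changes the signature):

```python
import threading

# Simplified sketch of a Thread subclass that hands back run()'s return value.
# It relies on Thread's private _target/_args/_kwargs attributes, just like
# Thread.run() itself does.
class ReturningThread(threading.Thread):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._result = None

    def run(self):
        if self._target is not None:
            self._result = self._target(*self._args, **self._kwargs)

    def join(self, timeout=None):
        super().join(timeout)
        return self._result


t = ReturningThread(target=lambda: 42)
t.start()
print(t.join())  # 42
```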
I was not able to reproduce locally on macOS either natively or under the same docker container we use in our CI. It is very consistent on our CI system with python 3.9.6.
I see that you wrote green.process.LoggingDaemonlessPool._repopulate_pool() back in 2015, and that the official Pool class now has a _repopulate_pool_static() that was added two years ago, bypassing your _repopulate_pool(). I believe that LoggingDaemonlessPool needs to be updated with its own _repopulate_pool_static().

That's my 1 AM analysis and might be wrong, but so far I really think the bug is in green.
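To illustrate the bypass I am talking about, here is a toy model (not the real multiprocessing or green code; the method names just mirror the ones above):

```python
# Toy model only: the base class calls the new static helper through its own
# name, so a subclass that only overrides the old hook never gets a say.
class BasePool:
    @staticmethod
    def _repopulate_pool_static():
        print('stock repopulation logic')

    @staticmethod
    def _maintain_pool():
        BasePool._repopulate_pool_static()  # spelled out against the base class


class LoggingDaemonlessPool(BasePool):
    def _repopulate_pool(self):  # pre-3.8-style override, now dead code
        print('green-specific repopulation logic')


LoggingDaemonlessPool._maintain_pool()  # prints 'stock repopulation logic'
```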
I ran some experiments and tried patching green/process.py, with no success. I'll investigate further and will let you know.
👍🏻 I'll wait to hear what you find.
I've patched multiprocessing.process.BaseProcess to get more context information:
```python
# Patched into multiprocessing/process.py, where current_process() is defined.
def __init__(self, group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None):
    # Debug block added before the original assert to dump context when group is set.
    if group is not None:
        import inspect
        import sys
        import traceback
        print()
        print('current process:', current_process())
        print(f'sys.argv: {sys.argv}')
        print(
            f'group: {group!r}, target: {target}, name: {name}, args: {args}, kwargs: {kwargs}'
        )
        print('target signature: ', inspect.signature(target))
        traceback.print_stack()
        initializer_or_finalizer = args[2]
        print(
            'InitializerOrFinalizer module:',
            repr(initializer_or_finalizer.module_part),
            'function:',
            repr(initializer_or_finalizer.function_part),
        )
        print()
    # Original first line of BaseProcess.__init__: the assertion we are hitting.
    assert group is None, 'group argument must be None for now'
```
This is what I get:
```
current process: <_MainProcess name='MainProcess' parent=None started>
sys.argv: ['/root/.pyenv/versions/3.9.6/lib/python3.9/site-packages/green/__main__.py']
group: <multiprocessing.context.ForkContext object at 0x7ffb5da34250>, target: <function worker at 0x7ffb5d7543a0>, name: None, args: (<multiprocessing.queues.SimpleQueue object at 0x7ffb23f6a1f0>, <multiprocessing.queues.SimpleQueue object at 0x7ffb254337c0>, <green.runner.InitializerOrFinalizer object at 0x7ffb585d75e0>, (), None, True), kwargs: {}
target signature:  (inqueue, outqueue, initializer=None, initargs=(), maxtasks=None, wrap_exception=False, finalizer=None, finalargs=())
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 513, in _handle_workers
    cls._maintain_pool(ctx, Process, processes, pool, inqueue,
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 337, in _maintain_pool
    Pool._repopulate_pool_static(ctx, Process, processes, pool,
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 319, in _repopulate_pool_static
    w = Process(ctx, target=worker,
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/process.py", line 100, in __init__
    traceback.print_stack()
InitializerOrFinalizer module: '' function: ''
```
Somehow we end up in the __init__ of BaseProcess with the group argument set to a ForkContext, which causes the assertion to fail. I do not have deep experience with the multiprocessing library, so my understanding is limited.
The other issue is that this has been very difficult to reproduce. It fails very consistently under our CI pipeline conditions, but changing certain elements of the setup avoids the issue.
Let me know if the data I dumped helps you and if there are other things you would like me to dump. Unfortunately I cannot share a reproducible case, since the Docker image and our code are private.
I'll pause my investigation for a few days. I tried many different combinations but I am still unable to pinpoint any specific test or combination of tests that trigger the issue. It seems that the number of tests might be a factor since we have a few hundred tests in that directory.
The next thing I will try is to revert the two threading patches added in python 3.9.6 and see if they are the ones triggering the issue.
Good news! I was able to reproduce the problem! 🎉
That diff you linked to was super helpful. You were correct that the problem was because of the addition of _repopulate_pool_static, which broke repopulating pool subprocesses after they have died. Since green sets the processes to never die, this usually wouldn't be a problem, but somehow your workload is managing to kill all of the subprocesses -- and then, since the pool has no workers pulling jobs from the queue, everything just hangs forever.

I was able to reproduce the problem in aa572e1a1e40e091882a565a9200e3242453cf22 by adding a -X, --maxtasksperchild option, setting it to 1, and then trying to run more suites than the number of subprocesses that get launched. I hit a hang in Python 3.8 and 3.9, but Python 3.7 works fine.
Now to fix the problem...
Fixed in Green 3.3.0, which I just released. 🥳 🎈 🎉
Thanks a lot for the quick fix!!! I'll try 3.3.0 in our CI pipeline tomorrow and will let you know how it goes.
🤞🏻
I've tested green 3.3.0 and I can confirm that the AssertionError stack trace is gone; however, the tests are still stuck in parallel mode and never return. Passing the -s 1 option to green bypasses the problem, so we are not blocked, but there is still a problem somewhere.
I'll try to run some more experiments over the weekend. I'm wondering if adding a timeout per test might help me pinpoint which test triggers the problem.
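Something like this SIGALRM-based mixin is what I have in mind (a rough, Unix-only sketch; the 30-second limit is an arbitrary guess):

```python
import signal
import unittest

class TimeoutMixin:
    test_timeout = 30  # seconds; arbitrary for this sketch

    def setUp(self):
        super().setUp()

        def _handler(signum, frame):
            raise TimeoutError(f'{self.id()} exceeded {self.test_timeout}s')

        self._old_handler = signal.signal(signal.SIGALRM, _handler)
        signal.alarm(self.test_timeout)

    def tearDown(self):
        signal.alarm(0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, self._old_handler)
        super().tearDown()


class ExampleTest(TimeoutMixin, unittest.TestCase):
    def test_does_not_hang(self):
        self.assertTrue(True)
```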
The good news is that, thanks to your new CLI option, I was able to reproduce on my laptop under macOS with green -r -v -s 32 --maxtasksperchild 1, which will help a lot. This will allow me to cycle through experiments much faster.
Here is what I suspect might be happening. We run our CI under Kubernetes, where the container does not get the full CPU capacity of the host, but many processes in the container still believe they have access to all of the host's CPU/RAM resources. Without -s being set, green probably uses the number of cores the host claims (16), while the container is only allocated 4 k8s CPUs. Setting -s 4 seems to fix the issue as well, and honestly anything more than 4 is overkill, so I'll make that the default our wrapper passes to green.
I did confirm that os.cpu_count(), which Pool() calls when processes is None, returns 16 on our CI pipelines.
As far as I'm concerned, green is now even better today than it was last week :-) Thanks a lot.
Oh, good. I'm glad you found a workaround. If we can find both a) a way to determine that we are running in k8s (an environment variable that k8s always sets, maybe?), and b) a way for Python to determine a better CPU count in k8s, then we could have green adjust its automatic CPU count under k8s.
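For example, something along these lines (an untested sketch; KUBERNETES_SERVICE_HOST and the cgroup v2 cpu.max path are assumptions about a typical cluster setup):

```python
import math
import os

def running_in_kubernetes() -> bool:
    # Assumption: k8s injects this variable into every pod container by default.
    return 'KUBERNETES_SERVICE_HOST' in os.environ

def effective_cpu_count() -> int:
    # cgroup v2 exposes the CPU quota as "<quota> <period>" (or "max <period>").
    try:
        with open('/sys/fs/cgroup/cpu.max') as f:
            quota, period = f.read().split()
        if quota != 'max':
            return max(1, math.ceil(int(quota) / int(period)))
    except (OSError, ValueError):
        pass
    return os.cpu_count() or 1

if running_in_kubernetes():
    print('suggested worker count:', effective_cpu_count())
```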
If we do go down this path, let's open a new issue to have the discussion in.
You are most welcome!
We are encountering a high rate of failure with some of our tests after upgrading to Python 3.9.6.
This is failing most of the time under Docker, not so much under macOS, with the same versions of green (3.2.6) and Python (3.9.6).
I'm not sure if the issue is with green itself, but the stack trace seems to come from green.