While it's fairly contrived, this code change will cause dispatcher workers to fork and then exit immediately. This means that if you throw many simultaneous tasks onto the message bus in short bursts, processes will constantly be created and destroyed, making the race condition described at https://bugs.python.org/issue37429 much more likely to occur:
```
~/dev/awx git diff awx/main/dispatch/

diff --git a/awx/main/dispatch/worker/base.py b/awx/main/dispatch/worker/base.py
index a48ed7d1d3..c18004a59d 100644
--- a/awx/main/dispatch/worker/base.py
+++ b/awx/main/dispatch/worker/base.py
@@ -119,6 +119,7 @@ class BaseWorker(object):
         ppid = os.getppid()
         signal_handler = WorkerSignalHandler()
         while not signal_handler.kill_now:
+            signal_handler.kill_now = True
             # if the parent PID changes, this process has been orphaned
             # via e.g., segfault or sigkill, we should exit too
             if os.getppid() != ppid:
```
The symptoms of this bug range in severity, but they generally mean that at least one process in the dispatcher's worker pool becomes permanently hung and unresponsive, and cannot be recovered without forcibly killing the process.
In the worst case scenario, the hung task is the global task manager; when it is hung, no new jobs can launch at all, and launched jobs remain in the pending state until services are restarted.
Processes in this state can be discovered using `awx-manage run_dispatcher --status`. In the reproductions I've seen, the processes in question are always hung on a futex:
```
~ strace -s 100000 -ffttv -p 9268
strace: Process 9268 attached
11:06:36.426652 futex(0x478f800, FUTEX_WAIT_PRIVATE, 0, NULL
```
Closer inspection with gdb shows the actual Python stack trace for this deadlock:
```
~ yum install gdb
~ debuginfo-install python2-2.7.15-2.fc28.x86_64  # py2
~ debuginfo-install python3-3.6.5-1.fc28.x86_64   # py3
~ gdb -p 9268   # attach to the hung worker (presumably how the trace below was captured)
(gdb) py-bt     # Python-level backtrace via the CPython gdb helpers
```
```
#4 Waiting for a lock (e.g. GIL)
#5 Waiting for a lock (e.g. GIL)
#7 Frame 0x7f8521b04bf0, for file /usr/lib64/python2.7/Queue.py, line 118, in put (self=<Queue(unfinished_tasks=2, queue=<collections.deque at remote 0x7f8521b5f1a0>, maxsize=0, all_tasks_done=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f8521c03950>, acquire=<built-in method acquire of thread.lock object at remote 0x7f8521c03950>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f8521c03950>) at remote 0x7f8521afb710>, mutex=<thread.lock at remote 0x7f8521c03950>, not_full=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f8521c03950>, acquire=<built-in method acquire of thread.lock object at remote 0x7f8521c03950>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f8521c03950>) at remote 0x7f8521afb650>, not_empty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f8521c03950>, acquire=<built-in method acquire of thread.lock object at...(truncated)
self.not_full.acquire()
#11 Frame 0x41f5d10, for file /var/lib/awx/venv/awx/lib/python2.7/site-packages/concurrent/futures/thread.py, line 129, in submit (self=<VerboseThreadPoolExecutor(_shutdown_lock=<thread.lock at remote 0x7f8521c03970>, _shutdown=False, _max_workers=2, _thread_name_prefix='ThreadPoolExecutor-0', _threads=set([<...>, <...>]), _work_queue=<...>) at remote 0x7f8521afbb10>, fn=<function at remote 0x7f8520a90e60>, args=('POST', u'https://splunkhf.app.vumc.org/services/collector/event'), kwargs={'verify': False, 'json': None, 'data': '{"event": {"cluster_host_id": "ans1001lp.hs.it.vumc.io", "level": "INFO", "tower_uuid": "", "@timestamp": "2019-04-18T14:47:34.069Z", "host": "ans1001lp", "logger_name": "awx.main.tasks", "message": "inventory_update 47961 (running) finished running, producing 7044 events.", "type": "splunk"}}', 'timeout': 5}, f=<Future(_exception=None, _result=None, _condition=<_Condition(_Condition__lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f8520a4c4...(truncated)
self._work_queue.put(w)
#16 Frame 0x7f8521b63c20, for file /lib/python2.7/site-packages/awx/main/utils/handlers.py, line 79, in submit (self=<VerboseThreadPoolExecutor(_shutdown_lock=<thread.lock at remote 0x7f8521c03970>, _shutdown=False, _max_workers=2, _thread_name_prefix='ThreadPoolExecutor-0', _threads=set([<...>, <...>]), _work_queue=<...>) at remote 0x7f8521afbb10>, func=<function at remote 0x7f8520a90b18>, args=('POST', u'https://redacted/splunk/http/event/collector'), kwargs={'verify': False, 'json': None, 'data': '{"event": {"cluster_host_id": "ans1001lp.hs.it.vumc.io", "level": "INFO", "tower_uuid": "", "@timestamp": "2019-04-18T14:47:34.069Z", "host": "ans1001lp", "logger_name": "awx.main.tasks", "message": "inventory_update 47961 (running) finished running, producing 7044 events.", "type": "splunk"}}', 'timeout': 5}, _wrapped=<function at remote 0x7f8520a90e60>)
Python Exception <type 'exceptions.IOError'> [Errno 2] No such file or directory: '/lib/python2.7/site-packages/awx/main/utils/handlers.py':
Error occurred in Python command: [Errno 2] No such file or directory: '/lib/python2.7/site-packages/awx/main/utils/handlers.py'
```
Further details on the specifics of this bug, and a longer gdb stack example from a simplified reproduction, can be found in the upstream bug report at https://bugs.python.org/issue37429.
Ultimately, I've got some ideas for a way to work around this bug in Python, but long-term I'd like to rethink how we've implemented the external logging functionality, including potentially moving the log shipping out of band into a separate background daemon, so that we can sidestep these kinds of critical interpreter-level bugs going forward (the bug still appears to be present in a Python 3.7 build I compiled this morning).
Until that time, given that this is a race condition in Python at fork time, I think our best option is simply to work around it. My current plan is to introduce a change which will cause the dispatcher's main thread to no longer ship external logs when the feature is enabled. This should help us avoid the fork race entirely, because the process we're forking should never enter the critical path for the lock in question.
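For illustration only, here is a rough sketch of how that kind of change could be wired up with a logging filter; this is hypothetical, not the actual AWX patch, and the names `SkipInForkingProcess` and `external_handler` are made up for the example:

```python
# Hypothetical sketch of the workaround idea above (not the actual AWX
# change): keep the external log handler from doing any work in the
# dispatcher's forking process, so os.fork() never races the handler's
# background threads.
import logging
import os


class SkipInForkingProcess(logging.Filter):
    """Drop records in the process that owns the worker pool."""

    def __init__(self, forking_pid):
        super().__init__()
        self.forking_pid = forking_pid

    def filter(self, record):
        # Forked workers have a different pid, so they keep shipping
        # logs normally; only the parent stays out of the critical path.
        return os.getpid() != self.forking_pid


# e.g., during dispatcher startup (illustrative):
# external_handler.addFilter(SkipInForkingProcess(os.getpid()))
```

The key property is that the forking process never touches the handler's thread pool or its queue lock, so whatever locks it happens to hold at fork time can't include the one that deadlocks the child.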
ISSUE TYPE
ENVIRONMENT
STEPS TO REPRODUCE
AWX's task dispatcher manages a pool of worker processes that execute playbooks in the background when jobs are launched. Much like its predecessor, Celery (http://www.celeryproject.org), it forks new worker processes on demand to scale up when multiple jobs are running.
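As a rough illustration of that fork-on-demand pattern (a toy sketch, not the dispatcher's actual pool code):

```python
# Toy sketch of a fork-on-demand worker pool (not AWX's dispatcher
# code): the parent forks a child per task and reaps it when the task
# finishes. The diff above simply forces every worker to exit after a
# single task, maximizing this fork/exit churn.
import os


def run_in_worker(task):
    pid = os.fork()
    if pid == 0:                    # child: run the task, then exit
        try:
            task()
        finally:
            os._exit(0)
    return pid                      # parent: remember the worker's pid


if __name__ == "__main__":
    pids = [run_in_worker(lambda: print("job in pid", os.getpid()))
            for _ in range(4)]
    for pid in pids:
        os.waitpid(pid, 0)          # reap each short-lived worker
```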
When external logging is enabled and configured to send logs to an HTTP/HTTPS endpoint (e.g., Logstash or Splunk), every AWX Python process maintains a pool of background threads that emit log lines asynchronously to the destination log aggregator.
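The gdb trace earlier passes through exactly this kind of handler: an `emit()` that submits an HTTP POST to a small thread pool, whose internal work queue is the `Queue.put()` visible in the stack. Here is a simplified sketch of that pattern, assuming `requests` is available; it is not AWX's actual `handlers.py`:

```python
# Simplified sketch of the async log-shipping pattern (not AWX's actual
# handlers.py): emit() never blocks on the network; it hands the record
# to a ThreadPoolExecutor and the POST happens on a background thread.
import json
import logging
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed available, as in AWX's external log shipping


class AsyncHTTPHandler(logging.Handler):
    def __init__(self, url, max_workers=2, timeout=5):
        super().__init__()
        self.url = url
        self.timeout = timeout
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    def emit(self, record):
        payload = json.dumps({"event": {"message": self.format(record)}})
        # submit() puts a work item on the executor's internal queue;
        # that queue's lock is the one implicated in the deadlock above.
        self.executor.submit(
            requests.post, self.url, data=payload, timeout=self.timeout
        )
```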
CPython (both 2.7 and 3.6) has a race condition on fork which can cause a deadlock when this external logging mechanism is in use:
https://bugs.python.org/issue37429
Due to the nature of this race condition, and the very small critical path of the underlying locking mechanism in CPython, the bug is very difficult to observe. In the reproductions I've seen, it happens when external logging is enabled at `DEBUG` level and `os.fork()` occurs at nearly the same moment in time that one of the background threads is beginning to emit a log.
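To make that failure mode concrete, here is a minimal, self-contained sketch of this class of race; it is not AWX code and not the exact reproduction from the cpython bug report, just an illustration of forking while a background thread holds a queue's internal lock:

```python
# Minimal sketch of the fork-vs-thread lock race (not AWX code, and not
# the exact reproduction from https://bugs.python.org/issue37429): if
# os.fork() happens while the emitter thread holds the queue's mutex,
# the child inherits a locked mutex with no thread left to release it,
# and its put() blocks forever on a futex, just like the hung worker in
# the strace output above.
import os
import queue
import threading

q = queue.Queue()


def emitter():
    # Stands in for the async logging thread: constantly acquiring and
    # releasing the queue's internal locks.
    while True:
        q.put("log line")
        q.get()


threading.Thread(target=emitter, daemon=True).start()

for attempt in range(10000):
    pid = os.fork()
    if pid == 0:
        # Child: this put() can hang forever if the fork raced the
        # emitter thread's hold on q.mutex.
        q.put("from child")
        os._exit(0)
    os.waitpid(pid, 0)  # a hung child makes this waitpid block too
print("no hang observed in", attempt + 1, "forks")
```

Whether this actually hangs on a given run is purely a matter of timing, which is why the contrived `kill_now` change above, which forces constant forking under load, makes the race so much easier to hit.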