census-instrumentation / opencensus-python

A stats collection and distributed tracing framework
Apache License 2.0

RuntimeError: maximum recursion depth exceeded if AzureLogHandler's queue gets full when the handler is installed in the root logger #1217

Open mgedmin opened 11 months ago

mgedmin commented 11 months ago

Describe your environment.

Python 3.9 on Debian 11, opencensus 0.11.2, opencensus-context 0.1.3, opencensus-ext-azure 1.1.9, opencensus-ext-requests 0.8.0, Azure-hosted Kubernetes. The logging subsystem is configured to log to sys.stdout (i.e. the Kubernetes console) as well as to Azure Application Insights. Everything at the DEBUG level is logged.
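The relevant part of the logging setup is roughly equivalent to the sketch below (the connection string is a placeholder; the real configuration is more elaborate and also installs a custom JSON formatter):

import logging
import sys

from opencensus.ext.azure.log_exporter import AzureLogHandler

# Placeholder; the real connection string comes from configuration.
CONNECTION_STRING = 'InstrumentationKey=00000000-0000-0000-0000-000000000000'

root = logging.getLogger()                          # handlers go on the root logger
root.setLevel(logging.DEBUG)                        # everything at DEBUG is logged
root.addHandler(logging.StreamHandler(sys.stdout))  # the Kubernetes console
root.addHandler(AzureLogHandler(connection_string=CONNECTION_STRING))  # Application Insights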

Steps to reproduce.

Install AzureLogHandler on the root logger, enable DEBUG-level logging, and produce log records faster than the exporter's worker thread can drain its queue, so that the queue fills up.

What is the expected behavior?

Telemetry that does not fit in the queue is dropped (perhaps with a single warning), and the application keeps running.

What is the actual behavior?

A RecursionError:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/opencensus/common/schedule/__init__.py", line 135, in put
    self._queue.put(item, block, timeout)
  File "/usr/lib/python3.9/queue.py", line 137, in put
    raise Full
queue.Full

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/opencensus/common/schedule/__init__.py", line 135, in put
    self._queue.put(item, block, timeout)
  File "/usr/lib/python3.9/queue.py", line 137, in put
    raise Full
queue.Full

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/opencensus/common/schedule/__init__.py", line 135, in put
    self._queue.put(item, block, timeout)
  File "/usr/lib/python3.9/queue.py", line 137, in put
    raise Full
queue.Full

... (skipping a bunch of repetitions) ...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/opencensus/common/schedule/__init__.py", line 135, in put
    self._queue.put(item, block, timeout)
  File "/usr/lib/python3.9/queue.py", line 137, in put
    raise Full
queue.Full

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/bolsendnotifications", line 8, in <module>
    sys.exit(main())
... (skipping some application-specific stack frames until we enter the logging subsystem) ...
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 456, in _make_request
    log.debug(
  File "/usr/lib/python3.9/logging/__init__.py", line 1430, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib/python3.9/logging/__init__.py", line 1585, in _log
    self.handle(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 1595, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 1657, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 948, in handle
    self.emit(record)
  File "/usr/local/lib/python3.9/dist-packages/opencensus/ext/azure/log_exporter/__init__.py", line 109, in emit
    self._queue.put(record, block=False)
  File "/usr/local/lib/python3.9/dist-packages/opencensus/common/schedule/__init__.py", line 137, in put
    logger.warning('Queue is full. Dropping telemetry.')
  File "/usr/lib/python3.9/logging/__init__.py", line 1454, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.9/logging/__init__.py", line 1585, in _log
    self.handle(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 1595, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 1657, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 948, in handle
    self.emit(record)
  File "/usr/local/lib/python3.9/dist-packages/opencensus/ext/azure/log_exporter/__init__.py", line 109, in emit
    self._queue.put(record, block=False)
  File "/usr/local/lib/python3.9/dist-packages/opencensus/common/schedule/__init__.py", line 137, in put
    logger.warning('Queue is full. Dropping telemetry.')
  File "/usr/lib/python3.9/logging/__init__.py", line 1454, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.9/logging/__init__.py", line 1585, in _log
    self.handle(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 1595, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 1657, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 948, in handle
    self.emit(record)
  File "/usr/local/lib/python3.9/dist-packages/opencensus/ext/azure/log_exporter/__init__.py", line 109, in emit
    self._queue.put(record, block=False)
  File "/usr/local/lib/python3.9/dist-packages/opencensus/common/schedule/__init__.py", line 137, in put
    logger.warning('Queue is full. Dropping telemetry.')
  File "/usr/lib/python3.9/logging/__init__.py", line 1454, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.9/logging/__init__.py", line 1585, in _log
    self.handle(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 1595, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 1657, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 948, in handle
    self.emit(record)
  File "/usr/local/lib/python3.9/dist-packages/opencensus/ext/azure/log_exporter/__init__.py", line 109, in emit
    self._queue.put(record, block=False)
... (skipping a bunch of repeated stack frames like the above) ...
  File "/usr/local/lib/python3.9/dist-packages/opencensus/common/schedule/__init__.py", line 137, in put
    logger.warning('Queue is full. Dropping telemetry.')
  File "/usr/lib/python3.9/logging/__init__.py", line 1454, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.9/logging/__init__.py", line 1585, in _log
    self.handle(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 1595, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 1657, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 948, in handle
    self.emit(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 1079, in emit
    msg = self.format(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 923, in format
    return fmt.format(record)
  File "/usr/local/lib/python3.9/dist-packages/bolfak/logging.py", line 105, in format
    return super().format(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 662, in format
    s = self.formatMessage(record)
  File "/usr/lib/python3.9/logging/__init__.py", line 631, in formatMessage
    return self._style.format(record)
  File "/usr/local/lib/python3.9/dist-packages/bolfak/logging.py", line 69, in format
    fmt = copy.deepcopy(self._fmt)
  File "/usr/lib/python3.9/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.9/copy.py", line 205, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/usr/lib/python3.9/copy.py", line 161, in deepcopy
    rv = reductor(4)
RecursionError: maximum recursion depth exceeded while calling a Python object

In our application we set sys.excepthook to a custom handler that logs the exception, which causes another flood of queue-full warnings, followed by a

Fatal Python error: _Py_CheckRecursiveCall: Cannot recover from stack overflow.
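(The excepthook itself is nothing special; it is roughly the standard "log the unhandled exception" pattern, something like this sketch, where the function name is made up:)

import logging
import sys

def log_unhandled_exception(exc_type, exc_value, exc_tb):
    # Logging here re-enters the same handlers, so when the queue is full
    # this triggers another round of "Queue is full" warnings.
    logging.getLogger(__name__).error(
        'Unhandled exception', exc_info=(exc_type, exc_value, exc_tb))

sys.excepthook = log_unhandled_exception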

Additional context.

There's little to add. opencensus.common.schedule.Queue.put() calls logger.warning(), and when you use AzureLogHandler, logger.warning() calls back into Queue.put(), leading to infinite recursion. I suppose if the queue consumer could take items out of the queue faster than new queue-full warnings are being logged, the recursion might end, but that's unlikely.
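One possible workaround, sketched below under the assumption that the "Queue is full" warning is emitted from a logger under the opencensus namespace (which the paths in the traceback suggest), is to attach a filter to the AzureLogHandler that drops opencensus's own records, breaking the feedback loop:

import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

class DropOpenCensusRecords(logging.Filter):
    # Keep records from opencensus's own loggers (including the
    # "Queue is full. Dropping telemetry." warning) out of this handler,
    # so they can never feed back into its queue.
    def filter(self, record):
        return not record.name.startswith('opencensus')

handler = AzureLogHandler(
    connection_string='InstrumentationKey=00000000-0000-0000-0000-000000000000')
handler.addFilter(DropOpenCensusRecords())
logging.getLogger().addHandler(handler)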

The bolfak/logging.py module near the end of the stack trace is a custom log formatter that produces JSON logs by deep-copying a prototype log record structure (a list/dict of strings) and filling it in with values from the log record. It's not really interesting or relevant to the problem; it just happens to be the deepest part of the call stack, where the recursion error is first caught.
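(Purely for illustration, and not our actual code, the formatter follows a pattern like the hypothetical sketch below, which is why copy.deepcopy is the frame where the RecursionError first surfaces:)

import copy
import json
import logging

class PrototypeJSONFormatter(logging.Formatter):
    # Hypothetical sketch of the pattern: deep-copy a prototype structure
    # and fill it in with values from the log record.
    PROTOTYPE = {'timestamp': '', 'level': '', 'logger': '', 'message': ''}

    def format(self, record):
        doc = copy.deepcopy(self.PROTOTYPE)
        doc['timestamp'] = self.formatTime(record)
        doc['level'] = record.levelname
        doc['logger'] = record.name
        doc['message'] = record.getMessage()
        return json.dumps(doc)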

lzchen commented 10 months ago

@mgedmin

The handler is not designed to be added to the root logger, since it would capture logs emitted by the handler itself, as you have pointed out. Please use named loggers to track your application code.
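For example, something along these lines (a sketch of the suggested setup; 'myapp' stands in for your application's logger name and the connection string is a placeholder):

import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

# Attach the Azure handler to an application-specific logger instead of
# the root logger, so that records from opencensus's own loggers (and from
# other libraries) never reach it.
logger = logging.getLogger('myapp')
logger.setLevel(logging.DEBUG)
logger.addHandler(AzureLogHandler(
    connection_string='InstrumentationKey=00000000-0000-0000-0000-000000000000'))

logger.info('this record is exported to Application Insights')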