bugsnag / bugsnag-python

Official BugSnag error monitoring and error reporting for Django, Flask, Tornado and other Python apps.
https://docs.bugsnag.com/platforms/python/
MIT License

BugSnag failed to notify on OOM with celery #372

Open ay2456 opened 7 months ago

ay2456 commented 7 months ago

Describe the bug

I'm running BugSnag with Celery on Kubernetes pods. I've noticed that when a pod runs out of memory and a worker process is killed with signal 9 (SIGKILL), BugSnag fails to report the error:
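For context, the integration is wired up along the lines of the documented BugSnag Celery hook (connect_failure_handler); the exact values below are illustrative rather than my production config:

import bugsnag
from bugsnag.celery import connect_failure_handler

# Configure the BugSnag client before the Celery worker starts processing tasks.
bugsnag.configure(
    api_key="YOUR-API-KEY-HERE",  # placeholder
    project_root="/app",          # placeholder
)

# Register BugSnag's task_failure handler so failed tasks are reported automatically.
connect_failure_handler()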

Error logs:

[2024-01-24 21:45:01,221: ERROR/MainProcess] Process 'ForkPoolWorker-4' pid:341 exited with 'signal 9 (SIGKILL)'
[2024-01-24 21:45:01,232: ERROR/MainProcess] Signal handler <function failure_handler at 0x7fd8291ae830> raised: AttributeError("'str' object has no attribute 'tb_frame'")
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
    raise WorkerLostError(
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) Job: 5.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/celery/utils/dispatch/signal.py", line 276, in send
    response = receiver(signal=self, sender=sender, **named)
  File "/opt/conda/lib/python3.10/site-packages/bugsnag/celery/__init__.py", line 14, in failure_handler
    bugsnag.auto_notify(exception, traceback=traceback,
  File "/opt/conda/lib/python3.10/site-packages/bugsnag/legacy.py", line 95, in auto_notify
    default_client.notify(
  File "/opt/conda/lib/python3.10/site-packages/bugsnag/client.py", line 84, in notify
    event = Event(
  File "/opt/conda/lib/python3.10/site-packages/bugsnag/event.py", line 107, in __init__
    stacktrace = self._generate_stacktrace(
  File "/opt/conda/lib/python3.10/site-packages/bugsnag/event.py", line 327, in _generate_stacktrace
    trace = traceback.extract_tb(tb)
  File "/opt/conda/lib/python3.10/traceback.py", line 72, in extract_tb
    return StackSummary.extract(walk_tb(tb), limit=limit)
  File "/opt/conda/lib/python3.10/traceback.py", line 364, in extract
    for f, lineno in frame_gen:
  File "/opt/conda/lib/python3.10/traceback.py", line 329, in walk_tb
    yield tb.tb_frame, tb.tb_lineno
AttributeError: 'str' object has no attribute 'tb_frame'
[2024-01-24 21:45:01,234: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL) Job: 5.')
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
    raise WorkerLostError(
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) Job: 5.
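From the traceback it looks like billiard reports the WorkerLostError to the task_failure signal with a pre-formatted traceback string rather than a traceback object, which BugSnag's handler then tries to walk with traceback.extract_tb. As a stopgap I'm considering a guarded handler along these lines; this is only a sketch under that assumption, not the official integration:

from types import TracebackType

import bugsnag
from celery.signals import task_failure


def guarded_failure_handler(sender=None, task_id=None, exception=None,
                            args=None, kwargs=None, traceback=None,
                            einfo=None, **kw):
    # billiard can hand us a formatted string in place of a real traceback
    # object; only forward it to BugSnag when it is actually walkable.
    options = {}
    if isinstance(traceback, TracebackType):
        options["traceback"] = traceback
    bugsnag.notify(
        exception,
        context=getattr(sender, "name", None) or task_id,
        **options,
    )


# Connected in place of bugsnag.celery.connect_failure_handler().
task_failure.connect(guarded_failure_handler, weak=False)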

Environment

clr182 commented 7 months ago

Hi @ay2456

Thanks for reaching out.

This is currently the expected behaviour, as we have no mechanism in place to pre-allocate memory for handling an out-of-memory condition, nor do we currently have any error persistence in place.

I should note that we do have an item on our backlog aimed at pre-allocating this memory when the server starts, so that when it does go down it would still have some memory free to use to send the report.

We currently have no ETA on this functionality. Once we have an update on this we will be sure to share the additional information here.

Are you seeing this issue often? If so, it may be worth increasing the available memory for the pod as a temporary workaround.
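If the OOM kills keep recurring, one complementary mitigation (a suggestion, not an official fix) is Celery's worker_max_memory_per_child setting, which recycles a worker process after it exceeds a resident-memory threshold, so the kernel OOM killer is less likely to SIGKILL it mid-task. A minimal sketch, with an illustrative broker URL and threshold:

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # broker URL is illustrative

# Replace a worker child once its resident memory exceeds ~200 MiB (value is in KiB);
# keep this comfortably below the pod's Kubernetes memory limit.
app.conf.worker_max_memory_per_child = 200_000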