GSA / notifications-api

The API powering Notify.gov

Exception Investigation: app.exceptions:NotificationTechnicalFailureException #1132

Closed ccostino closed 4 weeks ago

ccostino commented 2 months ago

This is one of the errors we've seen captured in New Relic that we'd like to dig into and understand, if not also resolve.

Error message: RETRY FAILED: Max retries reached. The task send_email_to_provider failed for notification XXXXXXXXXXXX. Notification has been updated to technical-failure

Exception: app.exceptions:NotificationTechnicalFailureException

Traceback (most recent call last):
File "/home/vcap/deps/0/bin/celery", line 8, in <module>
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/__main__.py", line 15, in main
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bin/celery.py", line 236, in main
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1078, in main
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 783, in invoke
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/decorators.py", line 33, in new_func
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bin/base.py", line 135, in caller
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bin/worker.py", line 356, in worker
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/worker/worker.py", line 202, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bootsteps.py", line 116, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bootsteps.py", line 365, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/base.py", line 130, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/prefork.py", line 109, in on_start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/asynpool.py", line 464, in __init__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 1045, in __init__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/asynpool.py", line 482, in _create_worker_process
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 1157, in _create_worker_process
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/process.py", line 120, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/context.py", line 331, in _Popen
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/popen_fork.py", line 22, in __init__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/popen_fork.py", line 77, in _launch
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/process.py", line 323, in _bootstrap
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/process.py", line 110, in run
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 291, in __call__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 361, in workloop
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/app/trace.py", line 651, in fast_trace_task
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/app/trace.py", line 453, in trace_task
File "/home/vcap/app/app/__init__.py", line 412, in __call__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/newrelic/hooks/application_celery.py", line 123, in wrapper
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/app/trace.py", line 736, in __protected_call__
File "/home/vcap/app/app/celery/provider_tasks.py", line 202, in deliver_email
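
The error message above describes a retry-until-exhausted pattern: the task is retried, and once the retry limit is hit the notification is marked technical-failure and the exception is raised. The following is a minimal, framework-free sketch of that pattern, not the code in provider_tasks.py. `deliver_with_retries`, the dict-based notification, and the retry limit of 3 are all assumptions made for illustration, and the exception class is a local stand-in for the one in app.exceptions.

```python
class NotificationTechnicalFailureException(Exception):
    """Local stand-in for app.exceptions.NotificationTechnicalFailureException."""


def deliver_with_retries(send, notification, max_retries=3):
    """Try send() once plus up to max_retries retries (hypothetical helper).

    Returns the number of retries used on success. When every attempt
    fails, marks the notification as technical-failure and raises.
    """
    for attempt in range(max_retries + 1):
        try:
            send(notification)
            return attempt
        except Exception:
            # A real Celery task would call self.retry() with a delay here
            # instead of looping immediately.
            continue
    notification["status"] = "technical-failure"
    raise NotificationTechnicalFailureException(
        f"RETRY FAILED: Max retries reached for notification {notification['id']}"
    )


def always_fails(notification):
    """Simulates a provider call that fails on every attempt."""
    raise TypeError("simulated failure")


# Illustrative notification record; the field names are assumptions.
notification = {"id": "example-id", "status": "created"}
try:
    deliver_with_retries(always_fails, notification)
except NotificationTechnicalFailureException as exc:
    failure_message = str(exc)
```

If the underlying error is deterministic (the same input fails every time), every retry fails the same way, so the exception only surfaces once the retry limit is reached.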

Implementation Sketch and Acceptance Criteria

Security Considerations

terrazoon commented 1 month ago

This happened many times on a single day, June 18th. As with the other exception reports, when you drill down into what was happening on the system at the time, the first thing you see in the chain of events is:

[2024-06-18 21:11:10 +0000] [7] [ERROR] Worker (pid:23741) was sent SIGKILL! Perhaps out of memory?
[2024-06-18 21:11:10 +0000] [23752] [INFO] Booting worker with pid: 23752

I think the sequence of events might have been:

  1. There was some email notification that was being processed
  2. The notification tried to retrieve personalization info out of our in-memory cache
  3. But there was insufficient memory for the cache to work properly, so it retrieved a None instead of a JSON string
  4. The error complains that json.loads() cannot load a None
  5. This then retries for hours.
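
Steps 3 and 4 can be reproduced in isolation. This is a minimal sketch rather than the project's actual code: `fetch_personalisation`, `MissingPersonalisationError`, and the dict-backed cache are hypothetical stand-ins for whatever cache lookup the task performs. It shows that `json.loads(None)` raises a `TypeError`, and one way a guard could turn a cache miss into an explicit error instead.

```python
import json


class MissingPersonalisationError(Exception):
    """Hypothetical error for a cache miss; not part of notifications-api."""


def fetch_personalisation(cache, notification_id):
    """Return the cached personalisation dict, failing loudly on a miss."""
    raw = cache.get(notification_id)  # dict.get returns None when the key is absent
    if raw is None:
        raise MissingPersonalisationError(
            f"no cached personalisation for notification {notification_id}"
        )
    return json.loads(raw)


# Hypothetical cache contents, for illustration only.
cache = {"abc": json.dumps({"name": "Ada"})}

# Without a guard, a miss reaches json.loads as None and raises TypeError.
try:
    json.loads(cache.get("missing"))
except TypeError as exc:
    unguarded_error = type(exc).__name__

# With the guard, a hit decodes normally and a miss raises a named error.
hit = fetch_personalisation(cache, "abc")
```

A named error like this would let the task tell "the data is gone" apart from a transient provider failure, which matters if retrying cannot bring the cached value back.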

I recommend we set memory to at least 4 GB on production.

ccostino commented 1 month ago

@stvnrlly and I talked about this today; we're planning on doubling the memory in production, which would put it at 4 GB.

ccostino commented 4 weeks ago

We ended up not increasing the memory; however, we did increase the number of celery worker instances in production. This should help greatly, especially when the app is being redeployed or the platform is undergoing maintenance and the apps are restaged under the hood.

If we see this pop up again along with additional errors like Worker (pid:23741) was sent SIGKILL! Perhaps out of memory?, then we'll need to do another memory increase with whatever we have available.