GSA / notifications-api

The API powering Notify.gov

Exception Investigation: celery.exceptions:Retry #1136

Closed: ccostino closed this issue 2 weeks ago

ccostino commented 2 months ago

This is one of the errors we've seen captured in New Relic that we'd like to dig into and understand, if not also resolve.

Error message: Retry in 300s Exception: celery.exceptions:Retry

Traceback (most recent call last):
File "/home/vcap/deps/0/bin/celery", line 8, in <module>
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/__main__.py", line 15, in main
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bin/celery.py", line 236, in main
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1078, in main
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 783, in invoke
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/decorators.py", line 33, in new_func
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bin/base.py", line 135, in caller
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bin/worker.py", line 356, in worker
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/worker/worker.py", line 202, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bootsteps.py", line 116, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bootsteps.py", line 365, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/base.py", line 130, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/prefork.py", line 109, in on_start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/asynpool.py", line 464, in __init__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 1045, in __init__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/asynpool.py", line 482, in _create_worker_process
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 1157, in _create_worker_process
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/process.py", line 120, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/context.py", line 331, in _Popen
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/popen_fork.py", line 22, in __init__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/popen_fork.py", line 77, in _launch
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/process.py", line 323, in _bootstrap
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/process.py", line 110, in run
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 291, in __call__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 361, in workloop
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/app/trace.py", line 651, in fast_trace_task
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/app/trace.py", line 453, in trace_task
File "/home/vcap/app/app/__init__.py", line 412, in __call__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/newrelic/hooks/application_celery.py", line 123, in wrapper
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/app/trace.py", line 736, in __protected_call__
File "/home/vcap/app/app/celery/provider_tasks.py", line 189, in deliver_email
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/app/task.py", line 757, in retry

There's a second, related error to this one:

Error message: Retry in 300s: NotificationTechnicalFailureException('No event found for message_id XXXXXXX notification_id XXXXXX')

Traceback (most recent call last):
File "/home/vcap/deps/0/bin/celery", line 8, in <module>
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/__main__.py", line 15, in main
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bin/celery.py", line 236, in main
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1078, in main
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/core.py", line 783, in invoke
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/click/decorators.py", line 33, in new_func
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bin/base.py", line 135, in caller
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bin/worker.py", line 356, in worker
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/worker/worker.py", line 202, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bootsteps.py", line 116, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/bootsteps.py", line 365, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/base.py", line 130, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/prefork.py", line 109, in on_start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/asynpool.py", line 464, in __init__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 1045, in __init__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/concurrency/asynpool.py", line 482, in _create_worker_process
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 1157, in _create_worker_process
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/process.py", line 120, in start
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/context.py", line 331, in _Popen
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/popen_fork.py", line 22, in __init__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/popen_fork.py", line 77, in _launch
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/process.py", line 323, in _bootstrap
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/process.py", line 110, in run
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 291, in __call__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/billiard/pool.py", line 361, in workloop
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/app/trace.py", line 651, in fast_trace_task
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/app/trace.py", line 453, in trace_task
File "/home/vcap/app/app/__init__.py", line 441, in __call__
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/newrelic/hooks/application_celery.py", line 123, in wrapper
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/app/trace.py", line 736, in __protected_call__
File "/home/vcap/app/app/celery/provider_tasks.py", line 63, in check_sms_delivery_receipt
File "/home/vcap/deps/0/python/lib/python3.12/site-packages/celery/app/task.py", line 757, in retry

Implementation Sketch and Acceptance Criteria

Security Considerations

terrazoon commented 1 month ago

There is one use case where this is perfectly legitimate:

  1. We send a message
  2. AWS, for whatever reason, takes more than 5 minutes to write to the log
  3. We retry, and then the event is in the log and everything concludes successfully.

You can see an example of this by searching the logs for "b5df0cc9-6e63-582f-a9be-53d9c736bda2", which is the message_id of a notification that was initially not found but then was found five minutes later.
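
For anyone reproducing that lookup, here is a rough boto3 sketch of searching CloudWatch Logs for a message_id; the log group name, account number, and region are placeholders, not the app's actual configuration:

    import boto3

    logs = boto3.client("logs", region_name="us-east-1")

    message_id = "b5df0cc9-6e63-582f-a9be-53d9c736bda2"
    response = logs.filter_log_events(
        logGroupName="sns/us-east-1/123456789012/DirectPublishToPhoneNumber",  # placeholder
        filterPattern=f'"{message_id}"',  # a quoted term acts as a simple substring match
    )

    for event in response.get("events", []):
        print(event["timestamp"], event["message"])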

terrazoon commented 1 month ago

We added some very important debugging capabilities to the logs during the week of July 4th: we mapped the csv file name to the job_id, the job_id to the notification_id, and the notification_id to the message_id, and ultimately displayed the success or failure result from AWS (if it got that far). That makes this very easy to research starting from that time, and we can see from the logs that since July 4th there have only been a couple of occurrences; they were resolved when the necessary event appeared in the AWS CloudWatch log and we were able to finish things.
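
Purely as illustration of the kind of correlated log line being described (the real field names and logging calls in notifications-api may differ):

    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("delivery-receipts")

    # Placeholder values purely for the example.
    csv_file_name = "send-list.csv"
    job_id = "job-1111"
    notification_id = "notification-2222"
    message_id = "message-3333"
    aws_result = "success"

    logger.info(
        "delivery lookup csv=%s job_id=%s notification_id=%s message_id=%s aws_result=%s",
        csv_file_name, job_id, notification_id, message_id, aws_result,
    )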

What I think was happening prior to that time is simply the known case where a user sends a text message to a landline or fax machine: we fail and retry 48 times. If we send a message to a mobile phone that is out of the area or not accepting calls, we get an immediate response that SNS marks as a failure, but if we send messages to phones or devices that don't have texting capability, we (and AWS) have no way of knowing whether the device is responding, so we get a "temporary failure" and go into retry mode.

I know that extra commas corrupting a CSV file, making it impossible to determine the phone number, will also generate a lot of retries. However, I don't think that's what is going on here, because we specifically see the message "No event found for message_id", which means that AWS gave us a lookup code (the message_id), but when we ask AWS to retrieve that message from the logs, it can't find it. So it sounds like 48 retries due to a landline or fax machine.
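
As a back-of-the-envelope check on what the landline/fax scenario costs us, assuming a fixed 300 s delay between attempts (the 48-retry figure comes from the comment above; the 300 s interval is from the error message):

    MAX_RETRIES = 48
    RETRY_DELAY_SECONDS = 300

    total_seconds = MAX_RETRIES * RETRY_DELAY_SECONDS
    print(f"{MAX_RETRIES} retries x {RETRY_DELAY_SECONDS}s = {total_seconds / 3600:.0f} hours of retrying")
    # -> 48 retries x 300s = 4 hours of retrying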

ccostino commented 1 month ago

Awesome, thanks for investigating, @terrazoon! It sounds like all is functioning as expected given the scenario(s) then.

The only other question I have, then: in your estimation of the code around these, is there any better error handling we ought to put in place to prevent full stack traces from being thrown, or is this the appropriate way to handle the situation?

ccostino commented 2 weeks ago

Caught up in Slack and there's nothing more to be done with this for now. Thanks again, @terrazoon!