Closed by ccostino 2 weeks ago
There is one use case where this is perfectly legitimate:
You can see an example of this by searching on "b5df0cc9-6e63-582f-a9be-53d9c736bda2", which is the message_id of a notification that was initially not found but then was found five minutes later.
We added some very important debugging capabilities to the logs during the week of July 4th: we mapped the CSV file name to the job_id, the job_id to the notification_id, and the notification_id to the message_id, and ultimately logged the success or failure result from AWS (if it got that far). That makes this very easy to research starting from that time, and we can see from the logs that since July 4th there have only been a couple of occurrences of this, and they were resolved once the necessary event appeared in the AWS CloudWatch log and we were able to finish processing.
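The chain described above could be captured in a single structured log line per notification, so that searching on any one identifier surfaces all the others. This is a hedged sketch, not the actual Notify code; the function name and field names are illustrative:

```python
import json
import logging

logger = logging.getLogger(__name__)

def log_delivery_chain(csv_file_name, job_id, notification_id,
                       message_id, aws_status):
    """Emit one log line tying csv file -> job -> notification -> message,
    plus the final status reported by AWS (hypothetical helper)."""
    record = {
        "csv_file_name": csv_file_name,
        "job_id": job_id,
        "notification_id": notification_id,
        "message_id": message_id,
        "aws_status": aws_status,  # e.g. "SUCCESS" or "FAILURE" from SNS
    }
    logger.info("delivery_chain %s", json.dumps(record))
    return record
```

With a line like this in place, a search on a message_id in the log aggregator walks straight back to the originating CSV file.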
What I think was happening prior to that time is simply the known case where a user sends a text message to a landline or fax machine: we fail and retry 48 times. If we send a message to a mobile phone that is out of the area or not accepting calls, we get an immediate response that SNS marks as a Failure, but if we send messages to phones or devices that don't have texting capability, neither we nor AWS have a way to know whether the device received anything, so we get a "temporary failure" and go into retry mode.
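The retry behavior described above can be sketched as a small state machine (no Celery, no real sleeping): a definitive SNS Failure stops immediately, while a temporary failure or missing event retries up to 48 times with a 300-second delay between attempts. The constants and function names here are assumptions for illustration, not the actual task code:

```python
MAX_RETRIES = 48
RETRY_DELAY_SECONDS = 300  # matches the "Retry in 300s" seen in New Relic

def plan_retries(events):
    """Given a sequence of per-attempt SNS outcomes (None means no event
    found yet, i.e. a temporary failure), return (attempts_used, outcome),
    mimicking the retry-up-to-48-times behaviour described above."""
    for attempt, event in enumerate(events, start=1):
        if event == "SUCCESS":
            return attempt, "delivered"
        if event == "FAILURE":
            return attempt, "failed"          # immediate, definitive answer
        if attempt >= MAX_RETRIES:
            return attempt, "technical-failure"
        # Temporary failure: wait RETRY_DELAY_SECONDS and try again.
    return len(events), "pending"
```

A landline never produces a delivery event, so all 48 attempts see None and the notification ends in technical failure; the "found five minutes later" case corresponds to a SUCCESS event arriving on a later attempt.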
I know that extra commas corrupting a CSV file and making it impossible to determine the phone number will also generate a lot of retries. However, I don't think that's what is going on here, because we specifically see the message "No event found for message_id", which means that AWS gave us a lookup code (the message_id), but when we ask AWS to retrieve that message from the logs, it can't find it. So this looks like 48 retries due to a landline or fax machine.
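The "ask AWS to retrieve that message from the logs" step could look roughly like the following, assuming delivery receipts land in a CloudWatch log group (the log group name and helper are hypothetical; `filter_log_events` is the real boto3 CloudWatch Logs call). When the event hasn't propagated yet, the search legitimately comes back empty:

```python
def find_delivery_event(logs_client, log_group, message_id):
    """Search a CloudWatch log group for the delivery event of message_id.

    Returns the first matching event, or None -- the "No event found for
    message_id" case, which can happen simply because SNS hands back a
    message_id immediately but writes the delivery receipt minutes later.
    """
    response = logs_client.filter_log_events(
        logGroupName=log_group,
        filterPattern=f'"{message_id}"',
    )
    events = response.get("events", [])
    return events[0] if events else None
```

In production the client would be `boto3.client("logs")`; passing it in as a parameter keeps the sketch testable without AWS credentials.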
Awesome, thanks for investigating, @terrazoon! It sounds like all is functioning as expected given the scenario(s) then.
The only other question I have, then, is: in your estimation of the code around these errors, is there any better error handling we ought to put in place to prevent full stack traces from being thrown, or is the current handling appropriate for this situation?
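One possible answer to the question above, sketched under assumptions (the exception class and function here are illustrative, not the actual code): while retries remain, treat the missing event as an expected condition and log a warning rather than raising, so New Relic only sees a stack trace once retries are exhausted and the failure is genuine:

```python
import logging

logger = logging.getLogger(__name__)

class NoEventFoundError(Exception):
    """Raised when CloudWatch still has no delivery event after all retries."""

def handle_missing_event(message_id, notification_id, retries_left):
    """Decide what to do when no event is found for a message_id yet."""
    if retries_left > 0:
        # Expected while the event propagates: log quietly and retry later.
        logger.warning(
            "No event found for message_id %s notification_id %s; "
            "retrying (%d retries left)",
            message_id, notification_id, retries_left)
        return "retry"
    # Out of retries: now it is a real technical failure worth a stack trace.
    raise NoEventFoundError(
        f"No event found for message_id {message_id} "
        f"notification_id {notification_id}")
```

This keeps the error signal (the final exception) while removing the 48 intermediate stack traces from the monitoring noise.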
Caught up in Slack and there's nothing more to be done with this for now. Thanks again, @terrazoon!
This is one of the errors we've seen captured in New Relic that we'd like to dig into and understand, if not also resolve.
Error message: Retry in 300s Exception: celery.exceptions:Retry
There's a second, related error to this one:
Error message: Retry in 300s: NotificationTechnicalFailureException('No event found for message_id XXXXXXX notification_id XXXXXX')
Implementation Sketch and Acceptance Criteria
Security Considerations