This is not a new issue, but an enhancement. AWS has been returning SMS status updates with an event_type of _SMS.FAILURE and a record_status of UNKNOWN (and a few others). Our system marks these as temporary-failure, which is exactly what AWS says these failures mean, but there is no retry in place. It appears the intent was for clients to retry themselves. That is a poor experience for clients, so we should build retry functionality for these cases. This happens at a rate of about 500 per day.
[ ] Ticket is understood, and QA has been contacted (if the ticket has a QA label).
User Story(ies)
As a Service
I want to not have to retry when VA Notify experiences a temporary failure after sending me a 201 response
So that my messages make it to the recipient
Additional Info and Resources
This is happening when we get status updates from AWS. That means the message has already been sent and we cannot use any of our existing retry functionality to resolve this. deliver_sms was successful and we have a provider reference. AWS attempted to send the message but failed and they expect us to retry. We get this notice during delivery status processing.
UNKNOWN – An error occurred that prevented the delivery of the message. This error is usually transient, and you can attempt to send the message again later.
So whenever the event_type is “_SMS.FAILURE” and record_status is any of “UNREACHABLE”, “UNKNOWN”, “CARRIER_UNREACHABLE”, “EXPIRED”. [...] customer application need to retry the SMS message delivery.
The send_notification_to_queue method is capable of retrying with minimal work; it just needs the notification object and whether the service is in research mode.
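The retry condition quoted above can be sketched as a small predicate. This is a minimal illustration, not the existing implementation: the helper name is hypothetical, and the event shape (a top-level event_type with record_status nested under attributes) is an assumption about how the status update is parsed.

```python
# Hypothetical helper, not part of the current codebase.
# Assumption: the parsed status update is a dict with a top-level
# "event_type" and a nested "attributes" dict holding "record_status".

RETRYABLE_RECORD_STATUSES = frozenset(
    {"UNREACHABLE", "UNKNOWN", "CARRIER_UNREACHABLE", "EXPIRED"}
)


def is_retryable_sms_failure(event: dict) -> bool:
    """Return True only for _SMS.FAILURE events with a retryable record_status."""
    if event.get("event_type") != "_SMS.FAILURE":
        return False
    record_status = event.get("attributes", {}).get("record_status")
    return record_status in RETRYABLE_RECORD_STATUSES
```

Keeping the statuses in one constant means the list named in this ticket lives in a single place if AWS adds or changes values later.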
Engineering Checklist
[ ] Retries are triggered if the event_type is _SMS.FAILURE and the record_status is one of the following: UNREACHABLE, UNKNOWN, CARRIER_UNREACHABLE, EXPIRED
[ ] A mechanism is implemented to prevent infinite retries
[ ] Unit tests cover all expected cases
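One possible shape for the infinite-retry guard is a per-notification attempt counter checked before requeueing. The sketch below is an assumption-heavy illustration: MAX_SMS_RETRIES, the in-memory attempts_store (standing in for wherever the count would really be persisted), and the requeue callable (standing in for send_notification_to_queue, whose real signature is not assumed here) are all hypothetical.

```python
from typing import Callable, Dict

# Assumed cap; the real limit is a design decision for this ticket.
MAX_SMS_RETRIES = 2


def handle_retryable_failure(
    notification_id: str,
    attempts_store: Dict[str, int],
    requeue: Callable[[str], None],
    max_retries: int = MAX_SMS_RETRIES,
) -> str:
    """Requeue the notification unless its retry budget is used up.

    attempts_store is a stand-in for persistent storage of the retry count.
    requeue is a stand-in for the real requeue call.
    """
    previous_attempts = attempts_store.get(notification_id, 0)
    if previous_attempts >= max_retries:
        # Budget exhausted: the caller should settle on a final status.
        return "exhausted"
    # Record the attempt before requeueing so a repeated callback
    # cannot exceed the cap.
    attempts_store[notification_id] = previous_attempts + 1
    requeue(notification_id)
    return "requeued"
```

Returning a small status string keeps the function easy to cover in unit tests: one case per branch, plus a check that the requeue callable is invoked at most max_retries times.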
Acceptance Criteria
[ ] event_type of _SMS.FAILURE is retried if the following record_status values are present: UNREACHABLE, UNKNOWN, CARRIER_UNREACHABLE, EXPIRED
[ ] There is a mechanism in place to limit the number of retries
[ ] Status for _SMS.FAILURE and the mentioned record_status values is not set to temporary-failure, so clients are not confused about the state of their notification
QA Considerations
Given that the message comes from AWS and we are only responding to it, there is no way to trigger these events ourselves. Testing will involve feeding fake data to lambda_functions/pinpoint_callback/pinpoint_callback_lambda.py and confirming that the retries happen.
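Fake data for this kind of test could be generated along these lines. This is only a sketch: the builder name, the placeholder message_id, and the payload fields are assumptions, and the payload would still need to be wrapped in whatever envelope the lambda actually expects, which is not shown here.

```python
import json

# Record statuses named in this ticket, plus one control value that
# should NOT trigger a retry (the control value is illustrative).
RETRYABLE_STATUSES = ["UNREACHABLE", "UNKNOWN", "CARRIER_UNREACHABLE", "EXPIRED"]
CONTROL_STATUS = "BLOCKED"


def build_fake_sms_failure_event(record_status: str, message_id: str = "fake-provider-reference") -> dict:
    """Build a minimal fake _SMS.FAILURE payload (field layout is assumed)."""
    return {
        "event_type": "_SMS.FAILURE",
        "attributes": {
            "record_status": record_status,
            "message_id": message_id,
        },
    }


# One fake event per retryable status, followed by the control case.
fake_events = [build_fake_sms_failure_event(s) for s in RETRYABLE_STATUSES]
fake_events.append(build_fake_sms_failure_event(CONTROL_STATUS))

# Serialized form, ready to be wrapped and fed to the handler under test.
serialized_events = [json.dumps(event) for event in fake_events]
```

Including a control status alongside the four retryable ones lets QA confirm both that retries happen and that they do not happen for other failures.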