department-of-veterans-affairs / notification-api

Notification API
MIT License
16 stars 9 forks source link

BUG: Email Delivered Status with Failure Status Reason #1630

Closed mjones-oddball closed 9 months ago

mjones-oddball commented 9 months ago

Description

Beverly was pulling stats in BigQuery and Domo and noticed that some delivered notifications are showing failure status reasons.

Steps to Reproduce

  1. Go to Prod Statistics in BigQuery
  2. Find all instances of notifications where the status is delivered and the status reason isn't blank
    SELECT
    date,
    service_name,
    template_name,
    status,
    status_reason
    FROM
    `vsp-analytics-and-insights.platform_vanotify.prod-statistics`
    WHERE
    status = "delivered"
    AND status_reason IS NOT null
  3. See Results

Example results below

image.png

You can tell that this is only happening with hard and soft bounces by returning a distinct status_reason list.

SELECT
  DISTINCT status_reason
FROM
  `vsp-analytics-and-insights.platform_vanotify.prod-statistics`
WHERE
  status = "delivered"
  AND status_reason IS NOT null

Results:

Impact/Urgency

High - We cannot know If notifications are inaccurately being set to delivered until we investigate the root cause of the status/status reason combination.

Expected Behavior

3 pt limit, please keep the team informed if this becomes more complex than expected so we can re-evaluate next steps/additional tickets

Checklist

QA Considerations

For QA to populate. Leave blank if QA is not applicable on this ticket.

We don't yet know how this was caused to know how to replicate it.

Additional Info & Resources

mjones-oddball commented 9 months ago

Hey team! Please add your planning poker estimate with Zenhub @cris-oddball @EvanParish @k-macmillan @kalbfled @ldraney @mchlwellman

mjones-oddball commented 9 months ago

@bevnobev @cris-oddball Update on this bug: @k-macmillan and I discussed. SES does async processing. Soft bounces are retried by SES automatically. We trust the delivered status and should ensure it's honored in this case. In the hard bounce case only DNS lookup failures are retried by SES so we are a little unclear if we can/should trust delivered and will escalate a production impact ticket to Amazon for them to confirm before we decide. It's likely the same case as soft bounce due to async processing, but just want to be 100% as hard bounces are usually permanent failures. See more info here: https://docs.aws.amazon.com/ses/latest/dg/send-email-concepts-deliverability.html and

Bounces can also be synchronous or asynchronous. A synchronous bounce occurs while the email servers of the sender and receiver are actively communicating. An asynchronous bounce occurs when a receiver initially accepts an email message for delivery and then subsequently fails to deliver it to the recipient.

k-macmillan commented 9 months ago

Identified where in the code this happening. We get a status of delivered, then we get a bounce status update from AWS. We usually consider delivered to be the final resting point of a message, but that is not always the case. Will discuss with @mjones-oddball tomorrow.

image.png
k-macmillan commented 9 months ago

Hard bounces are confirmed failures and should have their status changed to permanent-failure. I have asked for clarification on soft bounces so we can determine a path forward for those.

k-macmillan commented 9 months ago

Soft bounces may also be considered an end-state for that notification attempt. Handled both hard/soft bounces and prevented updating the status back to delivered if there is a bounced status, per AWS discussion.

cris-oddball commented 9 months ago

PR approved and merged. Due to the complexity of integration testing, replying on unit tests for this one. Sent to Perf, where a regression will be run.

cris-oddball commented 9 months ago

Regression passed. @k-macmillan @mjones-oddball are we doing any data clean-up on a second ticket or on this ticket? If a second ticket I will hold this here in QA until I get the new ticket linked here.

cris-oddball commented 9 months ago

No need to hold this ticket for the clean-up. Closing as done.