department-of-veterans-affairs / notification-api

Notification API

SPIKE for Performance issue with Delivery Status Processing Bug #1236

Closed: jakehova closed this issue 1 year ago

jakehova commented 1 year ago

BACKGROUND

We receive at least two delivery statuses back for every message that goes out. If a service is configured for a callback, we make a POST request to the callback endpoint for each of those delivery statuses as they come in.
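For context, the callback flow is roughly the following. This is only a sketch; the function name, field names, and payload shape are assumptions, not the service's actual code.

```python
# Rough sketch of the per-status callback flow; names and payload shape are assumptions.
import requests


def forward_delivery_status(service, delivery_status):
    """POST one incoming delivery status to the service's configured callback endpoint."""
    if not service.callback_url:  # only services configured for a callback get a POST
        return
    requests.post(
        service.callback_url,
        json={
            "notification_id": delivery_status["notification_id"],
            "status": delivery_status["status"],  # e.g. "sent", then later "delivered"
        },
        timeout=5,
    )
```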

ISSUE

The service is not processing the delivery statuses fast enough. As a result, the queue that feeds the task fills with requests, and callbacks aren't being made until a significant time after the delivery status was received.

@k-macmillan provided some helpful background on his investigation so far: we sent an SMS request at 11 AM and didn't process the delivery status callback until 1 PM (screenshot: image (3).png).

The delivery status Lambda is receiving delivery statuses and processing them quickly (screenshot: image (2).png).

SQS backs up with messages (screenshot: image.png).

From the logs, there don't seem to be any exceptions or errors thrown by the service, the Celery task, or SQS that would explain the issue.

Acceptance Criteria

Out of Scope

  - Implementing a fix

mjones-oddball commented 1 year ago

Updated to high priority, as investigating the cause was determined to be urgent. Timeboxed to a 3.

jakehova commented 1 year ago

Right now, I'm investigating the Celery-SQS relationship as the likely culprit.

The average time between sending an SMS and receiving the first delivery status is about 10 seconds for a "sent" status, and about 10-20 more seconds for a "delivered" status (loosely measured using the timestamp of when I click the send button in Postman and the timestamp of receipt of the first delivery status in Pipedream).

My current theory is that the combination of the polling interval and how many items are dequeued per poll is leading to a natural backup of messages in the queue. I was looking in the Celery documentation for a way to modify the batch size but haven't found it yet. I read about the CELERY_WORKER_PREFETCH_MULTIPLIER value, but I'm not sure we can use that, since the container does multiple things and it may exhaust resources needed for delivery statuses.
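For reference, here's a minimal sketch of where a prefetch setting would live in a Celery app configured against SQS. The canonical lowercase setting name is worker_prefetch_multiplier; the app name, broker options, and values below are assumptions, not this repo's actual configuration.

```python
# Hypothetical Celery/SQS configuration sketch; values are placeholders.
from celery import Celery

app = Celery("notification-api", broker="sqs://")  # credentials come from the environment

app.conf.update(
    # Each worker process reserves (worker_prefetch_multiplier * concurrency) messages
    # at a time; Celery's default multiplier is 4.
    worker_prefetch_multiplier=1,
    broker_transport_options={
        "wait_time_seconds": 20,    # SQS long polling
        "visibility_timeout": 310,  # should exceed the longest expected task runtime
    },
)
```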

What I'm going to look at tomorrow is:

  1. Confirm the batch size of Celery's dequeue from SQS.
  2. Figure out how many Twilio SMS messages were sent over the course of a day.
  3. Use that number to determine a baseline for how many delivery statuses we would've received.
  4. Calculate how long it would take Celery, with a polling interval of 1, to drain the queue (see the rough sketch below).
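As a rough back-of-the-envelope version of step 4, here's how that drain-time calculation could look. Every number below is a placeholder assumption, not a measured value.

```python
# Hypothetical drain-time estimate; all numbers are placeholder assumptions.
sms_sent_per_day = 50_000        # step 2: Twilio SMS sends over one day (assumed)
statuses_per_sms = 2             # at least one "sent" and one "delivered" status
queued_statuses = sms_sent_per_day * statuses_per_sms

worker_processes = 4             # assumed Celery concurrency available for this work
avg_task_seconds = 0.5           # assumed time to make one callback POST

statuses_per_second = worker_processes / avg_task_seconds
drain_hours = queued_statuses / statuses_per_second / 3600
print(f"~{drain_hours:.1f} hours to drain {queued_statuses} delivery statuses")
```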
jakehova commented 1 year ago

I have not confirmed my theory but I am more confident in it.

As expected, the peaks and valleys of SMS sends and delivery receipts follow one another. I believe that workers are being tied up with SMS sends, and while those sends are happening, 2x delivery receipts are coming into the queue.

The delivery receipts queue doesn't get drained quickly because most of the available workers are working on sending SMS. This leads to the peak. When the SMS sends are complete, Celery starts draining the delivery-receipts SQS queue.

Celery dequeues 4 messages per worker process at a time (the default prefetch multiplier, since we don't set it explicitly). This drain is the valley for the queue.

So everything is actually working the way it should. I'm not entirely sure we need to change anything right now, because we could receive these statuses up to 3 days later anyway, so I don't think anybody waiting for a callback expects them urgently.

However, if we decide to change this, then the options are:

  1. Create a new service that has the responsibility of handling callbacks/webhooks. If a service is dedicated to these tasks, they'll happen more quickly.
  2. Create a Lambda that handles processing delivery status callbacks. We could disconnect Celery from the queue and use the queue to trigger the Lambda to handle the callback (see the sketch below). The small downside of this is that we'd have to build in StatsD support that I don't think currently exists in our other Lambdas.
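A minimal sketch of what option 2 could look like as an SQS-triggered Lambda. The handler name, message fields, and retry behavior are assumptions, not an existing implementation.

```python
# Hypothetical SQS-triggered Lambda for option 2; field names are assumptions.
import json
import urllib.request


def handler(event, context):
    """Forward each SQS delivery-status record to its service's callback URL."""
    for record in event.get("Records", []):
        status = json.loads(record["body"])
        request = urllib.request.Request(
            status["callback_url"],  # assumed to be carried in the queued message
            data=json.dumps(status["payload"]).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # Let failures raise so SQS redrives the message per the queue's retry policy.
        urllib.request.urlopen(request, timeout=10)
```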
jakehova commented 1 year ago

FYI: The metrics I really wanted to get were which worker is working on which task at any particular time. We don't have that insight right now, and I think it's something we should build into Datadog so that we can troubleshoot these kinds of issues in the future. As the system grows in functionality and the services handle more tasks, it will be really important. I can get with @ldraney and @k-macmillan to discuss this further.
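One possible way to get that per-worker, per-task visibility is to emit a tagged metric from Celery's task signals. The signal below is a real Celery signal; the StatsD client, metric name, and tags are assumptions for illustration only.

```python
# Sketch: tag task metrics with the worker hostname via Celery signals.
# The Datadog/StatsD client and metric name here are assumptions.
from celery import signals
from datadog import statsd


@signals.task_prerun.connect
def record_task_start(task_id=None, task=None, **kwargs):
    # One counter per task start, tagged with the task name and the worker running it,
    # so a Datadog dashboard can show which worker is working on which task over time.
    statsd.increment(
        "celery.task.started",
        tags=[f"task:{task.name}", f"worker:{task.request.hostname}"],
    )
```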

jakehova commented 1 year ago

Kyle doubled the Celery workers, and it seems to have resolved the issue:

Load increasing (screenshot: lambda_load)

SQS draining quickly (screenshot: sqs_draining)

The fact that this worked seems to confirm my theory. We should discuss whether this increase in workers is a good idea long term or whether breaking callbacks out into a separate service is better. But for now, the issue seems resolved.
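For reference, "doubling the Celery workers" could be expressed either as a concurrency setting or as infrastructure-level scaling of worker containers; the sketch below assumes the former, and the numbers are placeholders.

```python
# Hypothetical: doubling worker processes per container; numbers are placeholders.
# Equivalent to running the worker with the --concurrency=8 CLI flag.
from celery import Celery

app = Celery("notification-api", broker="sqs://")
app.conf.worker_concurrency = 8  # assumed previous value of 4, doubled
```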

mjones-oddball commented 1 year ago

@jakehova Can you please update the acceptance criteria checklist and provide the link to the new ticket?