Updated to high priority as it was determined to be urgent to investigate the cause. Timeboxed to a 3.
Right now, I'm investigating the Celery-SQS relationship as the likely culprit.
The average time between sending an SMS and receiving the first delivery status is about 10 seconds for a "sent" status, with another 10-20 seconds for a "delivered" status (loosely measured using the timestamp of when I click the send button in Postman and the timestamp of the first delivery status received in Pipedream).
My current theory is that the combination of polling and how many items are dequeued during each poll is leading to a natural backup of messages in the queue. I looked in the Celery documentation for a way to modify the batch size and haven't found one yet. I read about the CELERY_WORKER_PREFETCH_MULTIPLIER setting, but I'm not sure we can use it, considering the container does multiple things and it may exhaust resources for delivery statuses.
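For reference, if we did decide to experiment with it, the change would look roughly like the sketch below. This is a minimal sketch only; the app name, region, and transport options are placeholders, not our actual config.

```python
# A minimal sketch, assuming we tune the prefetch multiplier at the app level.
# "notification_app" and the transport options shown are placeholders.
from celery import Celery

app = Celery("notification_app")

app.conf.update(
    # Default is 4: each worker process reserves up to 4 messages from SQS.
    # Lowering it to 1 would hand delivery-receipt tasks to idle processes
    # sooner, at the cost of more frequent polling.
    worker_prefetch_multiplier=1,
    # Existing SQS broker options would stay as they are; shown for context only.
    broker_transport_options={"region": "us-east-1", "wait_time_seconds": 20},
)
```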
What I'm going to look at tomorrow is:
I have not confirmed my theory, but I am more confident in it now.
As expected, the peaks and valleys of SMS sends and delivery receipts follow one another. I believe workers are being tied up with SMS sends, and while those sends are happening, 2x as many delivery receipts are coming into the queue.
The delivery-receipts queue doesn't get drained quickly because most of the available workers are busy sending SMS; this creates the peak. When the SMS sends are complete, Celery starts draining the delivery-receipts SQS queue.
Celery dequeues 4 messages per worker process at a time (this is the default, since we don't set the prefetch multiplier explicitly). This is the valley for the queue.
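For a rough sense of scale (the worker count here is illustrative, not a measured figure): with the default prefetch of 4 and, say, 8 worker processes, Celery keeps at most 4 × 8 = 32 delivery receipts reserved at any one time, so a backlog of a few hundred receipts takes many rounds of polling to drain even after the SMS sends finish.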
So everything is actually working the way it should. I'm not entirely sure we need to change anything right now because we could get these statuses up to 3 days later anyway, so I don't think that anybody waiting for a callback is expecting them urgently.
However, if we decide to change this, then the options are:
FYI: The metrics I really wanted were which worker is working on which task at any given time. We don't have that insight right now, and I think it's something we should build into DataDog so we can troubleshoot these kinds of issues in the future. As the system grows in functionality and the services handle more tasks, it will be really important. I can get with @ldraney and @k-macmillan to discuss this further.
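As a stopgap until we have that in DataDog, one way to get per-worker task visibility is to log it from Celery's task signals. This is a rough sketch; the logger name and log fields are assumptions, not existing code.

```python
# Rough sketch: log which worker runs which task via Celery signals, so the
# data lands in our logs (and could be shipped to DataDog from there).
import logging
import socket

from celery.signals import task_postrun, task_prerun

logger = logging.getLogger("celery.task_audit")


@task_prerun.connect
def log_task_start(task_id=None, task=None, **kwargs):
    # Fires on the worker process just before the task body executes.
    logger.info("worker=%s start task=%s id=%s", socket.gethostname(), task.name, task_id)


@task_postrun.connect
def log_task_end(task_id=None, task=None, state=None, **kwargs):
    # Fires after the task returns; state is e.g. SUCCESS, FAILURE, RETRY.
    logger.info("worker=%s end task=%s id=%s state=%s", socket.gethostname(), task.name, task_id, state)
```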
Kyle doubled the Celery workers and it seems to have resolved the issue:
Load increasing: [screenshot]
SQS draining quickly: [screenshot]
The fact that this worked seems to confirm my theory. We should discuss whether this increase in workers is a good idea long term or whether breaking callbacks out into a separate service is better. But for now, the issue seems resolved.
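If the doubling was applied through Celery's worker concurrency setting (rather than, say, the ECS task count), it would look roughly like the sketch below; the app name and numbers are illustrative, not the actual change.

```python
# Illustrative only: one way "doubling the workers" could be expressed.
# "notification_app" and the numbers are placeholders.
from celery import Celery

app = Celery("notification_app")

app.conf.worker_concurrency = 8  # previously 4: twice the worker processes per container
```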
@jakehova Can you please update the acceptance criteria checklist and provide the link to the new ticket?
BACKGROUND
We receive at least two delivery statuses back for every message that goes out. If a service is configured for a callback, we make a POST request to its callback endpoint for each of those delivery statuses as they come in.
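For illustration, the callback step described above amounts to something like the following task. This is a sketch only; the task name, payload shape, auth scheme, and retry policy are assumptions, not our actual implementation.

```python
# Sketch of the callback step described above, not the exact task.
import requests
from celery import shared_task


@shared_task(bind=True, max_retries=3)
def send_delivery_status_callback(self, callback_url, bearer_token, delivery_status):
    """POST one delivery status (e.g. 'sent' or 'delivered') to the service's callback URL."""
    try:
        response = requests.post(
            callback_url,
            json=delivery_status,
            headers={"Authorization": f"Bearer {bearer_token}"},
            timeout=10,
        )
        response.raise_for_status()
    except requests.RequestException as exc:
        # Back off and let Celery retry; after max_retries the failure is surfaced.
        raise self.retry(exc=exc, countdown=30)
```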
ISSUE
The service is not processing the delivery statuses fast enough. As a result, the queue that feeds the task fills up with requests, and callbacks aren't being made until a significant time after the delivery status was received.
@k-macmillan provided some helpful background from his investigation so far. We sent an SMS request at 11am and didn't process the delivery status callback until 1pm: [screenshot]
The delivery status Lambda is receiving delivery statuses and processing them quickly: [screenshot]
SQS backs up with messages: [screenshot]
From the logs, there don't appear to be any exceptions or errors thrown by the service, the Celery task, or SQS that would explain the issue.
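One broker-side way to confirm the backup, independent of the application logs, is to read the queue's approximate message counts directly. A sketch is below; the queue name and region are placeholders.

```python
# Sketch: confirm the backlog from the broker side by reading SQS's
# approximate message counts. Queue name and region are placeholders.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

queue_url = sqs.get_queue_url(QueueName="delivery-receipts")["QueueUrl"]
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateNumberOfMessages", "ApproximateNumberOfMessagesNotVisible"],
)["Attributes"]

print("waiting in queue:", attrs["ApproximateNumberOfMessages"])
print("in flight (reserved by workers):", attrs["ApproximateNumberOfMessagesNotVisible"])
```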
Acceptance Criteria
Out of Scope
Implementing a fix