CDCgov / prime-simplereport

SimpleReport is a fast, free, and easy way for COVID-19 testing facilities to report results to public health departments.
https://simplereport.gov
Creative Commons Zero v1.0 Universal

Create a function that will kill queued items with a dequeue_count over a certain threshold #3324

Open emmastephenson opened 2 years ago

emmastephenson commented 2 years ago

Update as of 4/23/2024: This is now a SPIKE to determine whether this is still an issue and something we should implement. Investigate what would cause a message to be dequeued more than once and whether we need to add error handling.

Background

Part of the response to the 1/28-1/30 incident. During that incident, we knew that all messages that had ever been queued had really been successfully uploaded to ReportStream, but the only way to clear them was to manually remove individual messages. It would be preferable to have some kind of function that can remove all messages with a dequeue count above a certain threshold.

Action requested

Add a function (or script) to remove all messages from the queue whose dequeue count is over a certain threshold. The threshold should be a parameter passed to the function.
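A minimal sketch of the selection step, assuming each queued message exposes a `messageId` and a `dequeueCount`. The `QueuedMessage` shape and the `partitionByDequeueCount` name are hypothetical, not taken from the SimpleReport codebase. It only decides which messages to remove; the actual deletion would go through whatever queue client the function already uses.

```typescript
// Hypothetical shape of a queued message; only the fields this sketch needs.
interface QueuedMessage {
  messageId: string;
  dequeueCount: number;
}

// Splits messages into those whose dequeueCount is strictly above the
// threshold (candidates for removal) and those that should stay queued.
function partitionByDequeueCount(
  messages: QueuedMessage[],
  threshold: number
): { toRemove: QueuedMessage[]; toKeep: QueuedMessage[] } {
  if (!Number.isInteger(threshold) || threshold < 0) {
    throw new Error("threshold must be a non-negative integer");
  }
  const toRemove: QueuedMessage[] = [];
  const toKeep: QueuedMessage[] = [];
  for (const message of messages) {
    if (message.dequeueCount > threshold) {
      toRemove.push(message);
    } else {
      toKeep.push(message);
    }
  }
  return { toRemove, toKeep };
}
```

Returning both lists rather than deleting in place would let an operator review `toRemove` during an incident before anything is actually removed.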

Additional context

We probably don't need this as a GitHub action (it hopefully won't be used frequently enough to warrant that), but we will need an easy way to run this during an incident, and it should get a playbook.

emmastephenson commented 1 year ago

This came up as an issue again with the recent invalid character upload issue. The specific ask here is that after 10 tries, the message should be moved to a separate "parking lot" queue.
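One way the "parking lot" ask could look, sketched with in-memory arrays standing in for the two queues. This assumes "after 10 tries" means a `dequeueCount` of 10 or more. The `RetryableMessage` shape, `MAX_TRIES`, and `moveToParkingLot` are hypothetical names for illustration only.

```typescript
// Hypothetical message shape for this sketch.
interface RetryableMessage {
  messageId: string;
  messageText: string;
  dequeueCount: number;
}

// Assumption: "after 10 tries" is read as dequeueCount >= 10.
const MAX_TRIES = 10;

// Moves messages that have used up their tries into the parking lot and
// returns the messages that remain on the main queue. Arrays stand in for
// the real queues here.
function moveToParkingLot(
  mainQueue: RetryableMessage[],
  parkingLot: RetryableMessage[],
  maxTries: number = MAX_TRIES
): RetryableMessage[] {
  const remaining: RetryableMessage[] = [];
  for (const message of mainQueue) {
    if (message.dequeueCount >= maxTries) {
      parkingLot.push(message);
    } else {
      remaining.push(message);
    }
  }
  return remaining;
}
```

Parked messages are kept rather than dropped, so they can still be inspected or replayed once the underlying problem is fixed.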

emyl3 commented 2 weeks ago

When we dequeue a message by calling receiveMessage, the message becomes invisible on the queue for 1 hour (3600 seconds). If the message is sent to RS successfully, we delete it from the queue.

If the response is a 400 error back from RS, then we publish that message to an error queue and delete it from the original queue. If there is a different error or issue, the message reappears on the queue after 1 hour (which increments the dequeueCount for that message) and we try to send that message to RS again. When we delete the message, we log if we have dequeued a message more than once.
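The handling described above can be sketched as a small decision function. This is a simplified model, not the actual function code: the `UploadOutcome` and `QueueAction` types and both function names are hypothetical, and the queue calls themselves are left out.

```typescript
// Hypothetical model of what can happen when a message is sent to RS.
type UploadOutcome =
  | { kind: "success" }
  | { kind: "httpError"; status: number }
  | { kind: "otherFailure"; reason: string };

type QueueAction = "delete" | "publishToErrorQueueAndDelete" | "leaveForRetry";

// The message stays hidden for this long after it is received.
const VISIBILITY_TIMEOUT_SECONDS = 3600;

// Maps the upload outcome to the queue action described in the comment above.
function decideQueueAction(outcome: UploadOutcome): QueueAction {
  switch (outcome.kind) {
    case "success":
      return "delete";
    case "httpError":
      // Only a 400 goes to the error queue; any other error is retried
      // once the visibility timeout expires.
      return outcome.status === 400 ? "publishToErrorQueueAndDelete" : "leaveForRetry";
    case "otherFailure":
      return "leaveForRetry";
  }
}

// True when the message has been received more than once, which is the
// condition that gets logged on delete.
function wasDequeuedMoreThanOnce(dequeueCount: number): boolean {
  return dequeueCount > 1;
}
```

In this model, a failure such as the auth token error would fall under `otherFailure`, so the message would simply reappear after `VISIBILITY_TIMEOUT_SECONDS` with its dequeue count incremented.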

Within the 3-month period from 3/10/2024 to 6/10/2024, there were 46 instances of a message being dequeued more than once. In every one of those instances, the message was dequeued at most 2 times. One example of why a message failed to upload to RS the first time and had to be dequeued again is the following error from RS:

Message: Error while trying to get the ReportStream auth token.

References

Thank you @mehansen and @DanielSass for helping me with this! Please let me know if I need to edit anything!

emyl3 commented 2 weeks ago

@DanielSass done with my comment. Will move this back to "Bugs and Tech Debt" column and add a "Needs refinement" tag.