department-of-veterans-affairs / abd-vro

To get Veterans benefits in minutes, VRO software uses health evidence data to help fast track disability claims.

Incident Reported: RabbitMQ is processing messages with no TTL #3181

Closed bianca-rivera closed 1 month ago

bianca-rivera commented 1 month ago

July 11th, 2024 at 7:36 PM UTC Reported by: Employee Experience (EE)

Description: EP Merge sends requests to BIP/BGS. If those services are down, the messages remain queued and are reprocessed when the service comes back up, resulting in duplicate or unnecessary requests to downstream services.

Replication: Discovered while investigating a BGS outage during which two claims had multiple duplicate notes added. The EP Merge job had completed in error, and EP Merge had no knowledge that the claim note was eventually added.

Application non-functional: No

Relevant Links:

dfitchett commented 1 month ago

There are two ways to address this issue. The first is a temporary fix and should be removed once the more permanent fix is in place:

  1. Temporarily set the per-message TTL to 0 on the queues using a policy applied from the command line of the rabbitmq pod in prod. This does not require a redeploy of EP Merge or any other application. Applying the policy can be complicated when there are already messages in the queue (see here).
  2. Update lib-hoppy to allow clients to set the per-message TTL through the expiration property when publishing a message, then update EP Merge to use the updated version of hoppy and redeploy.
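As a rough illustration of option 2, the sketch below shows how a client might attach a per-message TTL at publish time. The helper name `build_message_properties` and its `ttl_ms` parameter are hypothetical and are not taken from lib-hoppy. The only RabbitMQ-specific detail relied on is that the `expiration` message property carries the TTL in milliseconds as a string.

```python
from typing import Any, Dict, Optional


def build_message_properties(ttl_ms: Optional[int] = None, **extra: Any) -> Dict[str, Any]:
    """Return AMQP basic properties as a dict, adding `expiration` when a TTL is given.

    Hypothetical helper for illustration only; it is not the lib-hoppy API.
    """
    props = dict(extra)
    if ttl_ms is not None:
        if ttl_ms < 0:
            raise ValueError("ttl_ms must be a non-negative integer")
        # RabbitMQ expects the per-message TTL in milliseconds, encoded as a
        # string, in the `expiration` property (e.g. "60000" for one minute).
        props["expiration"] = str(ttl_ms)
    return props


if __name__ == "__main__":
    # A TTL of 0 means the message is dropped unless it can be delivered
    # to a consumer immediately.
    print(build_message_properties(ttl_ms=0, content_type="application/json"))
```

Assuming a pika-style client, the resulting dict could be passed as `pika.BasicProperties(**props)` in the `properties` argument of `basic_publish`. Leaving `ttl_ms` unset keeps the current behavior, so existing callers would not need to change.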

Some useful links:

nelsestu commented 1 month ago

I met with Derek this morning and we discussed both the short-term and long-term solutions to the issue. Because the long-term fix is already awaiting merge approval, we decided to prioritize deploying it. With any luck we'll get it deployed today and avoid the multi-step short-term solution, which would involve: 1. connecting to production rabbitmq via terminal, and 2. applying the temporary policy. Even after applying the temporary policy, we would still need to merge and deploy the long-term solution and then disable the short-term policy. Hopefully it is clear why we are aiming to proceed with the long-term solution only. Derek is going to let me know once his PR has been merged, and I will then proceed with the deployment.
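For reference, the short-term steps described above could look roughly like the following sketch, which only builds the `rabbitmqctl` argument lists and does not run them. The policy name `temp-message-ttl` and the queue pattern `.*` are placeholders chosen for illustration; the actual names and patterns used in prod are not given in this thread.

```python
import json
from typing import List


def set_ttl_policy_command(name: str, queue_pattern: str, ttl_ms: int) -> List[str]:
    """Build the rabbitmqctl call that applies a message-ttl policy to matching queues."""
    # `message-ttl` is the policy key RabbitMQ uses for a queue-wide
    # message TTL, expressed in milliseconds.
    definition = json.dumps({"message-ttl": ttl_ms})
    return ["rabbitmqctl", "set_policy", "--apply-to", "queues",
            name, queue_pattern, definition]


def clear_policy_command(name: str) -> List[str]:
    """Build the rabbitmqctl call that removes the temporary policy again."""
    return ["rabbitmqctl", "clear_policy", name]


if __name__ == "__main__":
    # Placeholder policy name and pattern; adjust to the real queue names.
    print(" ".join(set_ttl_policy_command("temp-message-ttl", ".*", 0)))
    print(" ".join(clear_policy_command("temp-message-ttl")))
```

Keeping the set and clear commands side by side makes the cleanup step explicit, since the temporary policy would have to be removed once the hoppy-based fix is deployed.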

dfitchett commented 1 month ago

Current PR waiting for approval:

nelsestu commented 1 month ago

Finally got this fix deployed to production after some SecRel delays.