department-of-veterans-affairs / abd-vro

To get Veterans benefits in minutes, VRO software uses health evidence data to help fast track disability claims.

Custom Monitoring for Rabbit MQ and K8s termination grace period #2817

Closed by msnwatson 2 weeks ago

msnwatson commented 5 months ago

User Story

As a VRO engineer, I want to take action on the findings from #2816 and close any availability or monitoring gaps, so that the platform's ability to offer service uptime for partners is improved.

Acceptance Criteria

  1. This ticket only requires instrumenting request latency metrics for RabbitMQ.
  2. Request latency metrics for all services in VRO are analyzed and used to set K8s termination grace periods for all of our services.
  3. Solutions for any problems identified in #2816 are implemented and documented as necessary.
  4. How would we test this? After deploying to lower environments, these metrics should be visible in Datadog.

msnwatson commented 1 month ago

We'll need to take a slightly different direction than the ticket originally planned. It does not look like RabbitMQ exposes a standard metric for the time messages sit in a queue.

Rather, I think we need to do two things: set a TTL on messages in our queues, and collect metrics and alert on spikes in messages expiring within our queues. This will still allow us to set a termination grace period in a principled way while maintaining visibility into any negative effects of this policy change.
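
A minimal sketch of the TTL idea in Python, not VRO's actual configuration. RabbitMQ accepts a per-queue TTL through the `x-message-ttl` queue argument (in milliseconds). When a dead-letter exchange is configured, expired messages are republished with an `x-death` header whose `reason` is `expired`, which gives something to count and alert on. The exchange name `vro.expired`, the 30-second TTL, and both helper functions are hypothetical names chosen for illustration.

```python
from typing import Any, Dict, Mapping, Optional


def ttl_queue_arguments(ttl_ms: int, dead_letter_exchange: str) -> Dict[str, Any]:
    """Build queue arguments that expire messages after ttl_ms milliseconds
    and route the expired messages to a dead-letter exchange."""
    if ttl_ms < 0:
        raise ValueError("x-message-ttl must be a non-negative integer")
    return {
        "x-message-ttl": ttl_ms,
        "x-dead-letter-exchange": dead_letter_exchange,
    }


def is_expired(headers: Optional[Mapping[str, Any]]) -> bool:
    """Return True if a dead-lettered message was dead-lettered because its TTL expired."""
    deaths = (headers or {}).get("x-death") or []
    return any(entry.get("reason") == "expired" for entry in deaths)


# Hypothetical values: a 30-second TTL and an exchange named "vro.expired".
arguments = ttl_queue_arguments(30_000, "vro.expired")

# Example header as it might appear on a message consumed from the dead-letter queue.
sample_headers = {"x-death": [{"queue": "some-queue", "reason": "expired", "count": 1}]}
expired = is_expired(sample_headers)
```

A consumer bound to the dead-letter exchange could increment a counter whenever `is_expired` returns true, and an alert on spikes in that counter would cover the second half of the plan.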

lisac commented 1 month ago

status: still working on resolving the blocker; details added to #3261. In summary: BIP metrics should now be visible, and I see Wednesday, Aug 7 as a stretch goal for getting the metrics working on the other apps.

lisac commented 1 month ago

status: last round of changes on the blocker expected to be deployed 8/12.

lisac commented 1 month ago

status: I believe the blocker has been addressed. Request duration is being logged under these metrics: vro_xample_workflows.request_duration, vro_bie_kafka.request_duration, vro_bip.request_duration
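
One possible way to turn these request durations into a termination grace period, sketched in Python under stated assumptions: take a high percentile of observed durations, add a safety margin, and round up to whole seconds, since the Kubernetes `terminationGracePeriodSeconds` field is an integer. The nearest-rank percentile, the 99th-percentile default, the 5-second margin, and the sample durations are all illustrative choices, not values from this ticket.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of values, with pct in the range (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def termination_grace_period_seconds(
    durations_s: Sequence[float], pct: float = 99.0, margin_s: float = 5.0
) -> int:
    """Suggest a grace period: chosen percentile of request duration plus a margin."""
    return math.ceil(percentile(durations_s, pct) + margin_s)


# Made-up request durations in seconds, standing in for exported metric samples.
sample_durations = [0.2, 0.4, 1.1, 2.5, 9.7]
suggested = termination_grace_period_seconds(sample_durations)
```

The resulting integer would be set per service as `terminationGracePeriodSeconds` in the pod spec, using that service's own duration samples.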

msnwatson commented 1 month ago

Per this thread (https://dsva.slack.com/archives/C04QLHM9LR0/p1723848302659089), I took a slightly different approach than the one described in the ticket, based on the metrics I was seeing. Still blocked on getting RabbitMQ metrics.