department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Replace anomalous monitors #78563

Closed: rmtolmach closed this issue 6 months ago

rmtolmach commented 8 months ago

Summary

It was brought to our attention by Kyle that anomaly monitors can be flaky and shouldn't be used. A number of new anomaly monitors were recently added for the Platform Product team. Find them by searching the devops repo for anomalies (not all of the results are ours, though). We need to switch to a different monitor type in Datadog (a metric alert).

Anomaly monitors are based on learning what's considered "normal" behavior. In dynamic environments where "normal" is constantly changing (external services, in our case), anomaly monitors can generate a high rate of false positives, making them too noisy for behavior that's actually expected.
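For illustration, the two query shapes look roughly like this in Datadog (the metric name and thresholds below are placeholders, not one of our real monitors):

Anomaly monitor (alerts when the metric deviates from the learned "normal" band):
avg(last_4h):anomalies(avg:vets_api.example.metric{env:prod}, 'basic', 2) >= 1

Metric alert (alerts on a fixed, explicit threshold):
avg(last_5m):avg:vets_api.example.metric{env:prod} > 100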

Tasks

Success Metrics

All of the monitors our team owns that were anomaly monitors have been replaced with metric monitors and are in use (not muted).

Acceptance criteria

rmtolmach commented 7 months ago

It looks like it's not possible to edit an anomaly monitor in place. When you click edit and change the detection method, Datadog treats it as a new monitor. So I'm creating new monitors instead. The one I'm currently working on is the Vets API scaling monitor (replacing this one).

jennb33 commented 7 months ago

One monitor has been started, and due to support coverage, this ticket is being moved out to Sprint 52.

rmtolmach commented 7 months ago

We have this non-anomalous monitor: VETS API - PROD - Check HPA Metrics, and this anomalous monitor: VETS API - PROD - EKS IS NOT SCALING!, and they appear to do similar things. I think we can delete the anomalous one and keep the same coverage.

jennb33 commented 6 months ago

Per @rmtolmach, there are about a dozen monitors left to replace.

rmtolmach commented 6 months ago

ClamAV Errors

Update on the Prod and Staging ClamAV monitors I created. The Staging one seems fine. The Prod one triggers during the vets-api deploy at 1pm ET and then resolves itself. The warnings it picks up are:

  1. WARNING: remote_cvdhead: Malformed CVD header (too short)
  2. WARNING: DNS Update Info disabled. Falling back to HTTP mode.
  3. WARNING: Cool-down expired, ok to try again.

Those same three warnings also occur during non-deploy times, just in smaller numbers. I think we're safe to ignore them. We can set up a recurring 30-minute mute covering the deploy window.
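A minimal sketch of what that recurring mute could look like, assuming we manage it with the Datadog Terraform provider's datadog_downtime resource (the resource name, monitor reference, and dates are placeholders):

# Hypothetical sketch: daily 30-minute mute around the 1pm ET vets-api deploy.
resource "datadog_downtime" "clamav_prod_deploy_mute" {
  scope      = ["env:prod"]
  monitor_id = datadog_monitor.clamav_errors_prod.id # placeholder monitor reference

  # First occurrence; repeats via the recurrence block below.
  start_date = "2025-01-06T13:00:00-05:00"
  end_date   = "2025-01-06T13:30:00-05:00"
  timezone   = "America/New_York"

  recurrence {
    type   = "days"
    period = 1
  }

  message = "Expected ClamAV warnings during the daily vets-api deploy."
}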

rmtolmach commented 6 months ago

High CPU Usage

The query we're using is avg:container.cpu.usage{kube_namespace:vets-api} by {container_name,env}, which is in nanoseconds. I'm unclear how many nanoseconds count as "high CPU usage". There's also container.cpu.limit, which is in nanocores. I had initially thought that dividing usage by limit would give us a percentage, but dividing nanoseconds by nanocores doesn't seem right.

πŸ• Datadog metrics explorer where I've been playing around.

Some options:

  1. Delete the two we have in favor of VETS API - PROD - RDS - High CPU Utilization. Is one monitor good enough, or do we need one for all containers in vets-api?
  2. Leave them as-is (alive but muted) so we're still able to see them from the Backend monitor dashboard. They don't seem overly flaky, but I don't know if we want them sending alerts.
  3. Convert to a metric monitor using the same query, but alert on a fixed (arbitrary?) number of seconds. (Different containers have different limits, though, which makes it hard to choose a fixed number.)
  4. Divide usage by limit.

rmtolmach commented 6 months ago

I have created this new metric monitor: VETS API - {{env.name}} - {{kube_deployment.name}} - {{container_name.name}} - High CPU Utilization. I will watch it and modify it as needed. The query we (Chris, Lindsey and I) ended up with was:

(
avg:kubernetes.cpu.usage.total{kube_namespace:vets-api} by {container_name,env,kube_deployment} /
avg:container.cpu.limit{kube_namespace:vets-api} by {container_name,env,kube_deployment}
) * 100

Still to do: after we've vetted our new monitors in the UI, export them to Terraform, then get the PR reviewed, merged, and terraform applied.
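For reference, the exported Terraform will probably end up looking something like the sketch below. This assumes the devops repo uses the Datadog provider's datadog_monitor resource; the resource name, message, notification target, tags, and thresholds are illustrative, and the real config should come from exporting the vetted monitor rather than from this sketch.

# Illustrative sketch only, not the exported config.
resource "datadog_monitor" "vets_api_high_cpu" {
  name    = "VETS API - {{env.name}} - {{kube_deployment.name}} - {{container_name.name}} - High CPU Utilization"
  type    = "query alert"
  message = "CPU usage is above threshold for {{container_name.name}}. @placeholder-notification-channel"

  query = "avg(last_5m):(avg:kubernetes.cpu.usage.total{kube_namespace:vets-api} by {container_name,env,kube_deployment} / avg:container.cpu.limit{kube_namespace:vets-api} by {container_name,env,kube_deployment}) * 100 > 80"

  monitor_thresholds {
    critical = 80
  }

  tags = ["team:platform-product"] # placeholder tag
}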

LindseySaari commented 6 months ago

Keeping an eye on the monitor. It's live in the UI, but the calculation doesn't seem correct.

rmtolmach commented 6 months ago

> Keeping an eye on the monitor. It's live in the UI, but the calculation doesn't seem correct.

After many hours of troubleshooting, the problem seemed to fix itself! The History graph and the Edit graph were showing very different values. I was messing around and changed the threshold from 85 to 20, and the monitor appeared to come to life. The two graphs were reporting the same numbers! 🎉 I've bumped the threshold to 80 and it's looking good. Slack troubleshooting thread.

rmtolmach commented 6 months ago

SIDEKIQ Job queue size is high

This one used to be a metric monitor. I created a new temporary monitor in the UI to test the query: https://vagov.ddog-gov.com/monitors/233349?view=spans
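For context, a plain metric alert on queue size generally has a shape like the one below. The metric name, tags, and threshold here are guesses for illustration only; the real query lives in the temporary monitor linked above.

max(last_10m):max:sidekiq.queue.size{env:prod} by {queue} > 1000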

rmtolmach commented 6 months ago

PR: https://github.com/department-of-veterans-affairs/devops/pull/14336

rmtolmach commented 6 months ago

Almost done with this! To do:

rmtolmach commented 6 months ago

All done! As a reminder, VETS API - {{env.name}} - {{controller.name}} - High 500 Errors and VETS API - {{env.name}} - {{controller.name}} - High 400 Errors were kept as anomaly monitors for the following reasons: