department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Replace anomalous monitors #78563

Closed: rmtolmach closed this issue 6 months ago

rmtolmach commented 8 months ago

Summary

It was brought to our attention by Kyle that anomaly monitors can be flaky and shouldn't be used. A number of new anomaly monitors were recently added for the Platform Product team. Find them by searching the devops repo for anomalies (not all of the results are ours, though). We need to switch to a different monitor type in Datadog (a metric alert).

Anomaly monitors are based on learning what's considered "normal" behavior. In dynamic environments where "normal" is constantly changing (external services, in our case), anomaly monitors can generate a high rate of false positives, making them too noisy for behavior that's actually expected.
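For illustration, the two query shapes look roughly like this in Datadog (the metric name and thresholds below are placeholders, not one of our real monitors):

Anomaly monitor (alerts when the metric deviates from the learned "normal" band):
avg(last_4h):anomalies(avg:vets_api.example.metric{env:prod}, 'basic', 2) >= 1

Metric alert (alerts on a fixed, explicit threshold):
avg(last_5m):avg:vets_api.example.metric{env:prod} > 100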

Tasks

Success Metrics

All of the monitors our team owns that were anomaly monitors have been replaced with metric monitors and are in use (not muted).

Acceptance criteria

rmtolmach commented 7 months ago

It looks like it's not possible to edit an anomaly monitor in place. When you click edit and change the detection method, Datadog treats it as a new monitor. So I'm creating new monitors instead. The one I'm currently working on is the Vets API scaling monitor (replacing this one).

jennb33 commented 7 months ago

One monitor has been started, and due to support coverage, this ticket is being moved out to Sprint 52.

rmtolmach commented 7 months ago

We have this non-anomalous monitor: VETS API - PROD - Check HPA Metrics, and this anomalous monitor: VETS API - PROD - EKS IS NOT SCALING!, and they appear to do similar things. I think we can delete the anomalous one and keep the same coverage.

jennb33 commented 6 months ago

Per @rmtolmach, there are about a dozen monitors left to replace.

rmtolmach commented 6 months ago

ClamAV Errors

Update on the Prod and Staging ClamAV monitors I created. The Staging one seems fine. The Prod one triggers during the vets-api deploy at 1pm ET and then resolves itself. The warnings it picks up are:

  1. WARNING: remote_cvdhead: Malformed CVD header (too short)
  2. WARNING: DNS Update Info disabled. Falling back to HTTP mode.
  3. WARNING: Cool-down expired, ok to try again.

Those same three warnings also occur during non-deploy times, just in smaller numbers. I think we're safe to ignore them. We can set up a recurring 30-minute mute covering the deploy window.
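A minimal sketch of what that recurring mute could look like, assuming we manage it with the Datadog Terraform provider's datadog_downtime resource (the resource name, monitor reference, and dates are placeholders):

# Hypothetical sketch: daily 30-minute mute around the 1pm ET vets-api deploy.
resource "datadog_downtime" "clamav_prod_deploy_mute" {
  scope      = ["env:prod"]
  monitor_id = datadog_monitor.clamav_errors_prod.id # placeholder monitor reference

  # First occurrence; repeats via the recurrence block below.
  start_date = "2025-01-06T13:00:00-05:00"
  end_date   = "2025-01-06T13:30:00-05:00"
  timezone   = "America/New_York"

  recurrence {
    type   = "days"
    period = 1
  }

  message = "Expected ClamAV warnings during the daily vets-api deploy."
}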

rmtolmach commented 6 months ago

High CPU Usage

The query we're using is avg:container.cpu.usage{kube_namespace:vets-api} by {container_name,env}, which is in nanoseconds. I'm unclear how many nanoseconds count as "high CPU usage". There's also container.cpu.limit, which is in nanocores. I had initially thought that dividing usage by limit would give us a percentage, but dividing nanoseconds by nanocores doesn't seem right.

πŸ• Datadog metrics explorer where I've been playing around.

Some options:

  1. Delete the two we have in favor of VETS API - PROD - RDS - High CPU Utilization. Is one monitor good enough, or do we need one for all containers in vets-api?
  2. Leave them as-is (alive but muted) so we're still able to see them from the Backend monitor dashboard. They don't seem overly flaky, but I don't know if we want them sending alerts.
  3. Convert to a metric monitor using the same query, but alert on a fixed (arbitrary?) number of seconds. (Different containers have different limits, though, which makes it hard to choose a fixed number.)
  4. Divide usage by limit.

rmtolmach commented 6 months ago

I have created this new metric monitor: VETS API - {{env.name}} - {{kube_deployment.name}} - {{container_name.name}} - High CPU Utilization. I will watch it and modify it as needed. The query we (Chris, Lindsey and I) ended up with was:

(
avg:kubernetes.cpu.usage.total{kube_namespace:vets-api} by {container_name,env,kube_deployment} /
avg:container.cpu.limit{kube_namespace:vets-api} by {container_name,env,kube_deployment}
) * 100

Still to do: after we've vetted our new monitors in the UI, export them to Terraform, then get the PR reviewed, merged, and terraform applied.
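For reference, the exported Terraform will probably end up looking something like the sketch below. This assumes the devops repo uses the Datadog provider's datadog_monitor resource; the resource name, message, notification target, tags, and thresholds are illustrative, and the real config should come from exporting the vetted monitor rather than from this sketch.

# Illustrative sketch only, not the exported config.
resource "datadog_monitor" "vets_api_high_cpu" {
  name    = "VETS API - {{env.name}} - {{kube_deployment.name}} - {{container_name.name}} - High CPU Utilization"
  type    = "query alert"
  message = "CPU usage is above threshold for {{container_name.name}}. @placeholder-notification-channel"

  query = "avg(last_5m):(avg:kubernetes.cpu.usage.total{kube_namespace:vets-api} by {container_name,env,kube_deployment} / avg:container.cpu.limit{kube_namespace:vets-api} by {container_name,env,kube_deployment}) * 100 > 80"

  monitor_thresholds {
    critical = 80
  }

  tags = ["team:platform-product"] # placeholder tag
}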

LindseySaari commented 6 months ago

Keeping an eye on the monitor. It's live in the UI, but the calculation doesn't seem correct.

rmtolmach commented 6 months ago

> Keeping an eye on the monitor. It's live in the UI, but the calculation doesn't seem correct.

After many hours of troubleshooting, the problem seemed to fix itself! The History graph and the Edit graph were showing very different values. I was messing around and changed the threshold from 85 to 20, and the monitor appeared to come to life. The two graphs were reporting the same numbers! 🎉 I've bumped the threshold to 80 and it's looking good. Slack troubleshooting thread.

rmtolmach commented 6 months ago

SIDEKIQ Job queue size is high

This one used to be a metric monitor. I created a new temporary monitor in the UI to test the query: https://vagov.ddog-gov.com/monitors/233349?view=spans
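For context, a plain metric alert on queue size generally has a shape like the one below. The metric name, tags, and threshold here are guesses for illustration only; the real query lives in the temporary monitor linked above.

max(last_10m):max:sidekiq.queue.size{env:prod} by {queue} > 1000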

rmtolmach commented 6 months ago

PR: https://github.com/department-of-veterans-affairs/devops/pull/14336

rmtolmach commented 6 months ago

Almost done with this! To do:

rmtolmach commented 6 months ago

All done! As a reminder, VETS API - {{env.name}} - {{controller.name}} - High 500 Errors and VETS API - {{env.name}} - {{controller.name}} - High 400 Errors were kept as anomaly monitors for the following reasons: