It looks like it's not possible to edit an anomaly monitor in place: when you click edit and change the detection method, it becomes a new monitor. So I'm creating a new monitor. The one I'm currently working on is the Vets API scaling monitor (replacing this one).
One monitor has been started, and due to support coverage, this ticket is being moved out to Sprint 52.
We have this non-anomalous monitor: VETS API - PROD - Check HPA Metrics and the anomalous monitor: VETS API - PROD - EKS IS NOT SCALING! They appear to do similar things, so I think we can delete the anomalous one and still have the same coverage.
Per @rmtolmach, there are about a dozen monitors left to replace.
Monitor 204953 has been deleted in the dd UI, and I have a branch with it deleted in terraform as well (deleting it b/c VETS API - PROD - Check HPA Metrics covers it).
Monitor 215822 can be deleted. I'm replacing it with two Log monitors: one for a higher env (prod) and one for a lower env (staging). Unfortunately, Datadog doesn't support conditional thresholds for log monitors, so I can't create one monitor that triggers on different thresholds based on environment. I will monitor these two new monitors before committing them to terraform.
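For reference, here is a rough sketch of what the two separate log monitors could look like once exported to terraform. This assumes the datadog_monitor resource from the Datadog terraform provider (3.x, with the monitor_thresholds block); the log queries, thresholds, and the @slack handle are placeholders, not the real values from the monitors above.

```hcl
# Rough sketch only: two separate log monitors because Datadog log monitors
# don't support per-environment conditional thresholds. Queries, thresholds,
# and the notification handle are placeholders.
resource "datadog_monitor" "vets_api_prod_log_monitor" {
  name    = "VETS API - PROD - Example Log Monitor"
  type    = "log alert"
  message = "Matching log volume is high in prod. @slack-example-channel"

  # The comparison value in the query must match the critical threshold below.
  query = "logs(\"service:vets-api env:prod status:error\").index(\"*\").rollup(\"count\").last(\"15m\") > 500"

  monitor_thresholds {
    critical = 500
  }
}

resource "datadog_monitor" "vets_api_staging_log_monitor" {
  name    = "VETS API - STAGING - Example Log Monitor"
  type    = "log alert"
  message = "Matching log volume is high in staging. @slack-example-channel"

  # The lower environment gets its own monitor with its own threshold.
  query = "logs(\"service:vets-api env:staging status:error\").index(\"*\").rollup(\"count\").last(\"15m\") > 100"

  monitor_thresholds {
    critical = 100
  }
}
```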
The 400 monitor has been updated in the UI for now. I will watch it to see if it stabilizes. (I've also updated the terraform on my branch to match my change in the UI.)
One existing monitor is grouped by kube_deployment and one is by container_name (and then there is a third in the UI only, High CPU Usage by pods, that we might want to delete?). I spent some time today trying to come up with something new.
Update on the Prod and Staging ClamAV monitors I created: the staging one seems fine. The Prod one triggers during the vets-api deploy at 1pm ET and then resolves itself. During the deploy it reports warnings like these:
WARNING: remote_cvdhead: Malformed CVD header (too short)
WARNING: DNS Update Info disabled. Falling back to HTTP mode.
WARNING: Cool-down expired, ok to try again.
Those same three warnings occur during non-deploy times, only in smaller numbers, so I think we're safe to ignore them. We can set up a recurring 30-minute mute covering the deploy window.
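If we go that route, here is a rough sketch of the recurring mute in terraform, assuming the datadog_downtime resource (the provider also has a newer datadog_downtime_schedule). The scope, monitor tag, epoch timestamps, and daily recurrence are placeholder assumptions about the deploy schedule, not values taken from this thread.

```hcl
# Rough sketch of a recurring 30-minute mute around the 1pm ET deploy.
# Scope, tag, timestamps, and recurrence are placeholders; adjust to the
# real deploy window and to the monitors that should be muted.
resource "datadog_downtime" "vets_api_deploy_mute" {
  scope = ["env:prod"]

  # First occurrence: 1:00pm-1:30pm ET on an arbitrary start date (epoch seconds).
  start    = 1735927200
  end      = 1735929000
  timezone = "America/New_York"

  recurrence {
    type   = "days"
    period = 1 # repeats daily; switch to a weekly/week_days recurrence if deploys skip weekends
  }

  # Only mute monitors carrying this (placeholder) tag, not everything in scope.
  monitor_tags = ["service:clamav"]

  message = "Recurring mute for the vets-api deploy window."
}
```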
The query we're using is avg:container.cpu.usage{kube_namespace:vets-api} by {container_name,env}, which is in nanoseconds. I'm unclear how many nanoseconds means "high CPU usage". There's also container.cpu.limit, which is in nanocores. I had initially thought that dividing usage/limit would give us a percentage, but dividing nanoseconds by nanocores doesn't seem right.
Here's the Datadog metrics explorer where I've been playing around.
Some options:
I have created this new metric monitor: VETS API - {{env.name}} - {{kube_deployment.name}} - {{container_name.name}} - High CPU Utilization. I will watch it and modify it as needed. The query we (Chris, Lindsey and I) ended up with was:
(
avg:kubernetes.cpu.usage.total{kube_namespace:vets-api} by {container_name,env,kube_deployment} /
avg:container.cpu.limit{kube_namespace:vets-api} by {container_name,env,kube_deployment}
) * 100
Still to-do: After we've vetted our new monitors in the UI, export them to terraform then get PR reviewed, merged, and tf applied.
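As a sketch of what that export could look like for the new High CPU Utilization monitor, assuming the datadog_monitor resource with a "query alert" type: only the usage/limit ratio query comes from the comment above; the evaluation window, threshold, message, and tags are placeholders to be tuned.

```hcl
# Sketch of the exported High CPU Utilization monitor. Only the usage/limit
# ratio query comes from the thread; window, threshold, message, and tags
# are placeholders.
resource "datadog_monitor" "vets_api_high_cpu_utilization" {
  name = "VETS API - {{env.name}} - {{kube_deployment.name}} - {{container_name.name}} - High CPU Utilization"
  type = "query alert"

  # CPU usage as a percentage of the container CPU limit, evaluated per
  # container/deployment/env over a placeholder 5-minute window.
  query = "avg(last_5m):(avg:kubernetes.cpu.usage.total{kube_namespace:vets-api} by {container_name,env,kube_deployment} / avg:container.cpu.limit{kube_namespace:vets-api} by {container_name,env,kube_deployment}) * 100 > 85"

  message = "CPU utilization is high for {{kube_deployment.name}}/{{container_name.name}} in {{env.name}}. @slack-example-channel"

  monitor_thresholds {
    critical = 85 # placeholder; tune after watching the monitor in the UI
  }

  tags = ["team:platform-product"]
}
```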
Keeping an eye on the monitor. It's live in the UI, but the calculation doesn't seem correct.
After many hours of troubleshooting, the problem seemed to fix itself! The History graph and the Edit graph were very different. I was messing around and changed the threshold from 85 to 20, and the monitor appeared to come to life. The two graphs were reporting the same numbers! I've bumped the threshold to 80 and it's looking good.
Slack troubleshooting thread.
This one used to be a metric monitor. I created a new temporary monitor in the UI to test the query: https://vagov.ddog-gov.com/monitors/233349?view=spans
Almost done with this! To do:
All done! As a reminder, VETS API - {{env.name}} - {{controller.name}} - High 500 Errors and VETS API - {{env.name}} - {{controller.name}} - High 400 Errors were kept as anomaly monitors for the following reasons:
Summary
It was brought to our attention by Kyle that anomaly monitors can be flaky and shouldn't be used. A number of new anomaly monitors were recently added to the Platform Product team; find them by searching the devops repo for anomalies (not all of those are ours, though). We need to use a different monitor type in Datadog (a metric alert). Anomaly monitors are based on learning what's considered "normal" behavior. In dynamic environments where normal is constantly changing (think external services, in our case), anomaly monitors can generate a high rate of false positives, making them too noisy for behavior that's actually expected.
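To make the distinction concrete, here is the difference between the two query styles, using a generic metric as a stand-in (the metric name, windows, and thresholds are illustrative only, not our actual monitors):

```hcl
# Illustrative only; metric names, windows, and thresholds are placeholders.
locals {
  # Anomaly monitor query: alerts when the metric deviates from what the
  # algorithm has learned is "normal" for that time of day/week.
  anomaly_query = "avg(last_4h):anomalies(avg:system.cpu.user{service:example}, 'basic', 2) >= 1"

  # Plain metric alert query: alerts on a fixed, explicit threshold we choose.
  metric_alert_query = "avg(last_15m):avg:system.cpu.user{service:example} > 90"
}
```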
Tasks
Success Metrics
All of the anomaly monitors our team owns have been replaced with metric monitors and are being used (not muted).
Acceptance criteria