department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

[Datadog] Eval/Setup RED Monitoring for EKS:Clusters #49291

Closed ph-One closed 1 year ago

ph-One commented 2 years ago

Description

As a platform engineer, I need Datadog monitors and alerts to be evaluated and/or set up for EKS:Clusters.

Acceptance Criteria

Notes

jhouse-solvd commented 1 year ago

Note: This issue will carry into the next sprint. We've made significant progress, but there are several refinements needed to ensure comprehensive monitoring for EKS (Clusters & Ingress).

Also, we consulted with Platform Tech Team 1 today to inform them of the monitoring that's been put in place for EKS. They are migrating Vets-API to EKS and will be adding monitors for parts of the application that are not yet covered.

it-harrison commented 1 year ago

coredns

kubelet

jhouse-solvd commented 1 year ago

Re: Kube proxy, I'm not sure if this is needed, but you might take a look: https://github.com/DataDog/integrations-core/blob/master/kube_proxy/README.md

ph-One commented 1 year ago

https://vagov.ddog-gov.com/monitors/115057

jhouse-solvd commented 1 year ago

https://vagov.ddog-gov.com/monitors/115078

https://vagov.ddog-gov.com/monitors/115068

https://vagov.ddog-gov.com/monitors/115081

https://vagov.ddog-gov.com/monitors/115082

https://vagov.ddog-gov.com/monitors/115086

it-harrison commented 1 year ago

kubelet

jhouse-solvd commented 1 year ago

This issue is nearly done but will stay open into the next sprint so that we can get it properly reviewed.

jhouse-solvd commented 1 year ago

Multiple team members have been out of the office recently, which has impacted our ability to complete this.

But now that @ph-One is back, he should be able to review and provide feedback, and then we'll get this over the line.

npeterson54 commented 1 year ago

Not quite complete, rolling into next sprint

npeterson54 commented 1 year ago

The kubelet duration is not making sense and needs to be investigated further. Let's potentially spin up another ticket for it.

it-harrison commented 1 year ago

Closing this ticket and breaking out the latency monitor into its own ticket.