cds-snc / notification-planning-core

Project planning for GC Notify Core Team

pod counts no longer show up in System Overview dashboard #398

Closed sastels closed 5 days ago

sastels commented 3 months ago

Describe the bug

The pod counts don't show up here anymore!

Looks like the graphs stopped having data on July 25.

Bug Severity

See examples in the documentation

SEV-3 Minor - no effect on the system, but we can't monitor how it's scaling

To Reproduce

  1. Go to the Notify System Overview dashboard

  2. See that there's no pod count data

Expected behavior

Data is there

Impact

Cannot easily monitor the system

Screenshots

image.png

sastels commented 3 months ago

The last data on staging is at 15:09 EST on July 24. The manifest PR "Paring down kube state metrics" was merged at that same minute.

This PR adds an allowlist for the k8s metrics:

metricAllowlist:
  - pods=[*]
  - nodes=[*]
  - deployments=[*]

Unfortunately the pod counts per deployment are not in there, though the node count is. However, I think the node count is coming from EKS.

The metric we want is kube_deployment_status_replicas_available. According to the docs it needs a namespace? Maybe we need to add

- namespaces=[notification-canada-ca]

to this list

sastels commented 3 months ago

It looks like maybe we're not getting any of the metrics we used to...

sastels commented 3 months ago

This PR, which tried to allow all namespaces, didn't do anything :disappointed:

This is how the kube-state-metrics pod starts. I see that there's an explicit "Using all namespaces" before the allowlist details, so we likely don't have to mention namespaces.

I0801 17:43:14.752112       1 wrapper.go:120] "Starting kube-state-metrics"
W0801 17:43:14.752422       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0801 17:43:14.752586       1 server.go:199] "Used resources" resources=["leases","poddisruptionbudgets","endpoints","ingresses","jobs","persistentvolumes","replicationcontrollers","storageclasses","persistentvolumeclaims","pods","services","configmaps","mutatingwebhookconfigurations","networkpolicies","secrets","validating
I0801 17:43:14.752625       1 types.go:227] "Using all namespaces"
I0801 17:43:14.752637       1 types.go:145] "Using node type is nil"
I0801 17:43:14.752717       1 server.go:226] "Metric allow-denylisting" allowDenyStatus="Including the following lists that were on allowlist: pods=[*], nodes=[*], deployments=[*], namespaces=[*]"
W0801 17:43:14.752747       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0801 17:43:14.753188       1 utils.go:70] "Tested communication with server"
I0801 17:43:14.763046       1 utils.go:75] "Run with Kubernetes cluster version" major="1" minor="30+" gitVersion="v1.30.2-eks-db838b0" gitTreeState="clean" gitCommit="04088714581f0ad0a9e2c81c6ecc36bdd30d4b53" platform="linux/amd64"
I0801 17:43:14.763065       1 utils.go:76] "Communication with server successful"
I0801 17:43:14.764078       1 server.go:350] "Started metrics server" metricsServerAddress=":8080"
I0801 17:43:14.764290       1 server.go:73] levelinfomsgListening onaddress:8080
I0801 17:43:14.764302       1 server.go:73] levelinfomsgTLS is disabled.http2falseaddress:8080
I0801 17:43:14.764340       1 metrics_handler.go:99] "Autosharding disabled"
I0801 17:43:14.765003       1 server.go:339] "Started kube-state-metrics self metrics server" telemetryAddress=":8081"
I0801 17:43:14.765060       1 server.go:73] levelinfomsgListening onaddress:8081
I0801 17:43:14.765072       1 server.go:73] levelinfomsgTLS is disabled.http2falseaddress:8081
I0801 17:43:14.766208       1 builder.go:282] "Active resources" activeStoreNames="certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes
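
The "Metric allow-denylisting" line is the quickest way to confirm which allowlist the running pod actually picked up. Here is a minimal sketch for pulling the entries out of that line. It assumes the line keeps the wording shown in the log above, and the helper name `allowlist_entries` is made up for illustration.

```python
import re

# The allow-denylisting line copied from the startup log above.
LOG_LINE = (
    'I0801 17:43:14.752717       1 server.go:226] "Metric allow-denylisting" '
    'allowDenyStatus="Including the following lists that were on allowlist: '
    'pods=[*], nodes=[*], deployments=[*], namespaces=[*]"'
)


def allowlist_entries(log_line: str) -> list[str]:
    """Extract the allowlist entries from a 'Metric allow-denylisting' log line."""
    marker = "on allowlist: "
    start = log_line.find(marker)
    if start == -1:
        return []  # not an allow-denylisting line
    tail = log_line[start + len(marker):].rstrip().rstrip('"')
    # Split on commas that are not inside a [...] group.
    parts = re.split(r",\s*(?![^\[]*\])", tail)
    return [part.strip() for part in parts if part.strip()]


print(allowlist_entries(LOG_LINE))
# ['pods=[*]', 'nodes=[*]', 'deployments=[*]', 'namespaces=[*]']
```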
sastels commented 3 months ago

Reverting Ben's PR restored the pod metrics. Staging is better now. Presumably production will be as well after this gets released.
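
For checking this without the dashboard, here is a rough sketch that sums `kube_deployment_status_replicas_available` per deployment from the text that kube-state-metrics serves on its metrics port (`:8080` in the startup log above). The sample text and the deployment names in it are made up for illustration, and the parser is a simplification that assumes label values contain no `}` or escaped quotes.

```python
import re
from collections import defaultdict

# Matches "name{labels} value" sample lines; comment lines starting with '#' do not match.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')

TARGET = "kube_deployment_status_replicas_available"


def available_replicas(metrics_text: str, namespace: str) -> dict[str, float]:
    """Sum available replicas per deployment for one namespace."""
    totals: dict[str, float] = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE_LINE.match(line)
        if not match or match.group("name") != TARGET:
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        if labels.get("namespace") == namespace:
            totals[labels.get("deployment", "")] += float(match.group("value"))
    return dict(totals)


# Made-up sample in the Prometheus text exposition format (hypothetical deployments).
SAMPLE = """\
# TYPE kube_deployment_status_replicas_available gauge
kube_deployment_status_replicas_available{namespace="notification-canada-ca",deployment="example-api"} 3
kube_deployment_status_replicas_available{namespace="notification-canada-ca",deployment="example-worker"} 2
kube_deployment_status_replicas_available{namespace="kube-system",deployment="coredns"} 2
"""

print(available_replicas(SAMPLE, "notification-canada-ca"))
# {'example-api': 3.0, 'example-worker': 2.0}
```

An empty result for our namespace would correspond to what the dashboard showed while the allowlist was in place.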

P0NDER0SA commented 3 months ago

QA'ed! all good. moving to done