defenseunicorns / uds-core

A FOSS secure runtime platform for mission-critical capabilities
https://uds.defenseunicorns.com
GNU Affero General Public License v3.0

prometheus is missing container metrics from certain nodes #970

Open noahpb opened 3 weeks ago

noahpb commented 3 weeks ago

Environment

Device and OS: darwin arm64
App version: v0.29.1-unicorn
Kubernetes distro being used: k3d with two nodes

Steps to reproduce

  1. Create a k3d cluster with additional nodes
    $ kubectl get node
    NAME               STATUS   ROLES                  AGE   VERSION
    k3d-agent1-0       Ready    <none>                 23m   v1.30.4+k3s1
    k3d-uds-server-0   Ready    control-plane,master   25m   v1.30.4+k3s1
  2. Deploy uds-core with monitoring

Expected result

Container metrics such as CPU and Memory utilization should be queryable

Actual Result

Prometheus only returns metrics from pods that are scheduled on control plane nodes

Visual Proof (screenshots, videos, text, etc)

Metrics returned for container_cpu_usage_seconds: [image]

No metrics returned when filtering out the control plane node: [image]

Severity/Priority

Moderate

Additional Context

Removing all NetworkPolicies in the monitoring namespace allows Prometheus to pick up metrics from the missing nodes.

joelmccoy commented 3 weeks ago

Internal related issue: https://github.com/defenseunicorns/uds-infrastructure/issues/573

noahpb commented 3 weeks ago

Thanks to @rjferguson21's suggestion, we've been able to confirm that the allow-prometheus-stack-egress-metrics-scraping NetworkPolicy generated by the operator needs to be adjusted. The remoteNamespace: "" specification is not permissive enough to allow egress traffic to the prometheus-node-exporter daemonset pods. Manually adjusting the egress specification of the NetworkPolicy to the CIDR range of the nodes worked in my local testing.
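The manual workaround described above could be sketched roughly as follows. This is only an illustration, not the operator's code: the helper name `nodeCidrEgressRule` is hypothetical, and the CIDR `172.18.0.0/16` is an assumed example range for a local k3d Docker network, not a value taken from this issue. The field names (`to`, `ipBlock`, `cidr`) follow the `networking.k8s.io/v1` NetworkPolicy schema.

```typescript
// Minimal subset of the networking.k8s.io/v1 NetworkPolicy types used here.
interface IPBlock {
  cidr: string;
  except?: string[];
}
interface NetworkPolicyPeer {
  ipBlock?: IPBlock;
}
interface NetworkPolicyEgressRule {
  to: NetworkPolicyPeer[];
}

// Hypothetical helper: build an egress rule that allows traffic to the
// CIDR range the cluster nodes live in.
function nodeCidrEgressRule(nodeCidr: string): NetworkPolicyEgressRule {
  // Shape check only (IPv4 "a.b.c.d/n"); this is not full CIDR validation.
  if (!/^\d{1,3}(\.\d{1,3}){3}\/\d{1,2}$/.test(nodeCidr)) {
    throw new Error(`not an IPv4 CIDR: ${nodeCidr}`);
  }
  return { to: [{ ipBlock: { cidr: nodeCidr } }] };
}

// Assumed example range; substitute the CIDR of your own nodes.
const rule = nodeCidrEgressRule("172.18.0.0/16");
console.log(JSON.stringify(rule));
```

A rule of this shape would replace the namespace-based peer in the egress section of the generated policy. Allowing a whole CIDR is broader than strictly necessary, which is part of why a per-node target is attractive.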

mjnagel commented 2 weeks ago

To resolve this, I would suggest we build an AllNodes generated target. We should be able to build that list of IPs using a watch on the nodes with Pepr, similar to our KubeAPI target. This would also be helpful for metrics-server, which has an Anywhere rule with a TODO comment to switch it to an all-nodes target.

Code links for current kubeapi logic:

Once this is added as a generated target we can add it to Prometheus and make sure that the traffic works as expected.
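As a rough sketch of the bookkeeping such a target might need (assuming nothing about the existing KubeAPI implementation), the class below tracks node IPs as nodes are added, updated, or deleted and turns them into per-node `ipBlock` peers. `NodeTargetTracker`, `NodeLike`, and the example IPs are hypothetical; the `InternalIP` address type and the `status.addresses` field come from the core/v1 Node schema. The Pepr watch wiring is deliberately left out.

```typescript
// Minimal subset of the core/v1 Node shape used here.
interface NodeAddress {
  type: string;
  address: string;
}
interface NodeLike {
  metadata: { name: string };
  status?: { addresses?: NodeAddress[] };
}
interface NodePeer {
  ipBlock: { cidr: string };
}

// Hypothetical tracker: call upsert() on node create/update events and
// remove() on node delete events, then regenerate policies from peers().
class NodeTargetTracker {
  private ipsByNode = new Map<string, string[]>();

  upsert(node: NodeLike): void {
    const ips = (node.status?.addresses ?? [])
      .filter((a) => a.type === "InternalIP")
      .map((a) => a.address);
    this.ipsByNode.set(node.metadata.name, ips);
  }

  remove(nodeName: string): void {
    this.ipsByNode.delete(nodeName);
  }

  // One single-host CIDR per unique node IP (/32 for IPv4, /128 for IPv6).
  peers(): NodePeer[] {
    const unique = new Set<string>();
    this.ipsByNode.forEach((ips) => ips.forEach((ip) => unique.add(ip)));
    return Array.from(unique)
      .sort()
      .map((ip) => ({
        ipBlock: { cidr: ip.includes(":") ? `${ip}/128` : `${ip}/32` },
      }));
  }
}

// Node names are from the issue; the IPs are made up for illustration.
const tracker = new NodeTargetTracker();
tracker.upsert({
  metadata: { name: "k3d-uds-server-0" },
  status: { addresses: [{ type: "InternalIP", address: "172.18.0.2" }] },
});
tracker.upsert({
  metadata: { name: "k3d-agent1-0" },
  status: {
    addresses: [
      { type: "InternalIP", address: "172.18.0.3" },
      { type: "Hostname", address: "k3d-agent1-0" },
    ],
  },
});
console.log(JSON.stringify(tracker.peers()));
```

Keying the map by node name means an update that changes a node's IP replaces the old entry rather than leaving a stale one behind.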