defenseunicorns / uds-core

A FOSS secure runtime platform for mission-critical capabilities
https://uds.defenseunicorns.com
GNU Affero General Public License v3.0

Netpols block kubeapi in long-lived EKS cluster #821

Open ntwkninja opened 3 weeks ago

ntwkninja commented 3 weeks ago

Environment

Device and OS: Bottlerocket
App version: 1.30
Kubernetes distro being used: AWS EKS
Other:

Steps to reproduce

  1. Deploy UDS Core with standard accoutrements
  2. Wait a few days for API IPs to change
  3. Trigger an action that requires the Kubernetes API
  4. Check metrics-server, NeuVector, monitoring, Promtail, etc. for errors (see the event query below)
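
A quick way to surface these symptoms is to list recent Warning events across all namespaces (standard kubectl flags; requires cluster-wide read access to events):

# show recent Warning events cluster-wide, oldest first
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp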

Expected result

netpols for the kubeapi are updated as AWS rotates the API server endpoint IPs

Actual Result

kubeapi addresses are not updated after being initially set

Visual Proof (screenshots, videos, text, etc)

NAMESPACE              LAST SEEN                  TYPE      REASON                    OBJECT                                          MESSAGE
istio-admin-gateway    31m (x56 over 21d)         Normal    SuccessfullyReconciled    Service/admin-ingressgateway                    Successfully reconciled
istio-login-gateway    31m (x55 over 21d)         Normal    SuccessfullyReconciled    Service/login-ingressgateway                    Successfully reconciled
istio-tenant-gateway   31m (x56 over 21d)         Normal    SuccessfullyReconciled    Service/tenant-ingressgateway                   Successfully reconciled
metrics-server         29m (x2451 over 4d19h)     Warning   Unhealthy                 Pod/metrics-server-59c9dddf69-8l4fk             Liveness probe failed: Get "http://100.64.75.152:15020/app-health/metrics-server/livez": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
metrics-server         4m1s (x29499 over 4d19h)   Warning   BackOff                   Pod/metrics-server-59c9dddf69-8l4fk             Back-off restarting failed container metrics-server in pod metrics-server-59c9dddf69-8l4fk_metrics-server(f619eae8-61d7-420c-a104-0c786e51242a)
istio-admin-gateway    2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/admin-ingressgateway    failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
istio-login-gateway    2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/login-ingressgateway    failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
istio-system           2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/istiod                  failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
istio-tenant-gateway   2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/tenant-ingressgateway   failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
keycloak               2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/keycloak                failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
zarf                   2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/zarf-docker-registry    failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

Severity/Priority

Additional Context

# get the new kube-apiserver endpoint IPs
# (run against the default namespace, where the cluster's kubernetes EndpointSlice lives)
IP1=$(kubectl get endpointslices.discovery.k8s.io -o json | jq -r '.items[0].endpoints[0] | select(.addresses != null) | .addresses[]' | head -n 1)
IP2=$(kubectl get endpointslices.discovery.k8s.io -o json | jq -r '.items[0].endpoints[1] | select(.addresses != null) | .addresses[]' | head -n 1)
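
From there, a rough way to confirm the staleness (a sketch; it assumes the generated policies embed the API IPs as ipBlock CIDRs):

# check whether the live API endpoint IPs appear in any NetworkPolicy;
# no match means the generated kubeapi netpols have gone stale
kubectl get networkpolicies -A -o yaml | grep -E "(${IP1}|${IP2})" \
  || echo "stale: live kubeapi IPs not found in any NetworkPolicy"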
mjnagel commented 3 weeks ago

Does this resolve itself after a pepr watcher pod restart? I think in the past we've seen this issue when pepr "stops watching" the endpoints.
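
For reference, the restart would look roughly like this (the deployment name is an assumption based on the usual pepr module naming for uds-core; check kubectl get deploy -n pepr-system for the exact name):

# restart the pepr watcher so it re-establishes its watch on the kubeapi endpoints
kubectl rollout restart -n pepr-system deployment/pepr-uds-core-watcher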

We have also floated the idea of adding a config option for end users to specify a CIDR range instead of relying on the pepr watch. We should probably just add that at this point given the inconsistency seen with the watch.
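orig="We have also floated">

As a sketch of that idea only (the variable name and wiring are hypothetical, since the option doesn't exist yet), it could be as simple as letting users pin the kubeapi egress rule to a known CIDR via the watcher's environment:

# hypothetical: pin the kubeapi egress CIDR (e.g. the VPC CIDR) instead of relying on the watch
kubectl set env -n pepr-system deployment/pepr-uds-core-watcher KUBEAPI_CIDR=10.0.0.0/16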

ntwkninja commented 3 weeks ago

Does this resolve itself after a pepr watcher pod restart? I think in the past we've seen this issue when pepr "stops watching" the endpoints.

I'll try restarting the watcher and report back.

That worked.