Agent Controller Informer cache build fails with a timeout

AbeOwlu commented 7 months ago

What happened: The node agent fails to start on a new node. This shows up as a widespread failure on that node: every pod fails to get a network set up for its pod sandbox.

Attach logs

{"level":"info","ts":"2024-04-23T18:10:26.313Z","logger":"controllers.policyEndpoints","caller":"runtime/proc.go:267","msg":"ConntrackTTL","cleanupPeriod":300}
{"level":"info","ts":"2024-04-23T18:10:26.313Z","logger":"setup","caller":"runtime/asm_amd64.s:1650","msg":"starting manager"}
{"level":"info","ts":"2024-04-23T18:10:26.313Z","logger":"controller-runtime.metrics","caller":"runtime/asm_amd64.s:1650","msg":"Starting metrics server"}
{"level":"info","ts":"2024-04-23T18:10:26.313Z","caller":"runtime/asm_amd64.s:1650","msg":"starting server","kind":"health probe","addr":"[::]:8163"}
{"level":"info","ts":"2024-04-23T18:10:26.313Z","logger":"setup","msg":"Serving metrics on ","port":61680}
{"level":"info","ts":"2024-04-23T18:10:26.314Z","caller":"manager/runnable_group.go:223","msg":"Starting EventSource","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint","source":"kind source: *v1alpha1.PolicyEndpoint"}
{"level":"info","ts":"2024-04-23T18:10:26.314Z","caller":"manager/runnable_group.go:223","msg":"Starting Controller","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint"}
{"level":"info","ts":"2024-04-23T18:10:26.313Z","logger":"controller-runtime.metrics","caller":"runtime/asm_amd64.s:1650","msg":"Serving metrics server","bindAddress":":8162","secure":false}
{"level":"error","ts":"2024-04-23T18:10:56.315Z","logger":"controller-runtime.source.EventHandler","caller":"wait/loop.go:50","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: networking.k8s.aws/v1alpha1: Get \"https://172.20.0.1:443/apis/networking.k8s.aws/v1alpha1\": dial tcp 172.20.0.1:443: i/o timeout"}
{"level":"error","ts":"2024-04-23T18:11:26.315Z","logger":"controller-runtime.source.EventHandler","caller":"wait/loop.go:74","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: networking.k8s.aws/v1alpha1: Get \"https://172.20.0.1:443/apis/networking.k8s.aws/v1alpha1\": dial tcp 172.20.0.1:443: i/o timeout"}
{"level":"error","ts":"2024-04-23T18:12:06.317Z","logger":"controller-runtime.source.EventHandler","caller":"wait/loop.go:74","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: networking.k8s.aws/v1alpha1: Get \"https://172.20.0.1:443/apis/networking.k8s.aws/v1alpha1\": dial tcp 172.20.0.1:443: i/o timeout"}
{"level":"error","ts":"2024-04-23T18:12:26.314Z","caller":"controller/controller.go:234","msg":"Could not wait for Cache to sync","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint","error":"failed to wait for policyendpoint caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.PolicyEndpoint"}
{"level":"info","ts":"2024-04-23T18:12:26.314Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2024-04-23T18:12:26.314Z","msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":"2024-04-23T18:12:26.314Z","msg":"Stopping and waiting for caches"}
{"level":"error","ts":"2024-04-23T18:12:36.324Z","logger":"controller-runtime.source.EventHandler","caller":"wait/loop.go:74","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: networking.k8s.aws/v1alpha1: Get \"https://172.20.0.1:443/apis/networking.k8s.aws/v1alpha1\": dial tcp 172.20.0.1:443: i/o timeout"}
{"level":"info","ts":"2024-04-23T18:12:36.324Z","msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":"2024-04-23T18:12:36.324Z","msg":"Stopping and waiting for HTTP servers"}
{"level":"info","ts":"2024-04-23T18:12:36.324Z","logger":"controller-runtime.metrics","msg":"Shutting down metrics server with timeout of 1 minute"}
{"level":"info","ts":"2024-04-23T18:12:36.324Z","msg":"shutting down server","kind":"health probe","addr":"[::]:8163"}
{"level":"info","ts":"2024-04-23T18:12:36.324Z","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":"2024-04-23T18:12:36.324Z","logger":"setup","caller":"runtime/asm_amd64.s:1650","msg":"problem running manager","error":"failed to wait for policyendpoint caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.PolicyEndpoint"}
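
For reading the timeline in these logs, a small sketch like the one below can measure how long the agent waited before giving up on the cache sync. The helper names (`parse_line`, `parse_ts`, `seconds_between`) are made up for illustration, and the `sample` lines are trimmed copies of three entries from the logs, keeping only the `level`, `ts`, `msg`, and `controller` fields. The parser tolerates a missing leading `{` in case the paste dropped it.

```python
import json
from datetime import datetime


def parse_line(line):
    """Parse one agent log line, adding a leading '{' if the paste lost it."""
    line = line.strip()
    if not line.startswith("{"):
        line = "{" + line
    return json.loads(line)


def parse_ts(ts):
    # Timestamps in the logs look like 2024-04-23T18:10:26.314Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def seconds_between(lines, start_msg, end_msg):
    """Seconds from the first entry with start_msg to the first with end_msg."""
    entries = [parse_line(l) for l in lines if l.strip()]
    start = next(e for e in entries if e["msg"] == start_msg)
    end = next(e for e in entries if e["msg"] == end_msg)
    return (parse_ts(end["ts"]) - parse_ts(start["ts"])).total_seconds()


# Trimmed copies of three log entries (most fields removed for brevity).
sample = [
    '"level":"info","ts":"2024-04-23T18:10:26.314Z","msg":"Starting Controller","controller":"policyendpoint"}',
    '"level":"error","ts":"2024-04-23T18:10:56.315Z","msg":"failed to get informer from cache"}',
    '"level":"error","ts":"2024-04-23T18:12:26.314Z","msg":"Could not wait for Cache to sync","controller":"policyendpoint"}',
]

print(seconds_between(sample, "Starting Controller", "Could not wait for Cache to sync"))  # 120.0
```

The 120-second gap suggests the controller kept retrying the API server for two minutes and then shut the manager down, which is consistent with a sync timeout rather than a crash in the agent itself.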

What you expected to happen: Normal startup. Failing that, a failed node agent container should not cause the aws-node pod to fail and lead to CNI issues.

How to reproduce it (as minimally and precisely as possible): Unable to reproduce. It occurred on only 3 nodes in the impacted cluster.

Anything else we need to know?: Should the node agent be opt-in for clusters that do not use NetworkPolicy resources?

Environment: as provided with custom networking enabled

Kubernetes version (use kubectl version): 1.29
CNI Version: 1.16.2
Network Policy Agent Version: 1.1.0
OS (e.g: cat /etc/os-release): RHEL (enterprise linux 8)
Kernel (e.g. uname -a): 4.18.0-513.18.1.el8_9.x86_64

achevuru commented 5 months ago

@AbeOwlu It looks like we're running into API server access issues on these nodes. Can you try accessing the API server from the problematic nodes? If so, we should see an issue even with the CNI pods. Also, please check the status of the kube-proxy pods on these nodes.
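
One way to run that check from a node is a plain TCP connect with a timeout. The sketch below is only an illustration: `can_reach` is a hypothetical helper name, and it tests whether a TCP connection can be opened, not whether the API server answers requests or accepts credentials.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return (True, None) if a TCP connection opens, else (False, error text)."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        return False, str(exc)
```

The logs report `dial tcp 172.20.0.1:443: i/o timeout`, so calling `can_reach("172.20.0.1", 443)` on an affected node would be expected to return `False` with a timeout message, while a healthy node should return `True`.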

AbeOwlu commented 5 months ago

Thanks @achevuru. We can go ahead and close this. The network manager on the AL23-based AMI built some weird network configurations: net forwarding was switched off, etc.
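
For anyone hitting the same symptom, a quick check of the forwarding flag might look like the sketch below. It assumes that "net forwarding" here refers to the kernel's IPv4 forwarding setting (`net.ipv4.ip_forward`), which the comment does not confirm; `ip_forward_enabled` is a hypothetical helper name.

```python
from pathlib import Path


def ip_forward_enabled(path="/proc/sys/net/ipv4/ip_forward"):
    """Return True if the IPv4 forwarding flag file contains '1'."""
    # Assumption: this sysctl is the setting that was switched off on the node.
    return Path(path).read_text().strip() == "1"
```

Running `ip_forward_enabled()` on the node returns `False` when the flag is `0`. The `path` parameter exists only so the function can be pointed at a test file.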

achevuru commented 5 months ago

Ok, thanks for the update.

aws / aws-network-policy-agent

Agent Controller Informer cache build fails with a timeout #260