Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

konnectivity-agent pods spamming error messages #4423

Closed by bvrxs2yyw9fq49n5 2 weeks ago

bvrxs2yyw9fq49n5 commented 1 month ago

Hi all!

I created a 1.28.10 cluster. The VNet is 10.1.0.0/16 and the network plugin is Azure CNI in Overlay mode. The outbound type is userAssignedNATGateway: I created a NAT Gateway and assigned a route table to the AKS node pool's subnet. The Pod CIDR is 192.168.0.0/16, and the default node pool launches nodes in subnet `10.1.0.0/24`.

Everything seems to be working fine. I set up ingress, I can run my workloads, and they are accessible. The thing that worries me is the amount of log messages the konnectivity-agent pods generate. They seem to start fine:

I0719 13:51:09.906086       1 options.go:124] AgentCert set to "/certs/client.crt".
I0719 13:51:09.906148       1 options.go:125] AgentKey set to "/certs/client.key".
I0719 13:51:09.906155       1 options.go:126] CACert set to "/certs/ca.crt".
I0719 13:51:09.906160       1 options.go:127] ProxyServerHost set to "API_ADDRESS_HERE".
I0719 13:51:09.906166       1 options.go:128] ProxyServerPort set to 443.
I0719 13:51:09.906172       1 options.go:129] ALPNProtos set to [konnectivity].
I0719 13:51:09.906180       1 options.go:130] HealthServerHost set to
I0719 13:51:09.906185       1 options.go:131] HealthServerPort set to 8082.
I0719 13:51:09.906190       1 options.go:132] Admin bind address set to "127.0.0.1".
I0719 13:51:09.906195       1 options.go:133] AdminServerPort set to 8094.
I0719 13:51:09.906205       1 options.go:134] EnableProfiling set to false.
I0719 13:51:09.906217       1 options.go:135] EnableContentionProfiling set to false.
I0719 13:51:09.906224       1 options.go:136] AgentID set to 73a9eadf-d097-48ac-9ab1-18bbb0923d1f.
I0719 13:51:09.906230       1 options.go:137] SyncInterval set to 1s.
I0719 13:51:09.906236       1 options.go:138] ProbeInterval set to 1s.
I0719 13:51:09.906240       1 options.go:139] SyncIntervalCap set to 10s.
I0719 13:51:09.906243       1 options.go:140] Keepalive time set to 30s.
I0719 13:51:09.906247       1 options.go:141] ServiceAccountTokenPath set to "".
I0719 13:51:09.906253       1 options.go:142] AgentIdentifiers set to default-route=true.
I0719 13:51:09.906258       1 options.go:143] WarnOnChannelLimit set to false.
I0719 13:51:09.906262       1 options.go:144] SyncForever set to false.
I0719 13:51:11.616912       1 server.go:128] %s check failed:
%v server-connected [+]ping ok
[-]server-connected failed: no servers connected

I0719 13:51:14.945772       1 client.go:210] "Connect to server" serverID="3eeb13b7-a7a9-4627-8630-5fd750e0a013"
I0719 13:51:14.945795       1 clientset.go:222] "sync added client connecting to proxy server" serverID="3eeb13b7-a7a9-4627-8630-5fd750e0a013"
I0719 13:51:14.945830       1 client.go:321] "Start serving" serverID="3eeb13b7-a7a9-4627-8630-5fd750e0a013" agentID="73a9eadf-d097-48ac-9ab1-18bbb0923d1f"
I0719 13:51:16.043727       1 client.go:210] "Connect to server" serverID="3eeb13b7-a7a9-4627-8630-5fd750e0a013"
I0719 13:51:17.157205       1 client.go:210] "Connect to server" serverID="3eeb13b7-a7a9-4627-8630-5fd750e0a013"
I0719 13:51:18.247053       1 client.go:210] "Connect to server" serverID="2c3d0c09-5890-400c-888c-176b6b589a20"
I0719 13:51:18.247090       1 clientset.go:222] "sync added client connecting to proxy server" serverID="2c3d0c09-5890-400c-888c-176b6b589a20"
I0719 13:51:18.247114       1 client.go:321] "Start serving" serverID="2c3d0c09-5890-400c-888c-176b6b589a20" agentID="73a9eadf-d097-48ac-9ab1-18bbb0923d1f"

and then I see a bunch of errors:

I0719 14:42:36.801674       1 client.go:420] "error dialing backend" error="dial tcp 10.1.0.5:19100: connect: connection refused" dialID=3151522467267294482 connectionID=193 dialAddress="10.1.0.5:19100"
I0719 14:42:43.520632       1 client.go:528] "remote connection EOF" connectionID=194
I0719 14:42:54.327954       1 client.go:528] "remote connection EOF" connectionID=195
I0719 14:43:15.198096       1 client.go:420] "error dialing backend" error="dial tcp 10.1.0.4:19100: connect: connection refused" dialID=587837417944810190 connectionID=197 dialAddress="10.1.0.4:19100"
I0719 14:43:23.523991       1 client.go:528] "remote connection EOF" connectionID=198
I0719 14:43:27.183415       1 client.go:420] "error dialing backend" error="dial tcp 10.1.0.6:19100: connect: connection refused" dialID=3741448814715253397 connectionID=200 dialAddress="10.1.0.6:19100"

and they repeat on and on and on. I checked the only network security group there (it was created automatically in the node pool's resource group) and it does not seem to block anything: https://imgur.com/a/oKJ3wVf
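To gauge how noisy these really are, the repeated "error dialing backend" lines can be tallied per dial address with a short script (a sketch; `tally_dial_errors` is a hypothetical helper, fed from `kubectl logs` output):

```python
import re
from collections import Counter

# Matches the "error dialing backend" lines emitted by konnectivity-agent, e.g.:
#   ... "error dialing backend" error="dial tcp 10.1.0.5:19100: ..." dialAddress="10.1.0.5:19100"
DIAL_ERR = re.compile(r'"error dialing backend".*dialAddress="(?P<addr>[^"]+)"')

def tally_dial_errors(lines):
    """Count 'error dialing backend' occurrences per dialAddress."""
    counts = Counter()
    for line in lines:
        m = DIAL_ERR.search(line)
        if m:
            counts[m.group("addr")] += 1
    return counts
```

For example, piping `kubectl logs -n kube-system -l app=konnectivity-agent --tail=-1` into this gives a quick per-address count (the label selector is an assumption; adjust it to however the pods are labelled in your cluster).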

What feels weird to me is that everything seems to be working fine. I can do all the konnectivity-dependent things (logs, exec) without any problems.

Any ideas?

sharonrosha commented 1 month ago

same problem here

bvrxs2yyw9fq49n5 commented 1 month ago

I found this post https://github.com/kubernetes-sigs/apiserver-network-proxy/issues/588#issuecomment-2000453097 which suggests, from what I understand, that these ports are used by the Node Problem Detector and the Prometheus node-exporter if you deploy the Prometheus and Container insights addons via AKS. I have not confirmed it yet, but I might try installing the aforementioned addons to check.
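If it is one of those exporters, a quick sanity check is whether anything on the node actually answers on that port before and after installing the addons. A minimal TCP probe, run from any pod that can reach the node IPs (`probe` is a hypothetical helper, not part of konnectivity):

```python
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers both "connection refused" and timeouts
        return False
```

`probe("10.1.0.5", 19100)` should flip from False to True once a node-exporter or Node Problem Detector daemonset starts listening on that port.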

sharonrosha commented 1 month ago

check this out https://learn.microsoft.com/en-us/azure/architecture/operator-guides/aks/aks-triage-node-health "When an AKS cluster is set up with an API server virtual network integration and either an Azure container networking interface (CNI) or an Azure CNI with dynamic pod IP assignment, there's no need to deploy Konnectivity agents. The integrated API server pods can establish direct communication with the cluster worker nodes via private networking."

microsoft-github-policy-service[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.

microsoft-github-policy-service[bot] commented 2 weeks ago

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. bvrxs2yyw9fq49n5, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question/issue or suggestion.