coreos / container-linux-update-operator

A Kubernetes operator to manage updates of Container Linux by CoreOS

agent: no route to host #162

Closed: khrisrichardson closed this issue 6 years ago

khrisrichardson commented 6 years ago

I observed the following in a number of long-running container-linux-update-agent:v0.4.1 pods.

I0109 18:31:41.765669       1 agent.go:116] Waiting for ok-to-reboot from controller...
E0109 18:31:41.765651       1 agent.go:210] Failed to set annotation "container-linux-update.v1.coreos.com/status": unable to update node "host1.cluster.local": failed to get node "host1.cluster.local": Get https://10.0.0.1:443/api/v1/nodes/host1.cluster.local: dial tcp 10.66.168.1:443: getsockopt: no route to host
W0109 18:31:44.837670       1 agent.go:122] error waiting for an ok-to-reboot: failed to get self node ("host1.cluster.local"): Get https://10.0.0.1:443/api/v1/nodes/host1.cluster.local: dial tcp 10.0.0.1:443: getsockopt: no route to host

Killing the pods and letting replacements spawn restored communication with the Kubernetes service, and the nodes were then able to update. Perhaps some sort of liveness probe that checks the health of the connection to the Kubernetes service would be advantageous.
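To sketch what I mean, assuming the agent exposed a /healthz endpoint backed by client-go that a kubelet liveness probe could hit: the handler path, port, and NODE_NAME env var below are hypothetical, and nothing like this exists in the agent today.

```go
// Sketch only: a health endpoint that fails when the apiserver is unreachable,
// using the same kind of call the agent needs to set its status annotation.
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	nodeName := os.Getenv("NODE_NAME") // assumed to be injected via the downward API

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
		defer cancel()
		// Fail the probe if we cannot read our own Node object from the apiserver.
		if _, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{}); err != nil {
			http.Error(w, fmt.Sprintf("apiserver unreachable: %v", err), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A liveness probe pointed at that endpoint would have restarted the stuck pod automatically, which is all I ended up doing by hand.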

Thanks

dghubble commented 6 years ago

I'd expect those log lines to appear only transiently if the apiserver was briefly unavailable while the agent was setting the node annotation. Did you have to restart the agent in order for it to proceed?

khrisrichardson commented 6 years ago

Hi @dghubble. I expected the same thing, but when I tried connecting to the Kubernetes service socket from another pod on the same node, I did not see similar issues.
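The check I ran was ad hoc, but it was essentially equivalent to the following sketch run from another pod on the same node (illustrative only), and it succeeded while the agent kept logging "no route to host":

```go
// Rough connectivity check: dial the in-cluster Kubernetes service address,
// which is exposed to every pod via standard env vars, and report the result.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	host := os.Getenv("KUBERNETES_SERVICE_HOST")
	port := os.Getenv("KUBERNETES_SERVICE_PORT")
	addr := net.JoinHostPort(host, port)

	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		fmt.Printf("cannot reach %s: %v\n", addr, err) // e.g. "no route to host"
		os.Exit(1)
	}
	conn.Close()
	fmt.Printf("reached %s\n", addr)
}
```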

Another factor was the unexpectedly old version of Container Linux (1465.X.X) on the node in question and the number of days the node had been alive (20+). I have since updated the autoscaling groups of all our clusters to reference the latest Container Linux AMI, so I'm having a little self-doubt.

Since I have updated all the nodes in our fleet and don't feel I collected sufficient evidence to make my case (even though the pod in question did seem to be in a degraded state), maybe we ought to close this until I can reproduce the issue and gather ample supporting evidence that a liveness/readiness probe is in order.

Although there is the fact that killing the supposedly degraded pod addressed the issue...

dghubble commented 6 years ago

Ok. Yes, if you find a pod ends up stuck in this state even after the apiserver is available again, and requires a restart, please do open a new issue with any info you can. At the moment, without more info, I'm content to close this as well.