coreos / tectonic-installer

Install a Kubernetes cluster the CoreOS Tectonic Way: HA, self-hosted, RBAC, etcd Operator, and more
Apache License 2.0

Kube-flannel may configure Flannel with a stale interface IP #2439

Open aknuds1 opened 6 years ago

aknuds1 commented 6 years ago

Versions

What happened?

Flannel pods fail on account of not finding their corresponding network interfaces. It seems to me this happens because kube-flannel configures Flannel with an interface IP that can be stale.

From debugging I found that failures were caused by Flannel being configured with an interface IP corresponding to nodes from previous clusters of mine, so I'm gonna guess that kube-flannel's $(POD_IP) variable is determined from DNS lookups against nodes. If the DNS lookups return cached/stale values, Flannel will be configured wrongly as seems to happen in my case.

What I'm wondering is why it's necessary to configure Flannel's --iface option explicitly, instead of letting it determine it automatically? I don't know how Flannel's automatic interface detection works, but hopefully it's less fragile than the current solution, which seems to break when DNS lookups return stale IPs.
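To check on an affected node whether the IP handed to Flannel is actually present, something like the following Go sketch could help. It is not Flannel's own code: `findInterfaceByIP` is a hypothetical helper, and reading the address from a `POD_IP` environment variable is an assumption based on the description above. It only illustrates the lookup that fails when the configured IP belongs to a node from an earlier cluster.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// findInterfaceByIP (hypothetical helper) returns the name of the local
// network interface that carries the given IP address, or an error if
// the address is invalid or no interface has it.
func findInterfaceByIP(ipStr string) (string, error) {
	ip := net.ParseIP(ipStr)
	if ip == nil {
		return "", fmt.Errorf("invalid IP address %q", ipStr)
	}
	ifaces, err := net.Interfaces()
	if err != nil {
		return "", err
	}
	for _, iface := range ifaces {
		addrs, err := iface.Addrs()
		if err != nil {
			continue // skip interfaces whose addresses cannot be read
		}
		for _, addr := range addrs {
			if ipNet, ok := addr.(*net.IPNet); ok && ipNet.IP.Equal(ip) {
				return iface.Name, nil
			}
		}
	}
	return "", fmt.Errorf("no local interface has IP %s", ipStr)
}

func main() {
	// Assumption: the IP under test is exported as POD_IP.
	ip := os.Getenv("POD_IP")
	if ip == "" {
		fmt.Println("POD_IP is not set")
		return
	}
	name, err := findInterfaceByIP(ip)
	if err != nil {
		fmt.Println("stale or wrong IP:", err)
		return
	}
	fmt.Println("IP found on interface", name)
}
```

If the lookup reports that no local interface has the IP, the value passed to `--iface` does not belong to this node, which matches the failure described here.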

enxebre commented 6 years ago

Hey @aknuds1, making it explicit ensures consistency and robustness across different platforms and environments. The POD_IP value comes directly from the IP assigned to the pod at creation time by Kubernetes (https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/), so I can't see how it can be stale.

aknuds1 commented 6 years ago

@enxebre Consider what I said initially: "I'm gonna guess that kube-flannel's $(POD_IP) variable is determined from DNS lookups against nodes. If the DNS lookups return cached/stale values, Flannel will be configured wrongly as seems to happen in my case".

So, it doesn't seem to work as well in practice as you seem to think, unfortunately. I have seen myself that $POD_IP is stale, so there's no question about that. I'm not sure how it happens, but like I said my theory is that it's because of DNS lookups returning stale values.

There is also the question of whether using Flannel's --iface flag works better than Flannel's own automatic detection, which is what I'm asking here. I see that the current method breaks, so we should investigate whether letting Flannel detect the interface automatically works better.