cablespaghetti / kubeadm-aws

Really cheap Kubernetes cluster on AWS with kubeadm

flannel issue when restoring #13

Closed stefansundin closed 5 years ago

stefansundin commented 5 years ago

Hi!

I'm testing out this project now, and it's great! But I have an issue when trying to restore my etcd snapshot after manually killing the instance. I am testing without worker nodes for now.

kubeadm runs successfully, but flannel is crash looping and my other pods are not coming up.

I have modified things a bit, but I think the only significant difference is that I am running Kubernetes 1.14.0. I made sure that my new instance has the same private IP as the old instance.

I have a feeling that this is something iptables-related, but I don't know how to figure this one out. Has anyone else seen this?

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                     READY   STATUS              RESTARTS   AGE
kube-system   coredns-fb8b8dccf-4ls8m                  0/1     ContainerCreating   0          63m
kube-system   coredns-fb8b8dccf-glwwj                  0/1     ContainerCreating   0          63m
kube-system   etcd-ip-172-17-32-4                      1/1     Running             0          62m
kube-system   kube-apiserver-ip-172-17-32-4            1/1     Running             0          62m
kube-system   kube-controller-manager-ip-172-17-32-4   1/1     Running             0          62m
kube-system   kube-flannel-ds-amd64-vchsr              1/1     Running             6          51m
kube-system   kube-proxy-jb2lz                         1/1     Running             0          63m
kube-system   kube-scheduler-ip-172-17-32-4            1/1     Running             0          62m
rssbox        redis-84c4b5d656-v52nj                   0/1     ContainerCreating   0          33m
rssbox        rssbox-587f987bb8-nv9jb                  0/1     ContainerCreating   0          33m

$ kubectl logs kube-flannel-ds-amd64-vchsr --namespace=kube-system
I0402 04:02:34.129476       1 main.go:475] Determining IP address of default interface
I0402 04:02:34.129712       1 main.go:488] Using interface with name ens5 and address 172.17.32.4
I0402 04:02:34.129732       1 main.go:505] Defaulting external address to interface address (172.17.32.4)

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                     READY   STATUS              RESTARTS   AGE
kube-system   coredns-fb8b8dccf-4ls8m                  0/1     ContainerCreating   0          63m
kube-system   coredns-fb8b8dccf-glwwj                  0/1     ContainerCreating   0          63m
kube-system   etcd-ip-172-17-32-4                      1/1     Running             0          62m
kube-system   kube-apiserver-ip-172-17-32-4            1/1     Running             0          62m
kube-system   kube-controller-manager-ip-172-17-32-4   1/1     Running             0          63m
kube-system   kube-flannel-ds-amd64-vchsr              1/1     Running             6          51m
kube-system   kube-proxy-jb2lz                         1/1     Running             0          63m
kube-system   kube-scheduler-ip-172-17-32-4            1/1     Running             0          62m
rssbox        redis-84c4b5d656-v52nj                   0/1     ContainerCreating   0          33m
rssbox        rssbox-587f987bb8-nv9jb                  0/1     ContainerCreating   0          33m

$ kubectl logs kube-flannel-ds-amd64-vchsr --namespace=kube-system
I0402 04:02:34.129476       1 main.go:475] Determining IP address of default interface
I0402 04:02:34.129712       1 main.go:488] Using interface with name ens5 and address 172.17.32.4
I0402 04:02:34.129732       1 main.go:505] Defaulting external address to interface address (172.17.32.4)
E0402 04:03:04.131619       1 main.go:232] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-amd64-vchsr': Get https://10.96.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-amd64-vchsr: dial tcp 10.96.0.1:443: i/o timeout

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                     READY   STATUS              RESTARTS   AGE
kube-system   coredns-fb8b8dccf-4ls8m                  0/1     ContainerCreating   0          63m
kube-system   coredns-fb8b8dccf-glwwj                  0/1     ContainerCreating   0          63m
kube-system   etcd-ip-172-17-32-4                      1/1     Running             0          62m
kube-system   kube-apiserver-ip-172-17-32-4            1/1     Running             0          62m
kube-system   kube-controller-manager-ip-172-17-32-4   1/1     Running             0          63m
kube-system   kube-flannel-ds-amd64-vchsr              0/1     Error               6          52m
kube-system   kube-proxy-jb2lz                         1/1     Running             0          63m
kube-system   kube-scheduler-ip-172-17-32-4            1/1     Running             0          62m
rssbox        redis-84c4b5d656-v52nj                   0/1     ContainerCreating   0          33m
rssbox        rssbox-587f987bb8-nv9jb                  0/1     ContainerCreating   0          33m

And then after a moment, the state changes from Error to CrashLoopBackOff.
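
A quick way to tell whether this is specific to the flannel pod or broken node-wide is to hit the apiserver from the master itself, once on the node IP and once on the service VIP (ports are the kubeadm defaults; even a 401/403 from the second request would prove connectivity, while a hang points at the service routing):

$ curl -k https://172.17.32.4:6443/healthz        # direct to the apiserver
$ curl -k --max-time 5 https://10.96.0.1/healthz  # via the kubernetes Service ClusterIP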

stefansundin commented 5 years ago

Good news. I think I figured it out.

I tried a lot of things. I tested Kubernetes 1.13, 1.12, and different versions of Flannel, etc. I even tried the Amazon CNI. It all worked fine until I tried to terminate my instance and restore from a backup.

I eventually found out that Docker uses 172.17.0.0/16 for its internal network, which coincidentally overlapped with my VPC CIDR range. When I figured this out I immediately thought this was the problem, so I recreated my VPC with a 10.0.0.0/16 range. However, I still had the same issue. Bummer.
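
In case anyone wants to rule out the same overlap: Docker's default bridge (docker0) sits on 172.17.0.0/16, and it can be moved with the bip setting in /etc/docker/daemon.json (the subnet below is just an example):

$ ip addr show docker0
$ cat /etc/docker/daemon.json
{
  "bip": "192.168.255.1/24"
}
$ sudo systemctl restart docker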

Eventually I figured out that kube-proxy is responsible for setting up the iptables rules, and it was having trouble communicating with the apiserver. It looked like the apiserver was not recognizing the token.
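
For anyone wanting to see the same thing, the two places to look are the NAT rules kube-proxy should have written for the service VIP and the kube-proxy logs themselves (the label selector is the default one kubeadm applies):

$ sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.1
$ kubectl logs -l k8s-app=kube-proxy --namespace=kube-system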

There are a lot of keys and certs in /etc/kubernetes/pki/, so I decided to back up all of them, not only ca.crt and ca.key. And incredibly enough, this seems to have worked!
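
For context on why backing up only ca.crt and ca.key is not enough: service account tokens are signed with sa.key from that same directory, so a restored control plane that regenerates sa.key can no longer verify the tokens already stored in etcd, including the ones kube-proxy and flannel use. That would explain the symptoms above. A minimal sketch of the backup and restore (paths are the kubeadm defaults):

$ sudo tar czf pki-backup.tar.gz -C /etc/kubernetes pki
# ...and on the replacement instance, before running kubeadm:
$ sudo mkdir -p /etc/kubernetes
$ sudo tar xzf pki-backup.tar.gz -C /etc/kubernetes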

My question now is: how did this ever work? Did you test it thoroughly? I can't see how it ever worked.