att-comdev / halcyon-vagrant-kubernetes

Vagrant deployment mechanism for halcyon-kubernetes.
Apache License 2.0

FailedSync in networking layer #50

Closed sdake closed 7 years ago

sdake commented 7 years ago

Cloned the repo at commit d6a500265e5d43e34f40e14c210584052610afca.

The submodule was not up to date, so I updated it to obtain the latest version of halcyon-kubernetes by running: git submodule foreach git pull origin master
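(Editor's note, a hedged aside: `git submodule foreach git pull origin master` advances each submodule to the tip of upstream master, which can differ from the commit the superproject actually pins. A sketch of both options, assuming the intent is to match what the repo has tested:)

```shell
# 1) Check out the exact submodule commit pinned by the superproject
#    (usually the safest choice for reproducing what CI tested):
git submodule update --init --recursive

# 2) Advance the submodule to upstream master's tip (what was run above);
#    note this can move past the pinned commit:
git submodule foreach git pull origin master
```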

After following instructions here: http://docs.openstack.org/developer/kolla-kubernetes/development-environment.html

Ran into the following:

[sdake@minime-03 halcyon-vagrant-kubernetes]$ kubectl describe pods
Name:           33400817-9413-4636-82f3-34c35d054a95
Namespace:      default
Node:           172.16.35.14/172.16.35.14
Start Time:     Thu, 26 Jan 2017 23:49:17 -0500
Labels:
Status:         Pending
IP:
Controllers:
Containers:
  33400817-9413-4636-82f3-34c35d054a95:
    Container ID:
    Image:          busybox
    Image ID:
    Port:
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-26m8j (ro)
    Environment Variables:
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-26m8j:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-26m8j
QoS Class:      BestEffort
Tolerations:
Events:
  FirstSeen  LastSeen  Count  From                    SubObjectPath  Type     Reason      Message
  11m        11m       1      {default-scheduler }                   Normal   Scheduled   Successfully assigned 33400817-9413-4636-82f3-34c35d054a95 to 172.16.35.14
  10m        0s        636    {kubelet 172.16.35.14}                 Warning  FailedSync  Error syncing pod, skipping: failed to "SetupNetwork" for "33400817-9413-4636-82f3-34c35d054a95_default" with SetupNetworkError: "Failed to setup network for pod \"33400817-9413-4636-82f3-34c35d054a95_default(f5608278-e44b-11e6-9cf2-5254002f7f54)\" using network plugins \"cni\": cni config unintialized; Skipping pod"
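(Editor's note, for anyone hitting this later: the "cni config unintialized" message is typically the kubelet reporting that no CNI network configuration exists yet on the node. A minimal check, assuming the default kubelet CNI paths; override flags such as --cni-conf-dir would change them:)

```shell
# On the affected node, check that the network plugin (calico/flannel/weave)
# has actually written its CNI config and that the plugin binaries exist:
ls /etc/cni/net.d/   # should contain a *.conf or *.conflist file
ls /opt/cni/bin/     # should contain the CNI plugin binaries
```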

sdake commented 7 years ago

No network driver works (calico, flannel, and weave were all tried).

v1k0d3n commented 7 years ago

i'll update now. @nadeemanwar this needs to be P1: jenkins should auto-update the submodule whenever a new commit lands on halcyon-kubernetes. i'd like to get this working ASAP, especially since kolla is releasing instructions on their repo soon (monday, as @sdake mentioned in IRC).

intlabs commented 7 years ago

@sdake Unfortunately I cannot reproduce :( could you paste the output from ansible, and run (set -x; kubectl version; ansible --version; vagrant --version) so I can try to get to the bottom of what's going on?

v1k0d3n commented 7 years ago

well, @intlabs ...it is true that the submodule needs to be updated. let me update it manually now.

but then again...this can always be updated manually :)

sdake commented 7 years ago

@intlabs will do - note that another person, bmace, has the exact same issue, so it isn't a one-time thing. I also have a tarball I can give you of an environment that works (it is a few weeks old, iirc).

v1k0d3n commented 7 years ago

submodule updated, in case that was the issue @sdake. can you try to clone or update the submodule again?

sdake commented 7 years ago

@v1k0d3n Thanks for updating the submodule. You will note in the original report that I had updated the submodule myself, but I may have done it incorrectly. I will test this evening with the revised documentation (there was a documentation bug in kolla-kubernetes - a typo) as well as the new submodule. Note that bmace had not copy-and-pasted from the documentation, so we don't think it was the typo.

v1k0d3n commented 7 years ago

ah, i see now...you pulled from master; cool. could you also report what hypervisor you're using? we're going to start using issue templates on github soon, but until then we need to ask for versions, hypervisor info, os, etc. getting there.

sdake commented 7 years ago

Sure, happy to help.

Host OS: CentOS 7.2
Hypervisor: libvirt (latest yum update version - can edit with version when I have access to my lab)
Virtual machine OS: CentOS (not sure which version)
Memory: 32 GB
CPU: Intel Core i7 with hyperthreading

Note I may not actually get to fulfilling all of the requests until the 28th of January.

intlabs commented 7 years ago

@sdake if you could run the following from the root of your halcyon-vagrant-kubernetes dir and paste the output, it would be awesome:

(set -x; kubectl version; ansible --version; vagrant --version; git log -1 --format="%H"; cd halcyon-kubernetes; git log -1 --format="%H")

It would also be great if you could post your full config.rb; from the snippets I looked through on IRC, I'm wondering if somehow it's still using my old fork, as that would explain the controller-manager image version it seems to be trying to use.
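(Editor's note: one way to rule out the old-fork theory, a sketch rather than part of the thread's instructions, is to print the submodule URL actually recorded in the superproject and the remote of the checked-out submodule:)

```shell
# URL recorded for the submodule in the superproject's .gitmodules:
git config --file .gitmodules --get-regexp url

# Remote and HEAD commit of the submodule checkout itself:
(cd halcyon-kubernetes && git remote -v && git log -1 --format="%H")
```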

v1k0d3n commented 7 years ago

@sdake what's the status of this issue? still outstanding?

sdake commented 7 years ago

@v1k0d3n I seem to have fallen off a cliff - in fact my wife is traveling, so I have no backup for my children. They have had a wacky (abnormal) schedule the last 7 days; she is back late tonight, and hopefully I will then be able to test things.

I did hear from Borne Mace that he is suffering a different issue than the one I reported, also related to networking. I had hoped he would report it in the halcyon-vagrant-kubernetes issue tracker; I'll ping him on IRC and ask him to provide his feedback (which may be a different issue). His current reported state is that he was able to launch busybox, so this issue may be invalid and an artifact of the incorrect kolla development-environment documentation. Since I opened this issue, I want to validate that personally before closing it out.

intlabs commented 7 years ago

@v1k0d3n This has been reported as resolved in conversations on IRC with @sdake. Unfortunately, we never quite got to the exact root cause, so I think this may have been a transient issue with a distro package. I've also had confirmation that networking is working for Borne Mace, so closing for now.

vhosakot commented 7 years ago

@sdake, all, I saw the exact same error (below) yesterday.

Warning FailedSync  Error syncing pod, skipping: failed to "SetupNetwork" for "33400817-9413-4636-82f3-34c35d054a95_default" with SetupNetworkError: "Failed to setup network for pod "33400817-9413-4636-82f3-34c35d054a95_default(f5608278-e44b-11e6-9cf2-5254002f7f54)" using network plugins "cni": cni config unintialized; Skipping pod"

After waiting 30-40 minutes, I observed that k8s self-healed, and I was able to run the busybox pod successfully without this error.

Here is my analysis: I think we need to give k8s some time to stabilize all of its system objects (pods, deployments, and replicasets).

I think k8s is stabilized when we see:

1) Running under the STATUS column for the k8s pods. See below.

# kubectl --namespace=kube-system get pods
NAME                                    READY     STATUS    RESTARTS   AGE
dummy-2088944543-vjhmg                  1/1       Running   0          1d
etcd-172.16.35.11                       1/1       Running   0          1d
kube-apiserver-172.16.35.11             1/1       Running   4          1d
kube-controller-manager-172.16.35.11    1/1       Running   0          1d
kube-discovery-1769846148-vmg4d         1/1       Running   0          1d
kube-dns-2924299975-wpbgf               4/4       Running   0          1d
kube-proxy-7hn7j                        1/1       Running   1          1d
kube-proxy-gxf9w                        1/1       Running   0          1d
kube-proxy-vcckr                        1/1       Running   0          1d
kube-proxy-w6jhx                        1/1       Running   0          1d
kube-scheduler-172.16.35.11             1/1       Running   1          1d
kubernetes-dashboard-3203831700-1fg2q   1/1       Running   0          1d
tiller-deploy-2885612843-9jjhv          1/1       Running   0          1d
weave-net-4rxlv                         2/2       Running   0          1d
weave-net-fmwm4                         2/2       Running   1          1d
weave-net-hjw8t                         2/2       Running   1          1d
weave-net-p6fz6                         2/2       Running   125        1d

I saw the error when some of the pods above had ContainerCreating under the STATUS column. I had to wait a few minutes until all the pods showed Running under the STATUS column.

2) 1 under the AVAILABLE column for all the k8s deployments. See below.

# kubectl --namespace=kube-system get deployments
NAME                   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-discovery         1         1         1            1           1d
kube-dns               1         1         1            1           1d
kubernetes-dashboard   1         1         1            1           1d
tiller-deploy          1         1         1            1           1d

3) 1 under the READY column for all the replicasets. See below.

# kubectl --namespace=kube-system get replicasets
NAME                              DESIRED   CURRENT   READY     AGE
dummy-2088944543                  1         1         1         1d
kube-discovery-1769846148         1         1         1         1d
kube-dns-2924299975               1         1         1         1d
kubernetes-dashboard-3203831700   1         1         1         1d
tiller-deploy-2885612843          1         1         1         1d
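(Editor's note: the manual checks above can be scripted. A minimal sketch that polls for check (1), assuming kubectl is already configured for the cluster; the sleep interval is arbitrary, and a real script should add a timeout:)

```shell
# Wait until every pod in kube-system reports Running before scheduling
# user pods; the STATUS value is the 3rd column of `get pods` output:
while kubectl --namespace=kube-system get pods --no-headers \
      | awk '{print $3}' | grep -qv '^Running$'; do
  echo "waiting for kube-system pods to stabilize..."
  sleep 15
done
```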