Praqma / LearnKubernetes

Notes and resources collected together to help learn Kubernetes. This will eventually become a tutorial and later a blog post for praqma website (hopefully!)

Nodes stuck in Pending state when created #2

Closed by KamranAzeem 8 years ago

KamranAzeem commented 8 years ago

I saw pods stuck in the Pending state after they were created. kubectl reported no events.

[fedora@ip-172-31-39-228 ~]$ kubectl get pods
NAME      READY     STATUS    RESTARTS   AGE
www       0/1       Pending   0          30m
[fedora@ip-172-31-39-228 ~]$

kubectl also times out when trying to delete a deployment, reporting that it timed out waiting for the condition.

[fedora@ip-172-31-39-228 ~]$ kubectl delete deployment nginx --grace-period=3
error: timed out waiting for the condition
[fedora@ip-172-31-39-228 ~]$
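
When many pods are involved, it can help to pull the Pending ones out of the `kubectl get pods` table before inspecting them. The following is only a minimal sketch: the function name `pending_pods` is made up for illustration, and it assumes the default column layout shown above (NAME first, STATUS third).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of pods whose STATUS is Pending.
# Reads the default `kubectl get pods` table on stdin.
pending_pods() {
  # NR > 1 skips the header row; column 3 is STATUS in the default table.
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}
```

It would be used as `kubectl get pods | pending_pods`. Each name it prints can then be examined with `kubectl describe pod <name>`, and `kubectl get events` lists recent cluster events, if any were recorded.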
KamranAzeem commented 8 years ago

It turned out that SELinux was enabled on all of the nodes (including the master), which most probably prevented the local registry from running properly on the master.

The Docker container for the registry was not running on the master node. This was problem #1.

The second problem was that kube-controller-manager was complaining that it could not open a TCP connection to the apiserver at the IP address configured for the master node.

-bash-4.3# service kube-controller-manager status -l
Redirecting to /bin/systemctl status  -l kube-controller-manager.service
● kube-controller-manager.service - Kubernetes Controller Manager
   Loaded: loaded (/usr/lib/systemd/system/kube-controller-manager.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-06-13 10:14:42 UTC; 36s ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 4796 (kube-controller)
   Memory: 4.9M
      CPU: 64ms
   CGroup: /system.slice/kube-controller-manager.service
           └─4796 /usr/bin/kube-controller-manager --logtostderr=true --v=0 --master=http://171.31.39.228:8080

Jun 13 10:14:42 ip-172-31-39-228.ap-southeast-2.compute.internal systemd[1]: kube-controller-manager.service: Failed with result 'exit-code'.
Jun 13 10:14:42 ip-172-31-39-228.ap-southeast-2.compute.internal systemd[1]: Started Kubernetes Controller Manager.
Jun 13 10:14:42 ip-172-31-39-228.ap-southeast-2.compute.internal systemd[1]: Starting Kubernetes Controller Manager...
Jun 13 10:14:42 ip-172-31-39-228.ap-southeast-2.compute.internal kube-controller-manager[4796]: I0613 10:14:42.258087    4796 plugins.go:71] No cloud provider specified.
Jun 13 10:14:42 ip-172-31-39-228.ap-southeast-2.compute.internal kube-controller-manager[4796]: I0613 10:14:42.258380    4796 nodecontroller.go:143] Sending events to api server.
Jun 13 10:14:42 ip-172-31-39-228.ap-southeast-2.compute.internal kube-controller-manager[4796]: E0613 10:14:42.258680    4796 controllermanager.go:216] Failed to start service controller: ServiceController should not be run without a cloudprovider.
Jun 13 10:14:42 ip-172-31-39-228.ap-southeast-2.compute.internal kube-controller-manager[4796]: I0613 10:14:42.258838    4796 controllermanager.go:229] allocate-node-cidrs set to false, node controller not creating routes
Jun 13 10:14:42 ip-172-31-39-228.ap-southeast-2.compute.internal kube-controller-manager[4796]: I0613 10:14:42.259526    4796 replication_controller.go:208] Starting RC Manager
Jun 13 10:15:12 ip-172-31-39-228.ap-southeast-2.compute.internal kube-controller-manager[4796]: E0613 10:15:12.259398    4796 controllermanager.go:259] Failed to get api versions from server: Get http://171.31.39.228:8080/api: dial tcp 171.31.39.228:8080: i/o timeout
Jun 13 10:15:12 ip-172-31-39-228.ap-southeast-2.compute.internal kube-controller-manager[4796]: E0613 10:15:12.260945    4796 nodecontroller.go:229] Error monitoring node status: Get http://171.31.39.228:8080/api/v1/nodes: dial tcp 171.31.39.228:8080: i/o timeout
-bash-4.3#

When I tried to curl that IP address on the master, the request hung. When I replaced the IP address with localhost, it worked:

-bash-4.3# curl http://171.31.39.228:8080/api/v1/nodes
^C

-bash-4.3# curl http://localhost:8080/api/v1/nodes
{
  "kind": "NodeList",
  "apiVersion": "v1",
  "metadata": {
    "selfLink": "/api/v1/nodes",
    "resourceVersion": "53008"
  },
  "items": [
    {
      "metadata": {
        "name": "172.31.39.229",
. . . 
[output snipped ]
. . .  
-bash-4.3#

Yet kube-apiserver was configured to listen on all interfaces:

-bash-4.3# cat kubernetes/apiserver 
. . . 
KUBE_API_ADDRESS="--insecure-bind-address=0.0.0.0"
. . . 

This is also evident in the netstat output on the master:

-bash-4.3# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:2380          0.0.0.0:*               LISTEN      951/etcd            
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1019/sshd           
tcp        0      0 127.0.0.1:7001          0.0.0.0:*               LISTEN      951/etcd            
tcp6       0      0 :::5000                 :::*                    LISTEN      3887/docker-proxy   
tcp6       0      0 :::6443                 :::*                    LISTEN      4768/kube-apiserver 
tcp6       0      0 :::2379                 :::*                    LISTEN      951/etcd            
tcp6       0      0 :::10251                :::*                    LISTEN      844/kube-scheduler  
tcp6       0      0 :::10252                :::*                    LISTEN      4943/kube-controlle 
tcp6       0      0 :::8080                 :::*                    LISTEN      4768/kube-apiserver 
tcp6       0      0 :::22                   :::*                    LISTEN      1019/sshd           
-bash-4.3# 
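
To read the bind address for one port out of a long listing like this, a small filter is enough. This is a sketch under assumptions: `listen_addrs` is an illustrative name, and it expects the `netstat -ntlp` column layout shown above (local address in column 4, state in column 6).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the local address(es) listening on a given port.
# Reads `netstat -ntlp` output on stdin.
listen_addrs() {
  awk -v port="$1" '
    $6 == "LISTEN" {
      # The port is whatever follows the last ":" in the local address,
      # which also handles IPv6 wildcards such as ":::8080".
      n = split($4, parts, ":")
      if (parts[n] == port) print $4
    }'
}
```

It would be called as `netstat -ntlp | listen_addrs 8080`. A wildcard result such as `:::8080` or `0.0.0.0:8080` means the service accepts connections on every interface, so a connection failure is more likely caused by the address the client is dialing than by the server's bind address.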

Then I looked at the IP address once again and realized that there was a typo in /etc/kubernetes/config on the master node. The IP should have been 172.31.39.228, not 171.31.39.228, as the address on eth0 shows:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc fq_codel state UP group default qlen 1000
    link/ether 0a:5c:30:60:10:d3 brd ff:ff:ff:ff:ff:ff
    inet 172.31.39.228/20 brd 172.31.47.255 scope global dynamic eth0
       valid_lft 3254sec preferred_lft 3254sec
    inet6 fe80::85c:30ff:fe60:10d3/64 scope link 
       valid_lft forever preferred_lft forever
-bash-4.3# 
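
A mismatch like this is easy to miss by eye, so one option is to check mechanically whether the host in a configured URL is actually assigned to a local interface. The sketch below is illustrative only: `host_of_url` and `is_listed` are made-up names, and `is_listed` assumes the one-line-per-address format printed by `ip -o -4 addr show`, where the fourth field is `address/prefix`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the host part of a URL such as
# http://171.31.39.228:8080/api
host_of_url() {
  local rest=${1#*://}   # drop the scheme
  rest=${rest%%/*}       # drop the path
  echo "${rest%%:*}"     # drop the port
}

# Hypothetical helper: succeed if the given IPv4 address appears in
# `ip -o -4 addr show` output supplied on stdin.
is_listed() {
  awk -v ip="$1" '
    { split($4, a, "/"); if (a[1] == ip) found = 1 }
    END { exit !found }'
}
```

For example, `ip -o -4 addr show | is_listed "$(host_of_url http://171.31.39.228:8080)"` would return a non-zero status on this master, because only 172.31.39.228 is assigned to eth0.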

The incorrect IP turned out to be in several config files, so we fixed it in all of them and rebooted the master node.
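
When the same wrong address is spread over several files, a search-and-replace avoids missing one. The following is a rough sketch, not what was actually run here: `fix_ip` is a hypothetical function name, and the directory to pass in is whichever one holds the affected config files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace a mistyped IP in every file under a directory.
# Usage: fix_ip DIR WRONG_IP RIGHT_IP
fix_ip() {
  local dir=$1 wrong=$2 right=$3
  # Escape the dots so each "." matches a literal dot, not any character.
  local wrong_re=${wrong//./\\.}
  # grep -r recurses into DIR, -l prints only the names of matching files.
  grep -rl -- "$wrong_re" "$dir" | while IFS= read -r f; do
    # Keep a .bak copy of each file before editing it in place.
    sed -i.bak "s/$wrong_re/$right/g" "$f"
    echo "fixed: $f"
  done
}
```

A call might look like `fix_ip /etc/kubernetes 171.31.39.228 172.31.39.228`. The `.bak` copies still contain the old address, so they should be reviewed and removed afterwards, and the affected services need a restart (or the node a reboot) to pick up the change.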

And then it worked!

[fedora@ip-172-31-39-228 ~]$ kubectl get pods
NAME      READY     STATUS    RESTARTS   AGE
www       1/1       Running   0          1m
[fedora@ip-172-31-39-228 ~]$ 

(Thank you Rafiqul Islam)