canonical-kubernetes fails to deploy with "Waiting for kube-system pods to start", dir storage selected #233

Closed: alphec closed this issue 5 years ago

alphec commented 5 years ago

I see the same symptoms with conjure-up as described in https://github.com/conjure-up/spells/issues/230. However, I made sure that the storage driver is set to dir, so the cause must be something else.

$ snap list
Name        Version              Rev   Tracking  Publisher   Notes
conjure-up  2.6.1-20181018.1610  1031  stable    canonical✓  classic
core        16-2.35.5            5742  stable    canonical✓  core
lxd         3.7                  9564  stable    canonical✓  -
$ juju --version
2.4.3-bionic-amd64
$ lxc --version
3.7
$ lxc storage list
+------------+-------------+--------+------------------------------------------------+---------+
|    NAME    | DESCRIPTION | DRIVER |                     SOURCE                     | USED BY |
+------------+-------------+--------+------------------------------------------------+---------+
| default    |             | dir    | /var/snap/lxd/common/lxd/storage-pools/default | 11      |
+------------+-------------+--------+------------------------------------------------+---------+
| juju-btrfs |             | btrfs  | /var/snap/lxd/common/lxd/disks/juju-btrfs.img  | 0       |
+------------+-------------+--------+------------------------------------------------+---------+
| juju-zfs   |             | zfs    | /var/snap/lxd/common/lxd/disks/juju-zfs.img    | 0       |
+------------+-------------+--------+------------------------------------------------+---------+
$ juju status
Model                         Controller                Cloud/Region         Version  SLA          Timestamp
conjure-canonical-kubern-b44  conjure-up-localhost-bb3  localhost/localhost  2.4.3    unsupported  08:10:08+01:00

App                    Version  Status   Scale  Charm                  Store       Rev  OS      Notes
easyrsa                3.0.1    active       1  easyrsa                jujucharms  117  ubuntu
etcd                   3.2.10   active       3  etcd                   jujucharms  209  ubuntu
flannel                0.10.0   active       4  flannel                jujucharms  146  ubuntu
kubeapi-load-balancer  1.14.0   active       1  kubeapi-load-balancer  jujucharms  162  ubuntu  exposed
kubernetes-master      1.12.2   waiting      2  kubernetes-master      jujucharms  219  ubuntu
kubernetes-worker      1.12.2   waiting      2  kubernetes-worker      jujucharms  239  ubuntu  exposed

Unit                      Workload  Agent  Machine  Public address  Ports           Message
easyrsa/0*                active    idle   0        10.254.248.156                  Certificate Authority connected.
etcd/0*                   active    idle   1        10.254.248.3    2379/tcp        Healthy with 3 known peers
etcd/1                    active    idle   2        10.254.248.114  2379/tcp        Healthy with 3 known peers
etcd/2                    active    idle   3        10.254.248.2    2379/tcp        Healthy with 3 known peers
kubeapi-load-balancer/0*  active    idle   4        10.254.248.160  443/tcp         Loadbalancer ready.
kubernetes-master/0       active    idle   5        10.254.248.108  6443/tcp        Kubernetes master running.
  flannel/0*              active    idle            10.254.248.108                  Flannel subnet 10.1.86.1/24
kubernetes-master/1*      waiting   idle   6        10.254.248.25   6443/tcp        Waiting for kube-system pods to start
  flannel/1               active    idle            10.254.248.25                   Flannel subnet 10.1.17.1/24
kubernetes-worker/0       waiting   idle   7        10.254.248.186  80/tcp,443/tcp  Waiting for kubelet to start.
  flannel/3               active    idle            10.254.248.186                  Flannel subnet 10.1.15.1/24
kubernetes-worker/1*      waiting   idle   8        10.254.248.142  80/tcp,443/tcp  Waiting for kubelet to start.
  flannel/2               active    idle            10.254.248.142                  Flannel subnet 10.1.90.1/24

Entity  Meter status  Message
model   amber         user verification pending

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.254.248.156  juju-cf90b1-0  bionic      Running
1        started  10.254.248.3    juju-cf90b1-1  bionic      Running
2        started  10.254.248.114  juju-cf90b1-2  bionic      Running
3        started  10.254.248.2    juju-cf90b1-3  bionic      Running
4        started  10.254.248.160  juju-cf90b1-4  bionic      Running
5        started  10.254.248.108  juju-cf90b1-5  bionic      Running
6        started  10.254.248.25   juju-cf90b1-6  bionic      Running
7        started  10.254.248.186  juju-cf90b1-7  bionic      Running
8        started  10.254.248.142  juju-cf90b1-8  bionic      Running

When I take a look at the kubernetes-worker/0 log on machine 7, I see:

...
2018-11-20 07:13:34 DEBUG update-status Error from server (NotFound): nodes "juju-cf90b1-7" not found
2018-11-20 07:13:34 INFO juju-log Failed to apply label juju-application=kubernetes-worker. Will retry.
2018-11-20 07:13:35 DEBUG update-status Error from server (NotFound): nodes "juju-cf90b1-7" not found
2018-11-20 07:13:35 INFO juju-log Failed to apply label juju-application=kubernetes-worker. Will retry.
2018-11-20 07:13:36 DEBUG update-status Error from server (NotFound): nodes "juju-cf90b1-7" not found
2018-11-20 07:13:36 INFO juju-log Failed to apply label juju-application=kubernetes-worker. Will retry.
2018-11-20 07:13:37 DEBUG update-status Error from server (NotFound): nodes "juju-cf90b1-7" not found
2018-11-20 07:13:37 INFO juju-log Failed to apply label juju-application=kubernetes-worker. Will retry.
2018-11-20 07:13:38 DEBUG update-status Error from server (NotFound): nodes "juju-cf90b1-7" not found
2018-11-20 07:13:38 INFO juju-log Failed to apply label juju-application=kubernetes-worker. Will retry.
2018-11-20 07:13:40 INFO juju-log Failed to apply label juju-application=kubernetes-worker. Will retry.
2018-11-20 07:13:40 INFO juju-log Invoking reactive handler: reactive/kubernetes_worker.py:947:enable_gpu
2018-11-20 07:13:40 INFO juju-log Enabling gpu mode
2018-11-20 07:13:40 DEBUG update-status NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
2018-11-20 07:13:40 DEBUG update-status
2018-11-20 07:13:40 INFO juju-log Unable to communicate with the NVIDIA driver.
2018-11-20 07:13:40 INFO juju-log CalledProcessError(9, ['nvidia-smi'])
2018-11-20 07:13:40 INFO juju-log Invoking reactive handler: reactive/kubernetes_worker.py:1010:notify_master_gpu_not_enabled
2018-11-20 07:13:40 INFO juju-log Setting gpu=False on kube-control relation
2018-11-20 07:13:40 INFO juju-log Invoking reactive handler: reactive/kubernetes_worker.py:1019:request_kubelet_and_proxy_credentials
2018-11-20 07:13:40 INFO juju-log Invoking reactive handler: reactive/kubernetes_worker.py:1031:catch_change_in_creds
2018-11-20 07:13:40 INFO juju-log Invoking reactive handler: reactive/kubernetes_worker.py:1071:fix_iptables_for_docker_1_13
2018-11-20 07:13:40 INFO juju-log Invoking reactive handler: reactive/kubernetes_worker.py:1185:clear_cloud_flags
2018-11-20 07:13:40 INFO juju-log Invoking reactive handler: reactive/docker.py:358:enable_grub_cgroups
2018-11-20 07:13:40 INFO juju-log Invoking reactive handler: reactive/docker.py:368:signal_workloads_start
2018-11-20 07:13:40 DEBUG update-status Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
...

I installed the Canonical Distribution of Kubernetes, so I'm wondering why an NVIDIA driver is being called. The NotFound errors for node juju-cf90b1-7 and the failure to connect to the Docker socket also look like problems to me.

Also on that machine:

$ sudo service docker status
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2018-11-19 23:07:29 UTC; 8h ago
     Docs: https://docs.docker.com
 Main PID: 23310 (code=killed, signal=TERM)

Nov 19 23:07:29 juju-cf90b1-7 systemd[1]: Starting Docker Application Container Engine...
Nov 19 23:07:29 juju-cf90b1-7 systemd[1]: Dependency failed for Docker Application Container Engine.
Nov 19 23:07:29 juju-cf90b1-7 systemd[1]: docker.service: Job docker.service/start failed with result 'dependency'.
Nov 19 23:07:29 juju-cf90b1-7 systemd[1]: Stopped Docker Application Container Engine.

Any help is appreciated!

Cynerva commented 5 years ago

It looks like it is detecting your GPU hardware and trying to install nvidia-docker and friends. You can override this by configuring the kubernetes-worker charm with docker_runtime=apt.

You can do this in the "configure applications" screen of conjure-up by going to the "Configure" section of kubernetes-worker, and going to "Show Advanced Configuration".
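
For a model that is already deployed, the same option can presumably be set from the command line with juju config. This is a minimal sketch rather than something confirmed in this thread: it assumes the juju client is pointed at the affected model, and whether the charm swaps the Docker packages on an existing worker after the change is an assumption, so a redeploy may still be needed. The variable names are only for illustration.

```shell
#!/bin/sh
# Sketch: set docker_runtime=apt on an existing kubernetes-worker application.
set -eu

app="kubernetes-worker"   # application name as shown in `juju status`
key="docker_runtime"      # charm option named in the comment above
value="apt"               # value suggested in the comment above

if command -v juju >/dev/null 2>&1; then
    # Apply the setting to the current model, then read it back to confirm.
    juju config "$app" "$key=$value"
    juju config "$app" "$key"
else
    # No juju client on this machine: just show the command that would run.
    echo "juju not found; would run: juju config $app $key=$value"
fi
```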

Alternatively, you can do this with the --bundle-add option. Create an overlay.yaml file with:

services:
  kubernetes-worker:
    options:
      docker_runtime: apt

Then run conjure-up with:

conjure-up canonical-kubernetes --bundle-add overlay.yaml
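
The two steps above can be combined into one small script, which may be handy when the deployment has to be repeated. This is only a sketch: the file name overlay.yaml is arbitrary, and the guard around conjure-up is an addition so the script does nothing harmful on a machine without it.

```shell
#!/bin/sh
# Sketch: write the overlay shown above, then pass it to conjure-up.
set -eu

# Any file name works as long as it matches the --bundle-add argument.
overlay="overlay.yaml"

# Quoted heredoc delimiter so the YAML is written verbatim.
cat > "$overlay" <<'EOF'
services:
  kubernetes-worker:
    options:
      docker_runtime: apt
EOF

# Only start the deployment if conjure-up is actually installed here.
if command -v conjure-up >/dev/null 2>&1; then
    conjure-up canonical-kubernetes --bundle-add "$overlay"
else
    echo "conjure-up not found; overlay written to $overlay"
fi
```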

alphec commented 5 years ago

Thanks, that fixed it. Still, I'm curious why there is an NVIDIA option for the Kubernetes worker if it tries to use the GPU anyway. Does that make sense?