conjure-up / conjure-up

Deploying complex solutions, magically.
https://conjure-up.io
MIT License

canonical-kubernetes fails to deploy kubernetes worker #1150

Open estechnical opened 7 years ago

estechnical commented 7 years ago

This situation seems fairly nonsensical. I have MAAS and would like to deploy Kubernetes as per the canonical-kubernetes spell.

I have several small servers with 4 cores and 3 large servers with 16 cores.

conjure-up seems to run fine and I can start the deployment; it deploys easyrsa etc., but kubernetes-worker always gets stuck at 'waiting for machine'.

During conjure-up I'm choosing to pin the juju machines to particular boxes, but this doesn't seem to be honoured during the deployment.

I am new to all this so debugging has been slow, but while googling I found I could check the output of 'juju status'. The messages I see make little sense: our large servers exceed the constraints in CPU cores and RAM, yet Juju seems to avoid placing kubernetes-worker on them, instead choosing the smaller servers and then running out of them.

cannot run instances: cannot run instance: No available machine matches constraints: [('mem', ['4096']), ('agent_name', ['9a3b37de-cf8c-4496-830f-5601d52e0187']), ('cpu_count', ['4']), ('zone', ['default'])] (resolved to "cpu_count=4.0 mem=4096.0 zone=default")
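(For anyone cross-checking the same failure: one way to compare what MAAS actually reports for each machine against the failing constraint is the MAAS CLI. A minimal sketch, assuming a CLI profile named admin and jq installed; the host and API key are placeholders.)

maas login admin http://<maas-host>:5240/MAAS/api/2.0/ <api-key>
# Print cores, RAM, status and tags per machine, to compare against
# the failing constraint (cpu_count=4 mem=4096)
maas admin machines read | jq -r '.[] | "\(.hostname)  \(.cpu_count) cores  \(.memory) MB  \(.status_name)  \(.tag_names)"'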

It appears the constraints are failing only for our large servers. Or am I doing something wrong when pinning the machines? There seems to be something odd about the menu for selecting which juju machine to place on a physical machine.

Please provide the output of the following commands

which juju
/snap/bin/juju

juju version
2.3-alpha1-xenial-amd64

which conjure-up
/snap/bin/conjure-up
conjure-up --version
conjure-up 2.4-alpha1

# LXC is not installed... should it be, for deployment to a MAAS environment?
which lxc
lxc config show
lxc version

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"

Please attach tarball of ~/.cache/conjure-up: conjure-up.tar.gz
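(A tarball like the one requested can be produced with the following, assuming the default cache location:)

tar czf conjure-up.tar.gz -C ~/.cache conjure-up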

Sosreport

Please attach a sosreport:

sudo apt install sosreport
sosreport   # this had to be run as "sudo sosreport" to work for me

# sosreport appears to require root permissions to run properly on ubuntu!
# sosreport generates a .tar.xz file, which github is not allowing me to attach
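(One possible workaround for the attachment limit, assuming sosreport's default /tmp output path, is to repack the .tar.xz as a .gz, which GitHub accepts:)

# Decompress the xz layer, then recompress as gzip
xz -d /tmp/sosreport-*.tar.xz
gzip /tmp/sosreport-*.tar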

What Spell was Selected?

canonical-kubernetes

What provider (aws, maas, localhost, etc)?

MAAS

MAAS Users

Which version of MAAS? MAAS version: 2.2.2 (6099-g8751f91-0ubuntu1~16.04.1)

Commands ran

Please outline what commands were run to install and execute conjure-up: I think I initially started out with this guide: https://tutorials.ubuntu.com/tutorial/install-kubernetes-with-conjure-up

so:

sudo snap install conjure-up --classic
conjure-up [kubernetes]
# chose the canonical-kubernetes option

Additional Information

I saw another issue about colocation not working correctly - this also happens to our setup, we see more machines provisioned than we expected.

adam-stokes commented 7 years ago

I saw another issue about colocation not working correctly - this also happens to our setup, we see more machines provisioned than we expected.

Yea :( We are working to fix that soon.

As for the MAAS constraints, here is your bundle representation (what juju uses to deploy the applications):

description: A nine-machine Kubernetes cluster, appropriate for production. Includes
  a three-machine etcd cluster and three Kubernetes worker nodes.
machines:
  '0':
    constraints: tags=q4gg3b
    series: xenial
  '1':
    constraints: tags=46p44n
    series: xenial
  '2':
    constraints: tags=7hmwxn
    series: xenial
  '3':
    constraints: tags=kbf4kd
    series: xenial
  '4':
    constraints: tags=bkxxf7
    series: xenial
  '5':
    constraints: tags=7yxcdk
    series: xenial
  '6':
    constraints: tags=spq4qp
    series: xenial
  '7':
    constraints: tags=p6q3rm
    series: xenial
  '8':
    constraints: tags=ahfmmk
    series: xenial
relations:
- - kubernetes-master:certificates
  - easyrsa:client
- - etcd:certificates
  - easyrsa:client
- - kubernetes-worker:certificates
  - easyrsa:client
- - kubeapi-load-balancer:certificates
  - easyrsa:client
- - kubernetes-master:etcd
  - etcd:db
- - kubernetes-master:kube-api-endpoint
  - kubeapi-load-balancer:apiserver
- - kubernetes-master:loadbalancer
  - kubeapi-load-balancer:loadbalancer
- - kubernetes-worker:kube-api-endpoint
  - kubeapi-load-balancer:website
- - kubernetes-master:kube-control
  - kubernetes-worker:kube-control
series: xenial
services:
  easyrsa:
    charm: cs:~containers/easyrsa-15
    num_units: 1
    to:
    - '0'
  etcd:
    charm: cs:~containers/etcd-48
    num_units: 3
    to:
    - '1'
    - '2'
    - '3'
  kubeapi-load-balancer:
    charm: cs:~containers/kubeapi-load-balancer-25
    num_units: 1
    to:
    - '4'
  kubernetes-master:
    charm: cs:~containers/kubernetes-master-47
    num_units: 1
    options:
      channel: 1.7/stable
    to:
    - '5'
  kubernetes-worker:
    charm: cs:~containers/kubernetes-worker-52
    num_units: 3
    options:
      channel: 1.7/stable
    to:
    - '6'
    - '7'
    - '8'

If you take this bundle, save it to a file like bundle.yaml, and try juju deploy --debug ./bundle.yaml, does the same error about no machines matching constraints happen?
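(Spelled out, the suggested check would look something like this; the watch interval is arbitrary:)

juju deploy --debug ./bundle.yaml
# Watch machines being requested from MAAS and units being placed
watch -n 10 juju status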

conjure-up tags your MAAS machines, which is what you see at the top of the bundle I pasted. If you click on those machines in your MAAS web UI, do they have those tags listed?
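(The same check can be done from the MAAS CLI rather than the web UI; a sketch, again assuming a profile named admin:)

# List every tag MAAS knows about
maas admin tags read | jq -r '.[].name'
# Show which machine carries one of the conjure-up tags
maas admin tag machines q4gg3b | jq -r '.[].hostname'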

I apologize for the complications with MAAS, we are working hard to make that experience a lot better.

estechnical commented 7 years ago

Thanks for your help :) No apologies needed - I like these systems and will contribute what I can... Even if only bug reports and testing...

I think your constraints worked much better than the defaults. I still run into one machine not being placed, which is a little odd. It seems the juju controller is bootstrapped on the machine in question... might this be related to the colocation issues?

2        down                  pending  xenial           cannot run instances: cannot run instance: No available machine matches constraints: [('agent_name', ['046b8d50-b48e-42db-8969-cbb602527fea']), ('tags', ['7hmwxn']), ('zone', ['default'])] (resolved to "tags=7hmwxn zone=default")

The machine with the tag 7hmwxn shows as already deployed in MAAS.
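(One way to confirm that suspicion: the bootstrap node lives in the controller model as machine 0. A sketch, using the default controller model name:)

juju show-machine 0 -m controller
# Compare the instance-id / hostname shown here with the MAAS machine
# tagged 7hmwxn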

As we have enough machines to work around this, I'm going to try tagging all our machines, e.g. "small" and "large", and see if I can get it up and running that way, even if it uses more machines than strictly needed.

Now I've tried this: I have 7 machines tagged "small" and 3 tagged "huge". Swapping the per-machine tag constraints for just "tags=small" and "tags=huge" resulted in a successful deployment using "juju deploy --debug ./bundle.yaml" :)
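(For anyone repeating this, the retagging can also be scripted against the MAAS CLI; the system_id below is hypothetical:)

# Create the broad tags once
maas admin tags create name=small
maas admin tags create name=huge
# Attach a tag to a machine by its system_id (visible in the MAAS UI/API)
maas admin tag update-nodes huge add=abc123
# Point the bundle at the broad tag instead of the per-machine one
sed -i 's/tags=7hmwxn/tags=huge/' bundle.yaml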

It has used all the machines and produced a slightly different result from the conjure-up way. I notice that deploying with juju deploy has not placed flannel on anything.

I'm still waiting for what looks like a very final step to complete:

Unit                      Workload  Agent  Machine  Public address  Ports     Message
easyrsa/0*                active    idle   0        10.10.10.11               Certificate Authority connected.
etcd/0                    active    idle   1        10.10.10.12     2379/tcp  Healthy with 3 known peers
etcd/1*                   active    idle   2        10.10.10.13     2379/tcp  Healthy with 3 known peers
etcd/2                    active    idle   3        10.10.10.14     2379/tcp  Healthy with 3 known peers
kubeapi-load-balancer/0*  active    idle   4        10.10.10.15     443/tcp   Loadbalancer ready.
kubernetes-master/0*      waiting   idle   5        10.10.10.16     6443/tcp  Waiting for kube-system pods to start
kubernetes-worker/0*      waiting   idle   6        10.10.10.17               Waiting for kube-proxy to start.
kubernetes-worker/1       waiting   idle   7        10.10.10.18               Waiting for kube-proxy to start.
kubernetes-worker/2       waiting   idle   8        10.10.10.19               Waiting for kube-proxy to start.
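(While those hooks settle, the kube-system pods can be inspected directly. A sketch assuming kubectl is installed locally; in CDK the master charm writes a kubeconfig named config into its home directory:)

# Fetch cluster credentials from the master, then inspect the pods
juju scp kubernetes-master/0:config ~/.kube/config
kubectl get pods -n kube-system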

I'm available for testing for the rest of this week, I am just taking my first steps with a real kubernetes cluster and expect to refine things a little as I go.

Thanks again :)

estechnical commented 7 years ago

Aha! https://api.jujucharms.com/charmstore/v5/canonical-kubernetes/archive/bundle.yaml

I just added the flannel description and the relations shown in the above bundle and re-ran juju deploy. It has added the flannel parts I expected to see (from my previous experiments)...
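(For reference, the pieces in question look roughly like this in the upstream bundle; the charm revision is illustrative. flannel is a subordinate charm, so it needs no machine placement of its own:)

# under services:
  flannel:
    charm: cs:~containers/flannel-13
# under relations:
- - flannel:etcd
  - etcd:db
- - flannel:cni
  - kubernetes-master:cni
- - flannel:cni
  - kubernetes-worker:cni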


IT WORKS :D

adam-stokes commented 7 years ago

I'm going to leave this bug open so that we can track the progress of making the placement editor a lot better.

estechnical commented 7 years ago

Ok, thanks. Please let me know if you need further testing...

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.