amimof / kubernetes-the-right-way

Install Kubernetes with Ansible
MIT License

Make sure upgrades are sequential #48

Closed anton-johansson closed 5 years ago

anton-johansson commented 5 years ago

This way, we won't stop, for example, the API server or kube-proxy on all masters or nodes at the same time, causing downtime.

anton-johansson commented 5 years ago

Testing right now, noticed an issue. With serial: 1, facts aren't gathered for all hosts up front, which etcd needs (and maybe more). Looking into it.

Edit: Solved by forcing facts gathering before running etcd tasks, as per 75785dedc6f7386744e80455c5256b9fc160c577.
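
For reference, the workaround is essentially a facts-only play prepended before the serial etcd play, roughly like this (a minimal sketch; the actual commit may differ):

- hosts: etcd
  # Facts-only play: touch every etcd host once so that hostvars
  # for all peers are populated before the serial play runs.
  gather_facts: true
  tasks: []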

anton-johansson commented 5 years ago

I've tested running the playbook on an existing cluster without upgrading, and it looks like it does what it's supposed to: one task on one host at a time.

I'll probably test this again when 1.15.0 is released (or if I'm bold enough to try a release candidate).

It would be nice to get containerd up and running. Any ideas for how we can do that, @amimof? Is #47 an alternative?

amimof commented 5 years ago

@anton-johansson I'm not sure this is the right approach, i.e. running all tasks in serial. The serial property is not supported in Ansible on a per-task basis, which means that all tasks under a play would run in serial. In large clusters, that would increase the execution time significantly.

I don't believe it's necessary to install masters with a rolling strategy. Maybe that even applies to the kubelet and kube-proxy as well, and only containerd should be restarted one host at a time. It all depends on the type of downtime: is it workloads or APIs?

What is your experience with upgrading a production cluster with KTRW?

anton-johansson commented 5 years ago

Hmm, okay.

I did lose connectivity to my services for a period when I upgraded from 1.13 to 1.14. I'm not sure what caused it; my guess was kube-proxy, but it could of course have been containerd. I don't see how, though, because containers don't die just because containerd dies, right? kube-proxy, however, handles the network proxying to the correct container, which feels super dangerous if all instances are restarted at the same time.

I understand that masters don't need to be upgraded in serial, but I would still feel safer if I could reach my API server during the upgrade, which I can't if I restart them all at the same time.

> which means that all tasks under a play would run in serial

This is pretty much what I wanted to accomplish. :) Do you know if this setting can be controlled with a variable somehow? I guess it would be nice if it defaulted to 100% but could be set using some parameter. I'd set it to 1 (or maybe 50%, which would effectively be 1 of 3 hosts, 2 of 5, etc.) in my case.

Or maybe a constant 50% would suffice? It would of course increase the execution time for large clusters, but it wouldn't be that bad at 50%.

amimof commented 5 years ago

@anton-johansson Yes, kube-proxy downtime will cause network connectivity to containers to be interrupted. And I'm pretty sure that containers are restarted when containerd is restarted. Here is an example of how we could restart a service per host without the use of serial. But I still haven't figured out how to check that the service is up before moving on to the next host.

- name: Restart and enable kubelet
  systemd:
    name: kubelet
    state: restarted
    enabled: True
  # Loop over every node from a single play host so the restarts
  # happen one node at a time, without serial on the play.
  with_items: "{{ groups['nodes'] }}"
  run_once: True
  delegate_to: "{{ item }}"
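
One possibility for the missing health check (an untested sketch: looping over include_tasks interleaves the restart and the wait per host, and the kubelet healthz endpoint on 127.0.0.1:10248 is an assumption about the default configuration):

# In the play: run the included file once per node.
- name: Restart kubelet one node at a time
  include_tasks: restart-kubelet.yml
  with_items: "{{ groups['nodes'] }}"
  run_once: True

# restart-kubelet.yml
- name: Restart and enable kubelet
  systemd:
    name: kubelet
    state: restarted
    enabled: True
  delegate_to: "{{ item }}"

- name: Wait for kubelet to report healthy before the next host
  uri:
    url: http://127.0.0.1:10248/healthz
  register: kubelet_health
  until: kubelet_health.status == 200
  retries: 10
  delay: 5
  delegate_to: "{{ item }}"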

Another solution would be to restart everything at the very end, by putting a section in install.yml, something like below. However, this would make the individual roles a bit less independent in my opinion. So my idea is to pass a state variable to the roles that controls whether the component should be restarted after it is installed. The default would be true, since we want the roles to be self-contained. The resulting install.yml could look something like this:

- hosts: localhost
  connection: local
  roles:
    - certificates

- hosts: etcd
  roles:
    - { role: etcd, vars: { restart: false }}

- hosts: masters
  roles:
    - { role: kube-apiserver, vars: { restart: false }}
    - { role: kube-controller-manager, vars: { restart: false }}
    - { role: kube-scheduler, vars: { restart: false }}

- hosts: nodes
  roles:
    - cni
    - { role: containerd, vars: { restart: false }}
    - runc
    - { role: kube-proxy, vars: { restart: false }}
    - { role: kubelet, vars: { restart: false }}

# Restart components in serial
- hosts: nodes
  serial: 1
  roles:
    - { role: containerd, vars: { restart: true }}
    - { role: kube-proxy, vars: { restart: true }}
    - { role: kubelet, vars: { restart: true }}

- hosts: etcd
  serial: 1
  roles:
    - { role: etcd, vars: { restart: true }}

- hosts: masters
  serial: 1
  roles:
    - { role: kube-apiserver, vars: { restart: true }}
    - { role: kube-controller-manager, vars: { restart: true }}
    - { role: kube-scheduler, vars: { restart: true }}

anton-johansson commented 5 years ago

I like the last suggestion, where we only restart the services in serial.

But wouldn't we need two variables? One indicating that we should prepare everything (download binaries, update configuration files, etc.) and one indicating that the service should be restarted. Both could default to true, so the role is independent and can be executed standalone. But in the install playbook, the first part (without serial) would only prepare, while the last part (in serial) would only restart (and wait for the service to come back up)?

EDIT: Good to know that stopping containerd stops containers, I wasn't aware of that.

amimof commented 5 years ago

Yeah, you are right, we need two variables: state: present|absent and restart: true|false.
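
A sketch of how a role could honour both variables; the defaults keep the role self-contained, and the download task (including kubelet_download_url) is purely illustrative, not the actual role:

# roles/kubelet/defaults/main.yml
state: present
restart: true

# roles/kubelet/tasks/main.yml
- name: Install the kubelet binary
  get_url:
    url: "{{ kubelet_download_url }}"  # illustrative variable
    dest: /usr/bin/kubelet
    mode: '0755'
  when: state == 'present'

- name: Restart and enable kubelet
  systemd:
    name: kubelet
    state: restarted
    enabled: True
  when: restart | bool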

anton-johansson commented 5 years ago

I can see if I can get something like that up and running.

amimof commented 5 years ago

@anton-johansson So I've been doing some experimenting with my initial suggestion, where I proposed putting all restarts at the end of the playbook. I have come to the conclusion that this is an anti-pattern in terms of Ansible best practices, because everything revolves around Plays, not Playbooks.

The "best" possible implementation (Ansible-wise) is to use Handlers. But handlers alone doesn't solve the sequential restart problem. However by combining handlers with Rolling Updates (serial) we can achieve what we wan't in a good way.

So I propose that we let users set the serial value for all plays (with a default). That way, it would be up to the user to determine how updates are applied. For example, in large clusters (>50 nodes) it may be perfectly fine to have 2 or more nodes down for maintenance, which would translate to serial: 2. What is cool is that Ansible supports passing a percentage value to serial, so one could set serial: 5%.
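
In outline, the combination looks like this (a sketch; handlers fire at the end of each serial batch, so restarts never hit more hosts at once than serial allows):

# roles/kubelet/tasks/main.yml
- name: Render the kubelet unit file
  template:
    src: kubelet.service.j2
    dest: /etc/systemd/system/kubelet.service
  notify: restart kubelet

# roles/kubelet/handlers/main.yml
- name: restart kubelet
  systemd:
    name: kubelet
    state: restarted
    daemon_reload: True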

I've opened a PR that adds restart handlers to all appropriate roles. Have a look and let me know what you think.

anton-johansson commented 5 years ago

Yeah, that sounds good. What would be a good default for this value, though? I mentioned earlier in this PR that if we use 50% as the default, it would effectively be 1 of 3 hosts, 2 of 5, etc., which feels like a good default for high availability.

Or would you want it to be 100% by default, indicating pure parallelism?

anton-johansson commented 5 years ago

We can move the discussion over to #52. Closing this!

anton-johansson commented 5 years ago

My mistake, this PR is not really related to #52. Re-opening; I'll see if I can make the appropriate changes.

anton-johansson commented 5 years ago

I've made some changes @amimof, but I have yet to test this thoroughly. I'll remove [WIP] from the title when tested.

anton-johansson commented 5 years ago

It appears that host variables cannot be used at the play level (e.g. for serial), only in tasks within the play.

See ansible/ansible#18131

How about using --extra-vars?

For example:

$ ansible-playbook --inventory ansible-inventory --extra-vars "serial_masters=50%" ~/projects/kubernetes-the-right-way/install.yml
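
On the play side, the extra variable could then be consumed with a default, since extra vars are available when the serial keyword is evaluated (a sketch; the 100% default preserves the current all-at-once behaviour):

- hosts: masters
  serial: "{{ serial_masters | default('100%') }}"
  roles:
    - kube-apiserver
    - kube-controller-manager
    - kube-scheduler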

anton-johansson commented 5 years ago

I've made some changes to utilize the command-line argument --extra-vars. This works well. As you can see, I had to gather facts for all etcd hosts, because when using serial, facts aren't gathered for all of them immediately.

I think this needs to be done for kube-apiserver as well, if we want plays to be independent of other plays. Could you confirm this?

Also, this means that we'll run the facts gathering even if we don't need to (i.e. if we don't use serial). Do you know if there's a way to prevent this? Maybe use when? Something like:

when: hostvars[item]['ansible_hostname'] is not defined

amimof commented 5 years ago

> I've made some changes to utilize the command-line argument --extra-vars. This works well. As you can see, I had to gather facts for all etcd hosts, because when using serial, facts aren't gathered for all of them immediately.
>
> I think this needs to be done for kube-apiserver as well, if we want plays to be independent of other plays. Could you confirm this?
>
> Also, this means that we'll run the facts gathering even if we don't need to (i.e. if we don't use serial). Do you know if there's a way to prevent this? Maybe use when? Something like:
>
> when: hostvars[item]['ansible_hostname'] is not defined

Nice job! I also encountered that serial_* needs to be passed as an extra variable, but I think it's fine for now. I can confirm that gathering facts is also required for kube-apiserver. See: https://github.com/amimof/kubernetes-the-right-way/blob/1499f869146507585fcf9331aba35dc59a75e698/roles/kube-apiserver/templates/kube-apiserver.service.j2#L21

Also, I don't mind gathering facts even if it isn't needed. It should be possible with a simple when, but this needs to be tested.
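
A possible shape for that guard (an untested sketch; delegate_facts stores the gathered facts under the delegated host, so the hostvars lookup works afterwards):

- name: Gather facts for etcd hosts that don't have them yet
  setup:
  delegate_to: "{{ item }}"
  delegate_facts: true
  with_items: "{{ groups['etcd'] }}"
  when: hostvars[item]['ansible_hostname'] is not defined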

anton-johansson commented 5 years ago

Fixed kube-apiserver.

Also squashed some commits together to avoid redundant messages. Kept some commits separate because they're actually informational.

amimof commented 5 years ago

@anton-johansson It might be a good idea to also make use of serial in the CI pipeline by passing -e "serial_all=50%" to https://github.com/amimof/kubernetes-the-right-way/blob/1499f869146507585fcf9331aba35dc59a75e698/.travis.yml#L22 Other than that, LGTM and it should be good to merge.

anton-johansson commented 5 years ago

Good idea! Done, let's see what Travis says. :)