evrardjp / ansible-keepalived

Keepalived role for ansible deployment
Apache License 2.0

Add more execution granularity / idempotency by splitting configuration #209

Closed - Kariton closed this 2 years ago

Kariton commented 2 years ago

As discussed in https://github.com/evrardjp/ansible-keepalived/issues/200 and drafted in https://github.com/evrardjp/ansible-keepalived/pull/203

Hopefully I didn't mess anything up.

Kariton commented 2 years ago
  TASK [ansible-keepalived : ensure keepalived is enabled] ***********************
  fatal: [keepalived-centos7]: FAILED! => {"changed": true, "cmd": ["systemctl", "enable", "keepalived", "--now"], "delta": "0:00:00.605824", "end": "2022-07-12 17:45:22.240112", "msg": "non-zero return code", "rc": 1, "start": "2022-07-12 17:45:21.634288", "stderr": "Created symlink from /etc/systemd/system/multi-user.target.wants/keepalived.service to /usr/lib/systemd/system/keepalived.service.\nJob for keepalived.service failed because the control process exited with error code. See \"systemctl status keepalived.service\" and \"journalctl -xe\" for details.", "stderr_lines": ["Created symlink from /etc/systemd/system/multi-user.target.wants/keepalived.service to /usr/lib/systemd/system/keepalived.service.", "Job for keepalived.service failed because the control process exited with error code. See \"systemctl status keepalived.service\" and \"journalctl -xe\" for details."], "stdout": "", "stdout_lines": []}

The current main branch does indeed NOT produce the same error on my side... even though I initially thought it was an error on my end that just looked like a "Docker problem"; my VMs do work as expected...

I will see what I can find.

Kariton commented 2 years ago

oh well... old keepalived version FTW:

[root@keepalived-centos7 /]# systemctl status keepalived
● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/keepalived.service.d
           └─override.conf
   Active: failed (Result: exit-code) since Tue 2022-07-12 18:09:00 UTC; 22s ago
  Process: 1447 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=3)

Jul 12 18:09:00 keepalived-centos7 systemd[1]: Starting LVS and VRRP High Availability Monitor...
Jul 12 18:09:00 keepalived-centos7 Keepalived[1447]: Starting Keepalived v1.3.5 (03/19,2017), git commit v1.3.5-6-g6fa32f2
Jul 12 18:09:00 keepalived-centos7 Keepalived[1447]: Opening file '/etc/keepalived/keepalived.conf'.
Jul 12 18:09:00 keepalived-centos7 Keepalived[1447]: Unable to find config file(s) '/etc/keepalived/scripts/*.conf'.
Jul 12 18:09:00 keepalived-centos7 systemd[1]: keepalived.service: control process exited, code=exited status=3
Jul 12 18:09:00 keepalived-centos7 systemd[1]: Failed to start LVS and VRRP High Availability Monitor.
Jul 12 18:09:00 keepalived-centos7 systemd[1]: Unit keepalived.service entered failed state.
Jul 12 18:09:00 keepalived-centos7 systemd[1]: keepalived.service failed.
[root@keepalived-centos7 /]# ll /etc/keepalived/scripts/
total 0
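
So the failure is the glob-style include pointing at a directory that is still empty. A minimal sketch of the relevant part of the generated /etc/keepalived/keepalived.conf (only the include path is taken from the error message above; the surrounding directives are illustrative):

# sketch only - the include path comes from the error message above
global_defs {
    router_id keepalived-centos7
}

# keepalived 1.3.5 exits with status 3 when this glob matches no files,
# which is exactly the case while /etc/keepalived/scripts/ is still empty
include /etc/keepalived/scripts/*.conf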
evrardjp commented 2 years ago

Thanks for continuing the work on this.

I still have trouble wrapping my head around the reason for splitting into different config files.

You mention more granularity/idempotency, but I am not really sure I understand it. Would you mind clarifying? For me, adding variables to do the cleanup is by far the biggest pain point of the split (it means people have to read the code instead of just editing their vars).

Kariton commented 2 years ago

Sure:

Here is a direct playbook example:

inventory

[squiddev]
proxydev01.example.tld
proxydev02.example.tld

[loadbalancers]
loadbalancer01.example.tld
loadbalancer02.example.tld

playbook/squiddev-first.yml

---

- hosts: loadbalancers, proxydev01.example.tld
  vars:
    activeconn_threshold: 5

  pre_tasks:
    - name: "exclude {{ groups['squiddev'][0] }} from loadbalancer"
      ansible.builtin.include_role:
        name: ansible-keepalived
        apply:
          tags:
            - keepalived-config
      when: "'loadbalancers' in group_names"
      vars:
        keepalived_virtual_server_groups:
          - name: proxy
            vips:
              - ip: '172.16.10.30'
                port: 3128
            delay_loop: 5
            protocol: TCP
            lvs_sched: wrr
            lvs_method: DR
            persistence_timeout: 120
            real_servers:
              - ip: '172.28.20.31'
                port: 3128
                weight: 0
                tcp_checks:
                  - connect_port: 3128
                    connect_timeout: 1
                    retry: 2
                    delay_before_retry: 2
              - ip: '172.28.20.32'
                port: 3128
                weight: 1
                tcp_checks:
                  - connect_port: 3128
                    connect_timeout: 1
                    retry: 2
                    delay_before_retry: 2

    - name: Force all notified handlers to run at this point, not waiting for normal sync points
      ansible.builtin.meta: flush_handlers

    - name: Verify weight is set to zero
      ansible.builtin.shell:
        cmd: "ipvsadm -L | grep {{ groups['squiddev'][0] }} | awk '{ print $4 }'"
      register: keepalived_weight
      run_once: true
      changed_when: false
      failed_when: keepalived_weight.stdout | int != 0
      delegate_to: "{{ groups['loadbalancers'][0] }}"

    - name: Wait until 'ActiveConn' are below threshold
      ansible.builtin.shell:
        cmd: "ipvsadm -L | grep {{ groups['squiddev'][0] }} | awk '{ print $5 }'"
      register: keepalived_activeconn
      until: keepalived_activeconn.stdout | int <= activeconn_threshold
      retries: 300
      delay: 5
      run_once: true
      changed_when: false
      delegate_to: "{{ groups['loadbalancers'][0] }}"

  tasks:
    - name: "configure squid {{ groups['squiddev'][0] }}"
      ansible.builtin.include_role:
        name: squid
      when: "'squiddev' in group_names"

    - name: Force all notified handlers to run at this point, not waiting for normal sync points
      ansible.builtin.meta: flush_handlers

  post_tasks:
    - name: "include {{ groups['squiddev'][0] }} in loadbalancer"
      ansible.builtin.include_role:
        name: ansible-keepalived
        apply:
          tags:
            - keepalived-config
      when: "'loadbalancers' in group_names"
      vars:
        keepalived_virtual_server_groups:
          - name: proxy
            vips:
              - ip: '172.16.10.30'
                port: 3128
            delay_loop: 5
            protocol: TCP
            lvs_sched: wrr
            lvs_method: DR
            persistence_timeout: 120
            real_servers:
              - ip: '172.28.20.31'
                port: 3128
                weight: 1
                tcp_checks:
                  - connect_port: 3128
                    connect_timeout: 1
                    retry: 2
                    delay_before_retry: 2
              - ip: '172.28.20.32'
                port: 3128
                weight: 1
                tcp_checks:
                  - connect_port: 3128
                    connect_timeout: 1
                    retry: 2
                    delay_before_retry: 2
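
For completeness, I run this roughly like the following (the second playbook name is just a placeholder for the mirrored run against the other squid host):

ansible-playbook -i inventory playbook/squiddev-first.yml
# hypothetical counterpart that drains proxydev02 instead (weights swapped)
ansible-playbook -i inventory playbook/squiddev-second.yml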
Kariton commented 2 years ago

This is just the result from my personal lab and the use case I target with this PR, but the role now offers a few different configurations that can be applied in a more "ad hoc" way.

A series of playbooks will be able to define the desired state, with a lot of flexibility and granularity. If you just want a floating IP, this will be overkill.

If you configure an entire IPVS router, it might be needed - at least for me it is.

Kariton commented 2 years ago

You will also be able to define everything within group vars (or wherever) and only update portions as needed for updates or other kinds of maintenance. (My current example works this way.)
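
As a sketch of that pattern (file name and values are illustrative; the variable structure is the one from the example above), the baseline sits in group vars and a play like the one above only overrides the real_servers weights:

# group_vars/loadbalancers.yml - illustrative baseline
keepalived_virtual_server_groups:
  - name: proxy
    vips:
      - ip: '172.16.10.30'
        port: 3128
    delay_loop: 5
    protocol: TCP
    lvs_sched: wrr
    lvs_method: DR
    persistence_timeout: 120
    real_servers:
      - ip: '172.28.20.31'
        port: 3128
        weight: 1
      - ip: '172.28.20.32'
        port: 3128
        weight: 1
    # tcp_checks omitted for brevity; see the playbook example above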

Kariton commented 2 years ago

For me, adding variables to do the cleanup is by far the biggest pain point of the split (it means people have to read the code instead of just editing their vars)

I understand your concerns. The cleanup is not there to delete the configuration in general; every dict can be deleted on its own, like here: tests/keepalived_haproxy_combined_edit_example.yml

But if - in some case, eventually - there are leftovers that did not get handled correctly... what will happen? In the worst case keepalived will refuse to start (hopefully not on the entire keepalived cluster).

You can purge everything and quickly start over.
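
As a rough illustration of that escape hatch (the playbook name and the purge variable are hypothetical here; the real knob is whatever this PR ends up defining):

# hypothetical one-off run that wipes the generated config fragments and rebuilds them from the current vars
ansible-playbook -i inventory site.yml --tags keepalived-config -e 'keepalived_purge_config=true'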

I was inspired by the linux-system-roles/logging role for this. It has saved me hours of debugging.

We will surely find a way to describe that clearly within the README.md and defaults/main.yml, with examples (the squid one in particular) and such.

Kariton commented 2 years ago

If you don't mind another PR for full IPVS configuration capabilities, it will follow soon™. It's mostly sysctl-related / kernel parameters.

This would push the potential of this role even further.
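
To give an idea of what I mean, this is roughly the kind of task it would add (the parameter selection is my guess at typical LVS/DR tuning, not necessarily what the PR will ship):

# hypothetical task: kernel parameters commonly set on an IPVS/DR director
- name: Set IPVS-related kernel parameters
  ansible.posix.sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    sysctl_set: true
    state: present
  loop:
    - { name: net.ipv4.ip_forward, value: '1' }
    - { name: net.ipv4.vs.expire_nodest_conn, value: '1' }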

Kariton commented 2 years ago

I somehow missed the state handling for the keepalived instances.

Kariton commented 2 years ago

Since be37c73c0661685fe0a4688a5710121c9f6faa18 (removal of official RHEL 7 support), my task to treat RHEL 7 in a special manner (d53e31c0781533e20d5d41fb90d2af6eb88436aa) is no longer needed.

Kariton commented 2 years ago

Solved and no longer needed, as I found another solution which is sufficient: https://github.com/evrardjp/ansible-keepalived/issues/200#issuecomment-1184949648

evrardjp commented 2 years ago

If you don't mind another PR for full IPVS configuration capabilities, it will follow soon™. It's mostly sysctl-related / kernel parameters.

This would push the potential of this role even further.

That sounds awesome! If we introduce testing around it, it will also be reliable for you in the long run.

evrardjp commented 2 years ago

Great example! I am thinking of building a collection for HA; would it make sense to include these kinds of examples in the collection? WDYT? Let's discuss this in the "discussion" on GitHub!