NeoAssist / docker-keepalived

Dockerized keepalived to ease HA in deployments with multiple hosts. Provides failover for Virtual IPs (VIP) to be always online even if a host fails. Initially aimed to help Rancher HA deployments
MIT License

Latest image from docker hub doesn't work on CentOS 7 #23

Open claflico opened 6 years ago

claflico commented 6 years ago

Spun up some new load balancer Docker hosts last night and attempted to migrate the keepalived service to them, but the VIP would never come up.

This is a snippet of the logs:

9/19/2018 1:56:29 PMWed Sep 19 13:56:29 2018: VRRP sockpool: [ifindex(2), proto(112), unicast(0), fd(8,9)]
9/19/2018 1:56:29 PMWed Sep 19 13:56:29 2018: Script `chk_haproxy` now returning 2
9/19/2018 1:56:29 PMWed Sep 19 13:56:29 2018: VRRP_Script(chk_haproxy) failed (exited with status 2)
9/19/2018 1:56:29 PMWed Sep 19 13:56:29 2018: (lb-vips) Entering FAULT STATE
9/19/2018 1:56:29 PMWed Sep 19 13:56:29 2018: Kernel/system configuration issue causing multicast packets to be received but IP_MULTICAST_ALL unset
9/19/2018 1:56:31 PMDisplaying resulting /etc/keepalived/keepalived.conf contents...
9/19/2018 1:56:31 PMWed Sep 19 13:56:31 2018: Starting Keepalived v2.0.4 (06/24,2018), git commit v3.8.0_rc8-47-g5ec10636b6
9/19/2018 1:56:31 PMWed Sep 19 13:56:31 2018: WARNING - keepalived was build for newer Linux 4.4.6, running on Linux 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018
9/19/2018 1:56:31 PMWed Sep 19 13:56:31 2018: Opening file '/etc/keepalived/keepalived.conf'.
9/19/2018 1:56:31 PM    global_defs {
9/19/2018 1:56:31 PM        #Hostname will be used by default
9/19/2018 1:56:31 PM        #router_id your_name
9/19/2018 1:56:31 PM        vrrp_version 2
9/19/2018 1:56:31 PM        vrrp_garp_master_delay 1
9/19/2018 1:56:31 PM        vrrp_garp_master_refresh 60
9/19/2018 1:56:31 PM        #Uncomment the next line if you'd like to use unique multicast groups
9/19/2018 1:56:31 PM        #vrrp_mcast_group4 224.0.0.12
9/19/2018 1:56:31 PM        script_user root
9/19/2018 1:56:31 PM    }
9/19/2018 1:56:31 PM
9/19/2018 1:56:31 PM    vrrp_script chk_haproxy {
9/19/2018 1:56:31 PM        script       "iptables -t nat -nL CATTLE_PREROUTING | grep ':80'"
9/19/2018 1:56:31 PM        timeout 1
9/19/2018 1:56:31 PM        interval 1   # check every 1 second
9/19/2018 1:56:31 PM        fall 2       # require 2 failures for KO
9/19/2018 1:56:31 PM        rise 2       # require 2 successes for OK
9/19/2018 1:56:31 PM    }
9/19/2018 1:56:31 PM
9/19/2018 1:56:31 PM    vrrp_instance lb-vips {
9/19/2018 1:56:31 PM        state BACKUP
9/19/2018 1:56:31 PM        interface eth0
9/19/2018 1:56:31 PM        virtual_router_id 12
9/19/2018 1:56:31 PM        priority 100
9/19/2018 1:56:31 PM        advert_int 1
9/19/2018 1:56:31 PM        nopreempt #Prevent fail-back
9/19/2018 1:56:31 PM        track_script {
9/19/2018 1:56:31 PM            chk_haproxy
9/19/2018 1:56:31 PM        }
9/19/2018 1:56:31 PM        authentication {
9/19/2018 1:56:31 PM            auth_type PASS
9/19/2018 1:56:31 PM            auth_pass blahblah
9/19/2018 1:56:31 PM        }
9/19/2018 1:56:31 PM        virtual_ipaddress {
9/19/2018 1:56:31 PM            10.XX.XX.12/24 dev eth0
9/19/2018 1:56:31 PM        }
9/19/2018 1:56:31 PM    }
9/19/2018 1:56:31 PMStarting Keepalived in the background...
9/19/2018 1:56:31 PMWed Sep 19 13:56:31 2018: daemon is already running
9/19/2018 1:56:31 PM/usr/bin/keepalived.sh: line 101: wait: pid 19 is not a child of this shell

I saw that the new hosts were using an image that was created 5 weeks ago. I went to a previous host that still had the image created 13 months ago, tagged it, and pushed it to our Docker image server. I configured the service to use that tagged image and the VIP came up on the new hosts, so the problem must be something in the new image, since the image is the only thing that changed.

Also, the check port script should probably be changed from grep ':${CHECK_PORT}' to grep 'dpt:${CHECK_PORT} ' (note the trailing space), because otherwise the check can return a false positive when something is also listening on port 8000 (e.g. Traefik) on that host:

iptables -t nat -nL CATTLE_PREROUTING | grep ':80'
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:80 to:10.XX.XX.45:80
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:80 ADDRTYPE match dst-type LOCAL to:10.XX.XX.45:80
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8000 to:10.XX.XX.45:8000
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8000 ADDRTYPE match dst-type LOCAL to:10.XX.XX.45:8000
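
To make the difference concrete, here is a minimal sketch that runs both patterns against two of the rows from the output above instead of a live iptables chain. The variable names (rules, loose, strict) are only for illustration and are not part of keepalived.sh.

```shell
# Two sample rows copied from the iptables output above:
# one rule for port 80 and one for port 8000.
rules='DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:80 to:10.XX.XX.45:80
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8000 to:10.XX.XX.45:8000'

CHECK_PORT=80

# Current pattern: ':80' is a substring of ':8000', so both rows match.
loose=$(printf '%s\n' "$rules" | grep -c ":${CHECK_PORT}")

# Proposed pattern: 'dpt:80 ' (with the trailing space) cannot match 'dpt:8000 '.
strict=$(printf '%s\n' "$rules" | grep -c "dpt:${CHECK_PORT} ")

echo "loose=$loose strict=$strict"
```

With the loose pattern, a host that only has the port 8000 rule would still pass the check for port 80, which is the false positive described above.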

sjiveson commented 6 years ago

Hey, it looks like the problem is the use of the wait command. I remember having issues with it a while back on a different OS. I will update tomorrow.
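
For anyone following along, the "not a child of this shell" error in the log can be reproduced outside the container. This is only a sketch under an assumption: that keepalived forks into the background on its own, so the PID the script later waits on is not a direct child of the script. Here a sleep started from a subshell stands in for the daemon.

```shell
# Start a process from a subshell so that it is a grandchild of this
# shell rather than a direct child (a stand-in for a daemon that forks).
# Its output is redirected so the command substitution returns at once.
pid=$( (sleep 30 >/dev/null 2>&1 & echo $!) )

# wait only works on direct children of the current shell. For any other
# PID it fails immediately with status 127 instead of blocking.
status=0
wait "$pid" 2>/dev/null || status=$?

echo "wait returned $status for pid $pid"

# Clean up the stand-in process.
kill "$pid" 2>/dev/null || true
```

Because wait returns straight away in this situation, a script that relies on it as its last blocking call would fall through and exit, which is why polling for the process is the safer option here.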

sjiveson commented 6 years ago

Thanks for your patience. Could you try replacing lines 100-103 in the keepalived.sh file with what follows, then rebuild the container and see if that works better:

while true; do

  # Check if Keepalived is STILL running by recording its PID (if it's not running, $pid will be empty):
  pid=$(pidof keepalived)
  # If it is not, let's end our PID 1 process (this script) by breaking out of this while loop.
  # This ensures Docker 'sees' the failure and handles it as necessary.
  if [ -z "$pid" ]; then
    echo "Keepalived is no longer running, exiting so Docker can restart the container..."
    break
  fi

  # If it is, give the CPU a rest
  sleep 0.5

done

I can do so myself and test accordingly but it might be a couple of days.

sjiveson commented 6 years ago

Hey Cory, thanks again for your patience, I've made the necessary changes. Please rebuild, test as appropriate and let me know if you have any further issues. I've tested and it works for me.