acassen / keepalived

Keepalived
https://www.keepalived.org
GNU General Public License v2.0

"no route to host" to VIP from other machines #2145

Closed Zarkaouette closed 2 years ago

Zarkaouette commented 2 years ago

Describe what you need help/support for

I'm using VMs in Azure and running a reverse proxy (nginx) on 2 instances. As it is PROD, I need this RP functionality to be highly available, so I decided to go with Keepalived. After installing and configuring it in a MASTER/BACKUP fashion, I'm able to reach the VIP only locally on the MASTER node. If I manually fail over so that the BACKUP node becomes MASTER, then again I can reach the service on the VIP locally on that node, but not from any other machine located in the same LAN.

Details of what you would like to do with keepalived

When I start the keepalived process and the VIP is assigned to a node, it should be accessible from any other VM/device within the same network.

Keepalived version v2.0.19 (10/19,2019)

Distro (please complete the following information):

Details of any containerisation or hosted service (e.g. AWS)

Both NGINX (RP) and KEEPALIVED are running directly on the host, no containerisation is used, no hosted service.

Configuration file:

MASTER configuration:

global_defs {
  router_id nginx
}

vrrp_script check_nginx {
  script "/bin/check_nginx.sh"
  interval 2
  weight 50
}

vrrp_instance VI_01 {
  state MASTER
  interface eth0
  virtual_router_id 151
  advert_int 1
  use_vmac
    vmac_xmit_base
  priority 101

  virtual_ipaddress {
    XXX.XXX.XXX.10/24
  }

  unicast_src_ip YYY.YYY.YYY.YY # Master IP

  unicast_peer {
   ZZZ.ZZZ.ZZZ.ZZZ dev eth0  # Backup IP
  }

  track_script {
    check_nginx
  }
}

The BACKUP node has the exact same configuration except for the "state" and "priority"

Notify and track scripts

Content of the "check_nginx" script:

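#!/bin/sh
# Exit non-zero when no nginx process is running, so keepalived treats the check as failed.
if [ -z "$(pidof nginx)" ]; then
  exit 1
fi
exit 0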

Logs on the MASTER

Jun 20 09:29:51 vm-express-route-prod-1 Keepalived_vrrp[49013]: Opening file '/etc/keepalived/keepalived.conf'.
Jun 20 09:29:51 vm-express-route-prod-1 Keepalived_vrrp[49013]: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
Jun 20 09:29:51 vm-express-route-prod-1 Keepalived_vrrp[49013]: SECURITY VIOLATION - scripts are being executed but script_security not enabled.
Jun 20 09:29:51 vm-express-route-prod-1 Keepalived_vrrp[49013]: (VI_01): Success creating VMAC interface vrrp.151
Jun 20 09:29:51 vm-express-route-prod-1 Keepalived_vrrp[49013]: NOTICE: setting sysctl net.ipv4.conf.all.rp_filter from 2 to 0
Jun 20 09:29:51 vm-express-route-prod-1 Keepalived_vrrp[49013]: Registering gratuitous ARP shared channel
Jun 20 09:29:51 vm-express-route-prod-1 Keepalived_vrrp[49013]: (VI_01) Entering BACKUP STATE (init)
Jun 20 09:29:51 vm-express-route-prod-1 Keepalived_vrrp[49013]: VRRP_Script(check_nginx) succeeded
Jun 20 09:29:51 vm-express-route-prod-1 Keepalived_vrrp[49013]: (VI_01) Changing effective priority from 101 to 151
Jun 20 09:29:55 vm-express-route-prod-1 Keepalived_vrrp[49013]: (VI_01) Entering MASTER STATE

Logs on the BACKUP

Jun 20 09:29:54 vm-express-route-prod-2 Keepalived_vrrp[48687]: (Line 35) Rule has no preference specified - setting to 16384. This is probably not what you want.
Jun 20 09:29:54 vm-express-route-prod-2 Keepalived_vrrp[48687]: (Line 36) Rule has no preference specified - setting to 16383. This is probably not what you want.
Jun 20 09:29:54 vm-express-route-prod-2 Keepalived_vrrp[48687]: (Line 40) Warning - cannot track route default with no interface specified, not tracking
Jun 20 09:29:54 vm-express-route-prod-2 Keepalived_vrrp[48687]: SECURITY VIOLATION - scripts are being executed but script_security not enabled.
Jun 20 09:29:54 vm-express-route-prod-2 Keepalived_vrrp[48687]: (VI_01): Success creating VMAC interface vrrp.151
Jun 20 09:29:54 vm-express-route-prod-2 Keepalived_vrrp[48687]: NOTICE: setting sysctl net.ipv4.conf.all.rp_filter from 2 to 0
Jun 20 09:29:54 vm-express-route-prod-2 Keepalived_vrrp[48687]: Registering gratuitous ARP shared channel
Jun 20 09:29:54 vm-express-route-prod-2 Keepalived_vrrp[48687]: (VI_01) Entering BACKUP STATE (init)
Jun 20 09:29:54 vm-express-route-prod-2 Keepalived_vrrp[48687]: VRRP_Script(check_nginx) succeeded
Jun 20 09:29:54 vm-express-route-prod-2 Keepalived_vrrp[48687]: (VI_01) Changing effective priority from 100 to 150

So far it's ok.

root@vm-express-route-prod-1:/home/azureuser# ip add show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether XXXXXXXXXX
    inet YYY.YYY.YYY.YY/24 brd YYY.YYY.YYY.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 XXXXXXXXX scope link
       valid_lft forever preferred_lft forever
8: vrrp.151@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether XXXXXXXXX brd ff:ff:ff:ff:ff:ff
    inet XXX.XXX.XXX.10/24 scope global vrrp.151
       valid_lft forever preferred_lft forever

From MASTER :

curl XXX.XXX.XXX.10
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

From another machine in the same LAN (here from BACKUP) :

curl XXX.XXX.XXX.10
curl: (7) Failed to connect to XXX.XXX.XXX.10 port 80: No route to host

As already described, if I stop NGINX on the MASTER, then the BACKUP node becomes MASTER. I then observe the same thing: I'm able to curl the VIP from the new MASTER but not from any other node.

I tried playing around with iptables and many other options (net.ipv4.conf.all.arp_ignore, net.ipv4.conf.all.arp_announce, virtual_routes, etc.), but nothing seems to work. Do any of you have any ideas?

pqarmitage commented 2 years ago

The logs from your backup indicate that you have not provided your complete configuration, since the entries refer to line numbers that don't exist in the configuration you have provided. If you are using ip rules, then you are using policy routing, and that could well relate to the cause of your problem. Indeed, the log messages state that there are missing priorities in the ip rule statements, and so keepalived has used a default priority, and indicated that that may not be what you want.

You are using VMACs and also unicast peers; this doesn't really make sense. If you use VMACs then you should just let keepalived multicast. Also I would remove the vmac_xmit_base.
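A minimal sketch of what the instance block might look like after those two changes, based on the configuration posted above. It is printed from a small shell function (the name print_vrrp_instance is hypothetical) so that it can be redirected to a scratch file and compared with the running configuration. It assumes that multicast VRRP actually works on the network in question, which is not guaranteed in every environment.

```shell
#!/bin/sh
# Hypothetical helper: prints the MASTER instance block with the VMAC kept
# and the unicast settings and the xmit-base option removed.
print_vrrp_instance() {
  cat <<'EOF'
vrrp_instance VI_01 {
  state MASTER
  interface eth0
  virtual_router_id 151
  advert_int 1
  use_vmac            # VMAC kept; adverts are multicast by default
  priority 101

  virtual_ipaddress {
    XXX.XXX.XXX.10/24
  }

  track_script {
    check_nginx
  }
}
EOF
}
```

For example, print_vrrp_instance > /tmp/vi_01.conf writes the fragment to a scratch file for comparison.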

If you can reach the VIP from the host on which it is configured but not from other hosts on the same LAN, then that suggests it is probably a firewall issue, either on the host where you are running keepalived or on the hosts from which you are trying to connect. When I have these problems I tend to use tcpdump or wireshark to identify where the packets are being blocked. I also find it can be useful to add firewall rules without any action, e.g. iptables -I INPUT -s YYY.YYY.YYY.YYY -d XXX.XXX.XXX.10 and then you can inspect the counters to see if matching packets are traversing the rule.
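The counting-rule and packet-capture approach described above could be sketched roughly as follows. The function names and the DRY_RUN switch are hypothetical, the addresses are the example ones used later in this thread, and the real commands need root on the host that holds the VIP.

```shell
#!/bin/sh
# Example addresses (assumptions); override via the environment if needed.
CLIENT_IP="${CLIENT_IP:-100.98.227.5}"   # host the connection is attempted from
VIP="${VIP:-100.98.227.10}"              # virtual IP held by the MASTER
IFACE="${IFACE:-eth0}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Insert a rule with no target: it changes nothing, it only counts matches.
add_counter_rule() {
  run iptables -I INPUT -s "$CLIENT_IP" -d "$VIP"
}

# List the INPUT chain with packet and byte counters, numeric output.
show_counters() {
  run iptables -L INPUT -v -n
}

# Delete the counting rule again once the test is finished.
del_counter_rule() {
  run iptables -D INPUT -s "$CLIENT_IP" -d "$VIP"
}

# Capture ARP plus any traffic to or from the VIP, with link-level headers.
watch_vip_traffic() {
  run tcpdump -n -e -i "$IFACE" arp or host "$VIP"
}
```

A possible sequence: call add_counter_rule, retry the curl from the client, then call show_counters. A packet count that stays at zero would suggest the traffic never reaches this host, in which case the capture from watch_vip_traffic may show where it stops.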

First of all I would try removing nginx from the setup and just try pinging the VIP. If that still doesn't work then I suggest you try without running keepalived and manually setting up the ip rules/routes/addresses and get that working. Once that is working then update the keepalived configuration to make sure that it is setting up the same ip rules/routes/addresses as the manual configuration you created.

If you post your manually created ip rules/routes/addresses that work, we can assist in translating those into the keepalived configuration.

Alternatively you could post the output of:

ip rule
ip route

For each table listed in the ip rule output:

ip route list table nnn

And finally:

ip address
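As a rough sketch, the commands requested above could be collected in one pass. The function names are hypothetical; the table names are taken from whatever follows the word "lookup" in the ip rule output.

```shell
#!/bin/sh
# Read `ip rule` output on stdin and print each referenced table name once.
tables_from_rules() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "lookup") print $(i + 1) }' | sort -u
}

# Dump the rules, every table they reference, and the addresses.
# All commands are read-only.
dump_routing_state() {
  echo "== ip rule =="
  ip rule
  for table in $(ip rule | tables_from_rules); do
    echo "== ip route list table $table =="
    # A table that holds no routes may report an error; keep going.
    ip route list table "$table" || true
  done
  echo "== ip address =="
  ip address
}
```

Running dump_routing_state on both nodes and posting the output (with addresses changed consistently) would cover everything asked for here; the main table is included because the default rules reference it.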

If you obfuscate any addresses etc., you will need to change them to valid addresses (e.g. 10.1.2.3), maintaining the appropriate subnet structure so that we can understand what is happening.

You should not need to add any iptables entries, or change any sysctl values, unless your existing iptables configuration is stopping what you are doing from working, or your existing sysctl values have been changed from defaults.

I am closing this issue since it is not a keepalived issue but rather a network configuration issue, but you will still be able to update the issue and we can respond further.

Zarkaouette commented 2 years ago

Hello @pqarmitage !

First of all I would like to thank you a lot for your prompt and very helpful answer! It's a real relief to have you!

Then, I admit that my knowledge on network debug & most importantly on keepalived is very limited. Let me answer you point by point. Also, for security reasons I will use a different network than the one actually in use but keep things the same.

1°) Sorry, looks like I added some extra config in the BACKUP node. Now they are the same, and the BACKUP starts with this output :

Jun 23 01:53:14 vm-express-route-prod-2 Keepalived_vrrp[885]: (VI_01) WARNING - equal priority advert received from remote host with our IP address.
Jun 23 01:53:15 vm-express-route-prod-2 Keepalived_vrrp[885]: (VI_01) WARNING - equal priority advert received from remote host with our IP address.
Jun 23 01:53:16 vm-express-route-prod-2 Keepalived_vrrp[885]: (VI_01) WARNING - equal priority advert received from remote host with our IP address.
Jun 23 01:53:17 vm-express-route-prod-2 Keepalived_vrrp[885]: (VI_01) WARNING - equal priority advert received from remote host with our IP address.
Jun 23 01:53:18 vm-express-route-prod-2 Keepalived_vrrp[885]: (VI_01) WARNING - equal priority advert received from remote host with our IP address.
Jun 23 01:53:19 vm-express-route-prod-2 Keepalived_vrrp[885]: (VI_01) WARNING - equal priority advert received from remote host with our IP address.
Jun 23 01:53:20 vm-express-route-prod-2 Keepalived_vrrp[885]: (VI_01) WARNING - equal priority advert received from remote host with our IP address.
Jun 23 01:53:21 vm-express-route-prod-2 Keepalived_vrrp[885]: (VI_01) WARNING - equal priority advert received from remote host with our IP address.
Jun 23 01:53:22 vm-express-route-prod-2 Keepalived_vrrp[885]: (VI_01) Master received advert from 100.98.227.4 with higher priority 151, ours 150
Jun 23 01:53:22 vm-express-route-prod-2 Keepalived_vrrp[885]: (VI_01) Entering BACKUP STATE

2°) Ok, deleted vmac_xmit_base from the conf. Also I deleted the unicast setting.

3°) About that :

If that still doesn't work then I suggest you try without running keepalived and manually setting up the ip rules/routes/addresses and get that working

I'm not sure about what you mean. My 2 VMs can communicate (ping/curl on whatever port) perfectly well as they are in the same network.

VM A (MASTER) has IP 100.98.227.4
VM B (BACKUP) has IP 100.98.227.5

On VM A (MASTER), I run the following : tcpdump -i eth0 host 100.98.227.5

On VM B (BACKUP) I run a simple ping command on the VIP address

I get the following result :

01:58:34.782022 IP vm-express-route-prod-1 > 100.98.227.5: VRRPv2, Advertisement, vrid 151, prio 151, authtype none, intvl 1s, length 20
01:58:35.782277 IP vm-express-route-prod-1 > 100.98.227.5: VRRPv2, Advertisement, vrid 151, prio 151, authtype none, intvl 1s, length 20
01:58:36.782426 IP vm-express-route-prod-1 > 100.98.227.5: VRRPv2, Advertisement, vrid 151, prio 151, authtype none, intvl 1s, length 20
01:58:37.782600 IP vm-express-route-prod-1 > 100.98.227.5: VRRPv2, Advertisement, vrid 151, prio 151, authtype none, intvl 1s, length 20
01:58:38.782732 IP vm-express-route-prod-1 > 100.98.227.5: VRRPv2, Advertisement, vrid 151, prio 151, authtype none, intvl 1s, length 20
01:58:39.783006 IP vm-express-route-prod-1 > 100.98.227.5: VRRPv2, Advertisement, vrid 151, prio 151, authtype none, intvl 1s, length 20
01:58:40.783153 IP vm-express-route-prod-1 > 100.98.227.5: VRRPv2, Advertisement, vrid 151, prio 151, authtype none, intvl 1s, length 20
01:58:41.783325 IP vm-express-route-prod-1 > 100.98.227.5: VRRPv2, Advertisement, vrid 151, prio 151, authtype none, intvl 1s, length 20
01:58:41.997319 ARP, Request who-has 100.98.227.5 tell vm-express-route-prod-1, length 28
01:58:41.998136 ARP, Reply 100.98.227.5 is-at 12:34:56:78:9a:bc (oui Unknown), length 28
01:58:42.783564 IP vm-express-route-prod-1 > 100.98.227.5: VRRPv2, Advertisement, vrid 151, prio 151, authtype none, intvl 1s, length 20
01:58:43.783751 IP vm-express-route-prod-1 > 100.98.227.5: VRRPv2, Advertisement, vrid 151, prio 151, authtype none, intvl 1s, length 20

4°) On both VMs, I have only the 3 default rules:

0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default

For both of them, the contents of the routing tables these rules point to are as follows. Local table for MASTER (it's the same for BACKUP, except for the src IP):

broadcast 100.98.227.0 dev eth0 proto kernel scope link src 100.98.227.4
local 100.98.227.4 dev eth0 proto kernel scope host src 100.98.227.4
local 100.98.227.10 dev eth0 proto kernel scope host src 100.98.227.4
broadcast 100.98.227.255 dev eth0 proto kernel scope link src 100.98.227.4
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1

Main table for MASTER (again, similar on BACKUP):

default via 100.98.227.1 dev eth0 proto dhcp src 100.98.227.4 metric 100
100.98.227.0/24 dev eth0 proto kernel scope link src 100.98.227.4
168.63.129.16 via 100.98.227.1 dev eth0 proto dhcp src 100.98.227.4 metric 100
169.254.169.254 via 100.98.227.1 dev eth0 proto dhcp src 100.98.227.4 metric 100

The ip route list table default command returns:

Error: ipv4: FIB table does not exist.
Dump terminated

I really hope this can help because I have no idea what is going on here.... Thanks a lot in advance for your kindness, help and patience :) Have a great day

Zarkaouette commented 2 years ago

Hello @pqarmitage

Just to let you know, this issue seems to be related to Azure itself, as they block all ARP requests (for security reasons). As it is PROD and we need reliability anyway, we went for the Azure load balancers instead.
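For anyone hitting something similar, a small sketch of how address resolution for the VIP could be inspected from a client on the same LAN. It assumes iproute2 and ping are available; the function names and the example VIP are assumptions. It only reports what the client's kernel resolved and says nothing about why.

```shell
#!/bin/sh
VIP="${VIP:-100.98.227.10}"   # example VIP from this thread

# Read `ip neigh` lines on stdin and print "<address> <state>" per entry;
# the neighbour state (REACHABLE, STALE, INCOMPLETE, FAILED, ...) is the last field.
neigh_states() {
  awk 'NF { print $1, $NF }'
}

# Send one probe to the VIP, then show the neighbour entry the kernel holds for it.
check_vip_neighbour() {
  ping -c 1 -W 1 "$VIP" >/dev/null 2>&1
  ip neigh show to "$VIP" | neigh_states
}
```

An entry that never reaches a resolved state after check_vip_neighbour would be consistent with ARP for the VIP not being answered.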

Have a great day and thanks a lot anyway for your very valuable help :)