acassen / keepalived

Keepalived
https://www.keepalived.org
GNU General Public License v2.0

keepalived not receiving the multicast message #1440

Closed LSChyi closed 5 years ago

LSChyi commented 5 years ago

Describe the bug: In ESXi 6.7, I set up two VMs with keepalived (see below the configuration). As this case is using the same priority, keepalived should follow the VRRP protocol and assign the MASTER role to the host with the higher IP address. However, sometimes both hosts become MASTER. I used tcpdump -Q in on both hosts to verify that the VRRP multicast packets arrive on both hosts. Although the packets arrive at each host (and on the right interface), the keepalived process on one host is not processing them, whereas on the other host I get the expected log entry Received lower prio advert 100, forcing new election. Generally, VRRP messages are processed on both hosts.
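When comparing captures from the two hosts, it can help to confirm that the adverts carry the expected virtual router ID and priority. The following is a minimal sketch that decodes the first four bytes of a VRRP payload, whose layout is the same in VRRPv2 (RFC 3768) and VRRPv3 (RFC 5798). The function name parse_vrrp_prefix is hypothetical, and the sample bytes are built by hand from the configuration in this report (virtual_router_id 15, priority 100, assuming VRRPv2), not taken from a real capture.

```python
def parse_vrrp_prefix(payload: bytes) -> dict:
    """Decode the first four bytes of a VRRP advertisement payload.

    Byte 0: version (high nibble) and type (low nibble)
    Byte 1: virtual router ID
    Byte 2: priority
    Byte 3: count of IP addresses carried in the advert
    """
    if len(payload) < 4:
        raise ValueError("VRRP payload too short")
    return {
        "version": payload[0] >> 4,
        "type": payload[0] & 0x0F,
        "vrid": payload[1],
        "priority": payload[2],
        "addr_count": payload[3],
    }


# Hand-built sample (an assumption, not captured data):
# version 2, type 1 (ADVERTISEMENT), VRID 15, priority 100, one address.
sample = bytes([0x21, 15, 100, 1])
fields = parse_vrrp_prefix(sample)
print(fields)
```

If both hosts send adverts with the same VRID and priority, the election between two masters falls back to the IP address comparison defined by the RFC.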

To Reproduce: An easy way to reproduce the problem is to swap the IPs of the interfaces used by VRRP on an already running keepalived group. To do so, change them first in the Ubuntu configuration (/etc/network/interfaces) on both hosts and then reboot the hosts.

Expected behavior: The keepalived processes on both hosts should receive the VRRP multicast messages and elect a new MASTER.

Keepalived version: Reproducible on both 1.2.24 and 2.0.19.

Distro (please complete the following information):

Details of any containerisation or hosted service (e.g. AWS): No

Configuration file: Both hosts use the same configuration

vrrp_instance failover_link {
    state MASTER
    interface ens192
    priority 100
    virtual_router_id 15
}

Notify and track scripts: No

System Log entries: The following logs were generated by keepalived 1.2.24. The logs from the host that is not receiving the VRRP multicast messages:

-- Logs begin at Tue 2019-11-19 22:04:14 EST, end at Tue 2019-11-19 22:19:27 EST. --
Nov 19 22:04:17 grouter systemd[1]: Starting Keepalive Daemon (LVS and VRRP)...
Nov 19 22:04:18 grouter Keepalived[1129]: Starting Keepalived v1.2.24 (02/14,2019)
Nov 19 22:04:18 grouter Keepalived[1129]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 19 22:04:18 grouter Keepalived[1232]: Starting Healthcheck child process, pid=1233
Nov 19 22:04:18 grouter Keepalived[1232]: Starting VRRP child process, pid=1234
Nov 19 22:04:18 grouter Keepalived_healthcheckers[1233]: Initializing ipvs
Nov 19 22:04:18 grouter Keepalived_vrrp[1234]: Registering Kernel netlink reflector
Nov 19 22:04:18 grouter Keepalived_vrrp[1234]: Registering Kernel netlink command channel
Nov 19 22:04:18 grouter Keepalived_vrrp[1234]: Registering gratuitous ARP shared channel
Nov 19 22:04:18 grouter Keepalived_vrrp[1234]: Unable to load ipset library
Nov 19 22:04:18 grouter Keepalived_vrrp[1234]: Unable to initialise ipsets
Nov 19 22:04:18 grouter Keepalived_vrrp[1234]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 19 22:04:18 grouter Keepalived_vrrp[1234]: Using LinkWatch kernel netlink reflector...
Nov 19 22:04:18 grouter Keepalived_healthcheckers[1233]: Registering Kernel netlink reflector
Nov 19 22:04:18 grouter Keepalived_healthcheckers[1233]: Registering Kernel netlink command channel
Nov 19 22:04:18 grouter Keepalived_healthcheckers[1233]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 19 22:04:18 grouter Keepalived_healthcheckers[1233]: Using LinkWatch kernel netlink reflector...
Nov 19 22:04:18 grouter systemd[1]: Started Keepalive Daemon (LVS and VRRP).
Nov 19 22:04:19 grouter Keepalived_vrrp[1234]: VRRP_Instance(failover_link) Transition to MASTER STATE
Nov 19 22:04:20 grouter Keepalived_vrrp[1234]: VRRP_Instance(failover_link) Entering MASTER STATE

And the logs from the other host, which does receive the VRRP multicast messages:

-- Logs begin at Tue 2019-11-19 22:04:10 EST, end at Tue 2019-11-19 22:30:09 EST. --
Nov 19 22:04:12 grouter systemd[1]: Starting Keepalive Daemon (LVS and VRRP)...
Nov 19 22:04:13 grouter Keepalived[1101]: Starting Keepalived v1.2.24 (02/14,2019)
Nov 19 22:04:13 grouter Keepalived[1101]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 19 22:04:13 grouter Keepalived[1213]: Starting Healthcheck child process, pid=1214
Nov 19 22:04:13 grouter Keepalived_healthcheckers[1214]: Initializing ipvs
Nov 19 22:04:13 grouter Keepalived[1213]: Starting VRRP child process, pid=1215
Nov 19 22:04:13 grouter Keepalived_vrrp[1215]: Registering Kernel netlink reflector
Nov 19 22:04:13 grouter Keepalived_vrrp[1215]: Registering Kernel netlink command channel
Nov 19 22:04:13 grouter Keepalived_vrrp[1215]: Registering gratuitous ARP shared channel
Nov 19 22:04:13 grouter Keepalived_vrrp[1215]: Unable to load ipset library
Nov 19 22:04:13 grouter Keepalived_vrrp[1215]: Unable to initialise ipsets
Nov 19 22:04:13 grouter Keepalived_vrrp[1215]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 19 22:04:13 grouter Keepalived_healthcheckers[1214]: Registering Kernel netlink reflector
Nov 19 22:04:13 grouter Keepalived_healthcheckers[1214]: Registering Kernel netlink command channel
Nov 19 22:04:13 grouter Keepalived_healthcheckers[1214]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 19 22:04:13 grouter Keepalived_healthcheckers[1214]: Using LinkWatch kernel netlink reflector...
Nov 19 22:04:13 grouter Keepalived_vrrp[1215]: Using LinkWatch kernel netlink reflector...
Nov 19 22:04:13 grouter systemd[1]: Started Keepalive Daemon (LVS and VRRP).
Nov 19 22:04:14 grouter Keepalived_vrrp[1215]: VRRP_Instance(failover_link) Transition to MASTER STATE
Nov 19 22:04:15 grouter Keepalived_vrrp[1215]: VRRP_Instance(failover_link) Entering MASTER STATE
Nov 19 22:04:19 grouter Keepalived_vrrp[1215]: VRRP_Instance(failover_link) Received lower prio advert 100, forcing new election
Nov 19 22:04:20 grouter Keepalived_vrrp[1215]: VRRP_Instance(failover_link) Received lower prio advert 100, forcing new election
Nov 19 22:04:21 grouter Keepalived_vrrp[1215]: VRRP_Instance(failover_link) Received lower prio advert 100, forcing new election
Nov 19 22:04:22 grouter Keepalived_vrrp[1215]: VRRP_Instance(failover_link) Received lower prio advert 100, forcing new election
Nov 19 22:04:23 grouter Keepalived_vrrp[1215]: VRRP_Instance(failover_link) Received lower prio advert 100, forcing new election
Nov 19 22:04:24 grouter Keepalived_vrrp[1215]: VRRP_Instance(failover_link) Received lower prio advert 100, forcing new election
Nov 19 22:04:25 grouter Keepalived_vrrp[1215]: VRRP_Instance(failover_link) Received lower prio advert 100, forcing new election
Nov 19 22:04:26 grouter Keepalived_vrrp[1215]: VRRP_Instance(failover_link) Received lower prio advert 100, forcing new election

Did keepalived coredump? No

Additional context: No

pqarmitage commented 5 years ago

You state "As this case is using the same priority, keepalived should follow the VRRP protocol and assign the MASTER role to the host with the higher IP address." This is an incorrect interpretation of the VRRP RFC(s). If a virtual router in backup state receives an advert with a priority equal to its own, it treats it as a valid advert and processes it exactly as if the priority in the advert were higher than its own. Only if the priority in the advert is less than the receiver's priority is the advert discarded.

See RFC5798 section 6.4.2 (backup state):

  (420) - If an ADVERTISEMENT is received, then:
         (425) + If the Priority in the ADVERTISEMENT is zero, then:
            (430) * Set the Master_Down_Timer to Skew_Time
         (440) + else // priority non-zero
            (445) * If Preempt_Mode is False, or if the Priority in the ADVERTISEMENT is greater than or equal to the local Priority, then:
               (450) @ Set Master_Adver_Interval to Adver Interval contained in the ADVERTISEMENT
               (455) @ Recompute the Master_Down_Interval
               (460) @ Reset the Master_Down_Timer to Master_Down_Interval
            (465) * else // preempt was true or priority was less
               (470) @ Discard the ADVERTISEMENT
            (475) *endif // preempt test
         (480) +endif // was priority zero?
      (485) -endif // was advertisement recv?
   (490) endwhile // Backup state
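The backup-state rule quoted above can be sketched in Python as follows. This only illustrates the RFC pseudocode and is not keepalived's implementation: the names BackupState and on_advert_in_backup are hypothetical, and the Master_Down_Timer is stored as a plain number rather than a running timer. Skew_Time and Master_Down_Interval use the formulas from RFC 5798 section 6.1.

```python
from dataclasses import dataclass


@dataclass
class BackupState:
    """Hypothetical container for the per-router values the RFC refers to."""
    priority: int
    preempt_mode: bool
    master_adver_interval: int  # centiseconds in VRRPv3
    master_down_timer: float = 0.0


def skew_time(priority: int, master_adver_interval: int) -> float:
    # RFC 5798: Skew_Time = ((256 - priority) * Master_Adver_Interval) / 256
    return ((256 - priority) * master_adver_interval) / 256


def master_down_interval(priority: int, master_adver_interval: int) -> float:
    # RFC 5798: Master_Down_Interval = (3 * Master_Adver_Interval) + Skew_Time
    return 3 * master_adver_interval + skew_time(priority, master_adver_interval)


def on_advert_in_backup(state: BackupState, advert_priority: int,
                        advert_interval: int) -> str:
    """Apply steps (420)-(480); return 'accepted' or 'discarded'."""
    if advert_priority == 0:
        # (430) the master is shutting down: expire quickly.
        state.master_down_timer = skew_time(state.priority,
                                            state.master_adver_interval)
        return "accepted"
    if not state.preempt_mode or advert_priority >= state.priority:
        # (450)-(460) equal priority is accepted just like higher priority.
        state.master_adver_interval = advert_interval
        state.master_down_timer = master_down_interval(state.priority,
                                                       advert_interval)
        return "accepted"
    # (470) preempt is on and the advert has a lower priority.
    return "discarded"


backup = BackupState(priority=100, preempt_mode=True, master_adver_interval=100)
print(on_advert_in_backup(backup, advert_priority=100, advert_interval=100))
print(on_advert_in_backup(backup, advert_priority=99, advert_interval=100))
```

Note that no IP address appears anywhere in this branch: a backup never compares addresses, it only compares priorities.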

The only time when IP address comparison is used to determine which virtual router should be master is if a master receives an advert with equal priority, then the one with the lower IP address drops back to backup. See RFC5798 section 6.4.3:

            (725) -* If the Priority in the ADVERTISEMENT is greater than the local Priority,
            (730) -* or
            (735) -* If the Priority in the ADVERTISEMENT is equal to
            the local Priority and the primary IPvX Address of the
            sender is greater than the local primary IPvX Address, then:
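The master-state condition quoted above can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not keepalived's code: the function name master_should_yield is hypothetical, the addresses in the example are made up, and the priority-zero branch and all timer handling from section 6.4.3 are left out. Addresses are compared numerically via the standard ipaddress module, since comparing them as strings would give wrong results.

```python
import ipaddress


def master_should_yield(local_priority: int, local_ip: str,
                        advert_priority: int, sender_ip: str) -> bool:
    """Return True if a master receiving this advert should go to backup.

    Mirrors steps (725)-(735): yield on a higher priority, or on an equal
    priority when the sender's primary address is greater than ours.
    """
    if advert_priority > local_priority:
        return True
    if advert_priority == local_priority:
        return ipaddress.ip_address(sender_ip) > ipaddress.ip_address(local_ip)
    return False


# Two masters with equal priority 100 and made-up example addresses.
print(master_should_yield(100, "192.168.1.9", 100, "192.168.1.10"))
print(master_should_yield(100, "192.168.1.10", 100, "192.168.1.9"))
```

The predicate can only be evaluated if the advert actually reaches the master, which is why two hosts that do not see each other's adverts both remain in MASTER state.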

There is no known issue of keepalived not receiving multicast adverts when the environment is correctly set up, and indeed you state "Generally, VRRP messages are processed on both hosts", which indicates that keepalived can receive them. When keepalived is not receiving adverts, especially in virtualised or containerised environments, we always find that the issue comes down to the setup of the environment.

If you wish to take this further, can you please ensure you use version v2.0.19 of keepalived rather than v1.2.24, since that is the version against which we would diagnose the problem, and keepalived works quite differently in the two versions. It will also be necessary to provide far more information than just "keepalived is not receiving multicast messages", since we need some information to work with, and there are thousands of keepalived installations that are working as expected. For example, one piece of information that could be provided is whether the output of netstat -anp shows a large Recv-Q for the keepalived receive socket, but that is just one small example.
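As a rough aid for the Recv-Q check, the following sketch pulls the keepalived sockets out of netstat -anp output. It assumes the usual net-tools column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, optional State, PID/Program name). The function name keepalived_recv_queues is hypothetical, and the SAMPLE text is made up for illustration; it is not output from the hosts in this report.

```python
def keepalived_recv_queues(netstat_output: str) -> list:
    """Return (proto, local_address, recv_q) for each keepalived socket."""
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, short lines, and sockets owned by other programs.
        if len(fields) < 6 or "keepalived" not in fields[-1]:
            continue
        if not fields[1].isdigit():
            continue
        rows.append((fields[0], fields[3], int(fields[1])))
    return rows


# Made-up sample in the assumed net-tools layout (not real data).
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      900/sshd
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           1234/keepalived
raw   212992      0 0.0.0.0:112             0.0.0.0:*               7           1234/keepalived
"""

for proto, local, recv_q in keepalived_recv_queues(SAMPLE):
    print(proto, local, recv_q)
```

A Recv-Q that stays large would suggest that packets reach the socket but are not being read, whereas a Recv-Q of zero with no adverts logged points towards the packets not being delivered to the socket at all.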

In order to see more about what is happening inside keepalived, you could build keepalived with the --enable-epoll-debug --enable-epoll-thread-dump --enable-log-file configure options, and then run keepalived with the options -D -g/tmp/keepalived.log -G --flush-log-file --debug=EvDv. This will write log output to files /tmp/keepalived*.log rather than syslog, and produce lots of debugging output showing what epoll events are received, what threads are queued and what functions are being called. From this it is possible to see whether keepalived is receiving adverts and how it is processing them.

Since this does not appear to be a keepalived issue I am closing it for now, but if you provide more information that demonstrates that keepalived is not working properly, we can reopen the issue.