Closed mikaku closed 4 years ago
As you identify, there was a network problem with mdb1. It looks to me also as though VRRP adverts send from mdb2 and mdb3 to mdb1 have been queued while mdb2 and mdb3 are unable to communicate with mdb1.
How can be possible that server mdb3
was sending adverts if it switched to BACKUP at 09:16:47?
At 09:16:47 mdb3 became master and sent and advert to both mdb1 and mdb2. The advert to mdb1 was unable to be delivered at that time and it was queued somewhere. When network connectivity to mdb1 was restored, the queued packet was delivered, so at 09:16:54 keepalived on mdb1 received the advert sent by mdb3 at 09:15:47.
Why server mdb1
with a priority 150 needs so many elections to switch from BACKUP to MASTER?
As above, the VRRP adverts from mdb2 to mdb1 were queued while mdb2 could not communicate with mdb1. When communications were re-established, the adverts send by mdb2 at 09:16:47, 09:16:48, 09:16:40, 09:16:50, 09:16:51, 09:16:52 and 09:16:53 were received by keepalived on mdb1 at 09:16:54.
By default, whenever keepalived receives a lower priority advert it send gratuitous ARP messages to ensure that the ARP caches on other systems/switches have the correct MAC address for the VIPs (if not using VMACs), and when using VMACs that switches update their caches to have the correct port for the virtual MAC address(es). Also by default when gratuitous ARP messages are send, a second block is sent 5 seconds later, which is what the log entries at 09:16:59 relate to.
What you are seeing in the logs is not 8 elections to be master, in fact mdb1 remains a master throughout, since until 09:16:54 it doesn't see any adverts, and then at 09:16:54 it sees multiple adverts, and since they are all lower priority it simply send ARP messages to ensure the ARP caches are correct.
One question is why does mdb3 take over as master before mdb2. Since the difference in priority between mdb2 and mdb3 is only 1, the time difference between when mdb2 and mdb3 would take over as master is 1/256 second, i.e. about 4ms. With delays in the network which you appear to have, it is possible that the last advert mdb2 received from mdb1 (at 09:16:44) was delayed fractionally, and so at 09:16:47 it timed out fractionally after mdb3 timed out. If you adjust the priorities so that mdb1 is 250, mdb2 is 125 and mdb3 is 1, then you would have more certainty that mdb2 would take over before mdb3 (you could even have priorities 254, 253 and 1 with your particular configuration).
It looks to me also as though VRRP adverts send from mdb2 and mdb3 to mdb1 have been queued while mdb2 and mdb3 are unable to communicate with mdb1.
Ahh, that explains it all. :smiley:
And regarding why mdb3
takes over as master before mdb2
, was always a constant question in me, now it is also clear. I'll follow your recommendations on the priorities values.
Thank you very much!
Describe the issue In fact there are two issues here:
In short the servers are as follows (full configuration below):
mdb1
(the preferred MASTER, withnopreempt
) and with a priority of 150.mdb2
(BACKUP) with priority 50.mdb3
(BACKUP) with priority 49.This is what happened:
At 09:16:47 the
mdb3
server switched to MASTER as it didn't receive any advert frommdb1
(the default MASTER server).Then almost immediately, server
mdb2
also switched to MASTER by the same reason and, still in the same second, the servermdb3
received an advert from the new MASTER servermdb2
who forced elections and won due to its higher priority. So, servermdb3
switched back to BACKUP.The following are the logs in servers
mdb3
andmdb2
that explains all this:Full log from server
mdb3
(IPs substituted by<names>
):Full log from server
mdb2
(IPs substituted by<names>
):So far, so good. Nothing special.
In that moment we have:
mdb1
with no connectivity and thinks is MASTER.mdb2
as the official MASTER.mdb3
as the BACKUP.Some seconds later, at 09:16:54, server
mdb1
recovered network connectivity and received an advert from a MASTER server, and as you can see the priority is 49 which means that it wasmdb3
sending adverts?First part of the log from
mdb1
(IPs substituted by<names>
):Questions
mdb3
was sending adverts if it switched to BACKUP at 09:16:47?mdb1
with a priority 150 needs so many elections to switch from BACKUP to MASTER?This is specially annoying since on every new lower priority received, the server runs the
notify_master_rx_lower_pri
script which in turn resets the virtual (floating) IP excessive times.Last part of the log from
mdb1
, mostly irrelevant (IPs substituted by<names>
):To Reproduce Any steps necessary to reproduce the behaviour:
Expected behavior I expected to have only one election between two servers and the one with greater priority so win clearly.
Keepalived version
Distro (please complete the following information):
Details of any containerisation or hosted service (e.g. AWS) Hetzner cloud servers.
Configuration file: Configuration file for server
mdb1
:Configuration file for server
mdb2
:Configuration file for server
mdb3
: The same asmdb2
changing onlyrouter_id HAP3_L2
andpriority 49
, and of course, the IPs in theunicast_src
andunicast_peer
options.