geotransformer closed this issue 1 year ago
Hi, the problem could be related to netlink and MII.

From your configuration:

```
[...] MII Status: up MII Polling Interval (ms): 100 [...]
```

By default, keepalived uses netlink as the event status updater; in your case it seems you need to rely on MII polling for those interfaces. To change this behavior, take a look at this section of `man keepalived.conf`:
```
[...]
Linkbeat interfaces
    The linkbeat_interfaces block allows specifying which interfaces should
    use polling via MII, Ethtool or ioctl status rather than rely on
    netlink status updates. This allows more granular control of the global
    definition linkbeat_use_polling.

    This option is preferred over the deprecated use of
    linkbeat_use_polling in a vrrp_instance block, since the latter only
    allows using linkbeat on the interface of the vrrp_instance itself,
    whereas track_interface and virtual_ipaddresses and virtual_iproutes
    may require monitoring other interfaces, which may need to use
    linkbeat polling.

    The default polling type to use is MII, unless that isn't supported in
    which case ETHTOOL is used, and if that isn't supported then ioctl
    polling. The preferred type of polling to use can be specified with MII
    or ETHTOOL or IOCTL after the interface name, but if that type isn't
    supported, a supported type will be used.

    The syntax for linkbeat_interfaces is:

    linkbeat_interfaces {
        eth2
        enp2s0 ETHTOOL
    }
[...]
```
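Applied to your setup, a minimal sketch might look like the fragment below. The interface names are taken from your bonding status output; whether MII polling is actually the right choice for your NICs is an assumption on my part:

```
# Hypothetical keepalived.conf fragment: poll these slave interfaces via
# MII instead of relying on netlink status updates. Interface names and
# polling type are assumptions based on your /proc/net/bonding/bd1 dump.
linkbeat_interfaces {
    enp216s0f1 MII
    enp94s0f0 MII
}
```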
Finally, other users may ask you to send more logs, such as:

- the keepalived logs (run with `keepalived -D -d`)
- a tcpdump capture of the wrong behavior

MII we have - MII Polling Interval (ms): 100
```
~$ sudo cat /proc/net/bonding/bd1
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: None
Currently Active Slave: enp216s0f1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: enp216s0f1
MII Status: up
Speed: 40000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 40:a6:b7:48:5a:71
Slave queue ID: 0

Slave Interface: enp94s0f0
MII Status: up
Speed: 40000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 40:a6:b7:4d:da:c8
Slave queue ID: 0
```
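For the failover behavior, the field that matters in this dump is `Currently Active Slave`. A small sketch of pulling it out of the bonding status is shown below; the status text is inlined as sample data here so the script is self-contained, whereas on a real host you would read `/proc/net/bonding/bd1` instead:

```shell
#!/bin/sh
# Sample bonding status, inlined for illustration; on a real host,
# substitute: bond_status=$(cat /proc/net/bonding/bd1)
bond_status='Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: None
Currently Active Slave: enp216s0f1
MII Status: up'

# Extract the value after "Currently Active Slave: ".
active_slave=$(printf '%s\n' "$bond_status" | awk -F': ' '/^Currently Active Slave/ { print $2 }')
echo "$active_slave"
```

Watching this value across a failover (for example in a loop) shows which slave's MAC the bond should currently be using with `fail_over_mac active`.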
Notes for setting up a test environment:

```
ip link add bd1 type bond mode active-backup miimon 100 fail_over_mac active
for i in 1 2; do for c in down "master bd1" up; do ip link set enp0s20f0u$i $c; done; done
ip link set bd1 up
ip link add link bd1 vlan100 type vlan id 100
ip link set vlan100 up
ip addr add 10.192.1.13/24 brd + dev vlan100
```
I have tested this using a 6.2 kernel, and also on CentOS 7 (since you appear to be using RHEL 7/CentOS 7), and I cannot reproduce the problem unless the netlink receive buffers are set to only 1. When the current active slave goes down, keepalived updates the MAC addresses for both the bd1 and vlan100 interfaces, and the GARP messages that are subsequently sent use the new MAC address of the bd1 interface.
If I set

```
global_defs {
    vrrp_netlink_monitor_rcv_bufs 1
}
```

then I get log messages:

```
Netlink: Receive buffer overrun on monitor socket - (No buffer space available)
- increase the relevant netlink_rcv_bufs global parameter and/or set force
```
although when I tried it the MAC address did get updated for GARP messages.
This is not a bug in keepalived but rather a configuration error, since insufficient netlink receive buffers are being allocated. Increasing the netlink receive buffers stops the problem occurring.
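A sketch of the corrective configuration is shown below. The buffer size is purely illustrative (size it to your interface event rate), and the `_force` option is my reading of the "set force" hint in the log message above; check `man keepalived.conf` for the exact option names your version supports:

```
global_defs {
    # Illustrative value only, not a recommendation.
    vrrp_netlink_monitor_rcv_bufs 1048576
    # Assumption: force the requested size even above the kernel's
    # rmem_max limit, per the "and/or set force" log hint.
    vrrp_netlink_monitor_rcv_bufs_force true
}
```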
I tested the above with keepalived version 2.2.7. Version 1.3.9 is extremely old and version 2.0.20 is quite old. If the problem persists I suggest you update keepalived to v2.2.7 as a first step to resolving the issue.
**Describe the bug**
Keepalived broadcasts the wrong MAC address for the bond (after a bond failover).
The investigation showed that keepalived can miss some netlink events, i.e., the bond failover and MAC address update.

**To Reproduce**
1. Reduce the netlink receive buffers:
   ```
   global_defs {
       vrrp_netlink_monitor_rcv_bufs 1
   }
   ```
2. Fail over the bond interface.

**Expected behavior**

**Keepalived version**

**Distro (please complete the following information):**

**Details of any containerisation or hosted service (e.g. AWS)**
Found the issue both in a containerised keepalived and in a hosted keepalived service.

**Configuration file:**

**Notify and track scripts**

**System Log entries**

**Did keepalived coredump?**

**Additional context**