ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0

Occasional false positive "down" reports from IPv6addr "monitor" action #1855

Closed: rfuchs closed this issue 1 year ago

rfuchs commented 1 year ago

The "monitor" action of IPv6addr sends an ICMPv6 echo request to the configured local address: https://github.com/ClusterLabs/resource-agents/blob/d6b954890b496fcdd8a76d7c2dd44a36fa0ad42c/heartbeat/IPv6addr.c#L630

and then expects to receive the corresponding echo reply immediately, without waiting, because it calls recvmsg() with MSG_DONTWAIT: https://github.com/ClusterLabs/resource-agents/blob/d6b954890b496fcdd8a76d7c2dd44a36fa0ad42c/heartbeat/IPv6addr.c#L647

This works fine most of the time, but occasionally, under heavy network load, the echo reply is not yet available when recvmsg() is called. The call then fails with EAGAIN, which leads to a false positive "down" event ("not running") on the resource:

Mar 28 18:53:59 sp1 pacemaker-schedulerd[41843]:  warning: Unexpected result (not running) was recorded for monitor of p_vip_neth0_v6_1 on sp1 at Mar 28 00:02:38 2023
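The failure mode can be illustrated without raw ICMPv6 sockets (which require root). The sketch below rests on stated assumptions: it uses an AF_UNIX datagram socketpair as a stand-in for the ICMPv6 socket, and the functions recv_reply_dontwait() and recv_reply_timeout() and the timeout_ms parameter are hypothetical names, not part of IPv6addr.c. It contrasts the single MSG_DONTWAIT read described above with one possible alternative, waiting in poll() for a bounded time before reading. This is not necessarily how the issue was resolved upstream.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Behaviour described in the report: one non-blocking read.
 * If the reply has not been queued yet, recv() fails with EAGAIN
 * (or EWOULDBLOCK) even though it may arrive a moment later. */
ssize_t recv_reply_dontwait(int fd, void *buf, size_t len)
{
    return recv(fd, buf, len, MSG_DONTWAIT);
}

/* Hypothetical alternative: wait up to timeout_ms for the socket to
 * become readable, then read without blocking.
 * Returns the number of bytes read, or -1 with errno set
 * (ETIMEDOUT if nothing arrived within the deadline). */
ssize_t recv_reply_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    do {
        /* Simplification: on EINTR the full timeout restarts; a real
         * agent would track the remaining time instead. */
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;          /* poll() itself failed; errno is set */
    if (rc == 0) {
        errno = ETIMEDOUT;  /* no reply within the deadline */
        return -1;
    }
    return recv(fd, buf, len, MSG_DONTWAIT);
}
```

For example, with sv filled in by socketpair(AF_UNIX, SOCK_DGRAM, 0, sv), calling recv_reply_dontwait(sv[1], ...) before the peer has sent anything returns -1 with errno set to EAGAIN, whereas recv_reply_timeout(sv[1], ..., 1000) returns the datagram as soon as it is queued. The timeout bounds how long a monitor run can block; its value, and whether a timeout should be retried before reporting "not running", would be design decisions for the agent.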