Enhancement request for Azure Fence Agent (kdump feature)

grantmarcroft commented 2 years ago

In the Azure cloud, a production system outage often requires operations teams to contact the cluster software vendor for a root cause analysis of the unexpected reboot. stonith:external/sbd has a unique "crash" feature that kdumps an unhealthy node, making deeper failure analysis possible. The Azure Fence Agent has no such feature.

What makes this feature possible from the platform perspective is the ability to trigger an NMI, which covers the case of the OS being too unresponsive to handle magic sysrq.

https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/serial-console-nmi-sysrq#non-maskable-interrupt-nmi
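For illustration, the same NMI can be requested from the command line instead of the portal. The sketch below assumes the Azure CLI with its `serial-console` extension (and that extension's `az serial-console send nmi` command) is installed and that boot diagnostics are enabled on the VM. The helper names `build_nmi_command` and `send_nmi`, as well as the resource group and VM names, are hypothetical and not part of fence_azure_arm.

```python
import subprocess


def build_nmi_command(resource_group, vm_name):
    """Build the Azure CLI argument list that asks the platform to send an NMI.

    Assumes the 'serial-console' CLI extension is installed (an assumption,
    not something fence_azure_arm does today).
    """
    return [
        "az", "serial-console", "send", "nmi",
        "--resource-group", resource_group,
        "--name", vm_name,
    ]


def send_nmi(resource_group, vm_name, timeout=60):
    """Run the command; return True if the CLI exited with status 0."""
    result = subprocess.run(
        build_nmi_command(resource_group, vm_name),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0
```

With kdump configured and the kernel set to panic on an unknown NMI, a call such as `send_nmi("my-rg", "node1")` would be expected to start a crash dump on the guest. A zero exit status only means the request was accepted, not that the dump completed.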

Whether the kdump succeeds or fails due to a hypervisor issue, fence_azure_arm should eventually deallocate the node as per its current function.
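The requested behaviour can be sketched as follows. This is a minimal illustration, not the agent's actual code: `trigger_crash`, `wait_for_dump`, and `deallocate` are hypothetical callables standing in for the NMI request, some dump-completion check, and the existing deallocate action.

```python
def fence_with_crash(trigger_crash, wait_for_dump, deallocate):
    """Try to collect a crash dump, then deallocate the node no matter what.

    All three arguments are hypothetical callables supplied by the caller.
    """
    dump_ok = False
    try:
        trigger_crash()
        dump_ok = bool(wait_for_dump())
    except Exception:
        # A failed NMI request or dump check must not block fencing.
        dump_ok = False

    # Deallocation always runs, so the fencing guarantee is unchanged.
    fenced = bool(deallocate())
    return {"dump_collected": dump_ok, "fenced": fenced}
```

The point of this design is that the dump is best effort, while the result reported to the cluster depends only on the deallocation.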

At the time of writing, some Azure users use a fence topology that first attempts stonith:external/sbd fencing with "crash" and then falls back to stonith:fence_azure_arm in case SBD (or kdump itself) fails.
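As a rough model of how such a topology is evaluated, the sketch below tries levels in ascending order and treats a level as successful only if every device in it succeeds. It is a simplification of Pacemaker's behaviour, and the device names and the `agents` mapping are hypothetical placeholders.

```python
def run_topology(levels, agents, target):
    """Return the number of the first level that fences `target`, else None.

    levels: dict mapping level number -> list of device names
    agents: dict mapping device name -> callable(target) returning True/False
    """
    for level in sorted(levels):
        # all() stops at the first failing device in the level,
        # after which the next level is tried.
        if all(agents[device](target) for device in levels[level]):
            return level
    return None


# Hypothetical two-level setup matching the description above.
example_levels = {1: ["sbd_crash"], 2: ["fence_azure_arm"]}
example_agents = {
    "sbd_crash": lambda node: False,       # pretend SBD fencing failed
    "fence_azure_arm": lambda node: True,  # backup succeeds
}
winning_level = run_topology(example_levels, example_agents, "node1")
```

Here `winning_level` is 2, because the first level failed and the backup level succeeded.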

The additional cost of SBD storage device(s) on the platform could be eliminated with this feature.

Before someone else recommends it: stonith:fence_kdump doesn't kdump a node. It is a reactive fence agent used to pause STONITH long enough to collect a kdump before the "shutoff switch is flipped" in the event of a kernel panic.
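To make the "reactive" part concrete, here is a simplified sketch of the waiting logic: listen on a UDP socket and report success only if a datagram from the target node arrives before the timeout. It is an illustration, not fence_kdump's real code, which does more than compare the sender address. The function name and parameters are hypothetical.

```python
import socket
import time


def wait_for_kdump_notice(sock, expected_ip, timeout):
    """Return True if a datagram from expected_ip arrives within `timeout` seconds.

    `sock` is an already-bound IPv4 UDP socket. Datagrams from other
    senders are ignored and do not extend the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        sock.settimeout(remaining)
        try:
            _data, (sender_ip, _port) = sock.recvfrom(1024)
        except socket.timeout:
            return False
        if sender_ip == expected_ip:
            return True
```

Nothing in this loop causes a crash. It only observes whether the target has already panicked and entered its kdump kernel, which is why the agent cannot replace the "crash" feature.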

oalbrigt commented 2 years ago

Do you have a link to the sbd agent? I think it's SUSE-specific, so I don't know where to find its source.

grantmarcroft commented 2 years ago

Hello Oyvind.

Here it is: https://github.com/ClusterLabs/fence-agents/blob/main/agents/sbd/fence_sbd.py

SBD source here: https://github.com/ClusterLabs/sbd

And manual page describing crash functionality here:

https://github.com/ClusterLabs/sbd/blob/main/man/sbd.8.pod.in

Grant

wenningerk commented 2 years ago

Hmm ... I know working on the sbd setup wasn't actually what you were asking for, but maybe discussing it helps us get ahead somehow. Do you have details of the sbd topology setup you mentioned? Having sbd on one level and simply fence_azure_arm on the next sounds a bit dangerous to me, especially as you describe it as a backup mechanism. The fence agent can only verify that writing the poison pill to the device went OK. The fence target has to ensure by itself that it either reads the poison pill within a timeout or reliably suicides if it can't.

What I could imagine would be poison pill & fence_kdump on one level and fence_azure_arm as backup. That should reliably check whether crashing the node worked, even without a watchdog device that is considered reliable enough for sbd. (Are we talking about Azure bare metal with a supported hardware watchdog, or some setup with softdog that might not be supported together with sbd, depending on the distro?)

As an alternative to the poison pill, I could imagine an sbd configuration without disks but without telling pacemaker that sbd is there (stonith-watchdog-timeout = 0, or banning fence_watchdog from all nodes on newer pacemaker versions that support making the hidden fence_watchdog, which had always been there with watchdog fencing, visible as an explicit fencing resource). A topology of fence_kdump on one level and fence_azure_arm on the next should then give the target node enough time to suicide with a kdump and verify that this worked, or fail if it didn't and fall through to Azure fencing.

I haven't tried either of those; they are just ideas ...

ClusterLabs / fence-agents

Enhancement request for Azure Fence Agent (kdump feature) #509