OOBM Sends the Reset Command to DRAC if Hypervisor is in PowerOff State

rahultolearn59 commented 7 months ago

ISSUE TYPE

Bug Report

COMPONENT NAME

CLOUDSTACK VERSION

CONFIGURATION

OS / ENVIRONMENT

SUMMARY

When a hypervisor running RHEL 8/9 on a Dell PowerEdge server goes into the power-off state, the ACS management server cannot fence it and recover its VMs on another hypervisor, because it keeps sending the DRAC a system reset request. However, the DRAC offers Power On as the only option while the server is powered off. I waited 1-2 hours before drawing this conclusion. As soon as I power on the server manually, the VMs are restored on different hypervisors immediately.

I think the server's power status should be checked first, and a power-on/reset command should be sent accordingly.
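
To illustrate the idea, here is a minimal sketch of choosing the Redfish reset action from the current power state. It is not CloudStack code: the class and method names (`PowerAwareReset`, `recoveryAction`, `resetPayload`) are hypothetical, and only the `PowerState` and `ResetType` string values come from the Redfish ComputerSystem schema.

```java
import java.util.Optional;

/** Hypothetical sketch: pick a Redfish ResetType from the current power state. */
final class PowerAwareReset {

    /** PowerState values defined by the Redfish ComputerSystem schema. */
    enum PowerState { ON, OFF, POWERING_ON, POWERING_OFF }

    /** Subset of the Redfish ResetType values that matter for this case. */
    enum ResetType {
        ON("On"), FORCE_OFF("ForceOff"), FORCE_RESTART("ForceRestart");

        final String redfishValue;

        ResetType(String redfishValue) { this.redfishValue = redfishValue; }
    }

    /**
     * Returns the action to send for a recovery attempt, or empty when the
     * host is mid-transition and the caller should poll the state again.
     */
    static Optional<ResetType> recoveryAction(PowerState state) {
        switch (state) {
            case ON:  return Optional.of(ResetType.FORCE_RESTART); // reset is valid
            case OFF: return Optional.of(ResetType.ON);            // only power-on is valid
            default:  return Optional.empty();                     // PoweringOn / PoweringOff
        }
    }

    /** Builds the JSON body for the ComputerSystem.Reset action. */
    static String resetPayload(ResetType type) {
        return "{\"ResetType\": \"" + type.redfishValue + "\"}";
    }

    public static void main(String[] args) {
        // For a powered-off host this prints {"ResetType": "On"}
        recoveryAction(PowerState.OFF)
            .ifPresent(t -> System.out.println(resetPayload(t)));
    }
}
```

With a check like this, a powered-off host would get a power-on request instead of a reset request that the DRAC rejects.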

STEPS TO REPRODUCE

Power off the hypervisor.

EXPECTED RESULTS

VMs should be recovered on an active hypervisor.

ACTUAL RESULTS

The VMs were not recovered.

boring-cyborg[bot] commented 7 months ago

Thanks for opening your first issue here! Be sure to follow the issue template!

DaanHoogland commented 7 months ago

@rahultolearn59, I think it pays to send a mail to users@cloudstack.apache.org about this. I have no experience with DRAC, but I am sure there are plenty of users who do. I'm not sure whether what you describe is a matter of configuration, or whether others have a workaround for it.

rahultolearn59 commented 7 months ago

Thanks, @DaanHoogland! Just FYI, this is the exception message I see in the logs:

2024-04-16 22:43:30,146 WARN [o.a.c.h.t.BaseHATask] (pool-4-thread-1:null) (logid:3e5b7727) Exception occurred while running RecoveryTask on a resource: org.apache.cloudstack.ha.provider.HARecoveryException: OOBM service is not configured or enabled for this host
org.apache.cloudstack.ha.provider.HARecoveryException: OOBM service is not configured or enabled for this host
    at org.apache.cloudstack.kvm.ha.KVMHAProvider.recover(KVMHAProvider.java:83)
    at org.apache.cloudstack.kvm.ha.KVMHAProvider.recover(KVMHAProvider.java:42)
    at org.apache.cloudstack.ha.task.RecoveryTask.performAction(RecoveryTask.java:43)
    at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:86)
    at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:83)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.cloudstack.utils.redfish.RedfishException: Failed to get System power state for host 'GET' with request 'https:///redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset: '. The expected HTTP status code is '2XX' but it got '409'.

rohityadavcloud commented 7 months ago

Were the VMs using an HA-enabled service offering, @rahultolearn59?

rahultolearn59 commented 7 months ago

Yes, @rohityadavcloud; as soon as we manually power on the host, VMs are moved to another KVM host.

rohityadavcloud commented 7 months ago

Is the hypervisor host enabled with OOBM and the Host-HA feature, @rahultolearn59? It's possible the configuration puts Host HA before VM HA. If your env isn't configured for Host HA, could you disable it and see whether VM HA still works?

Could you also share the steps to reproduce this? At a high level, how have you configured the env, and what is enabled/configured to reproduce this behaviour?

rahultolearn59 commented 7 months ago

Thanks again for looking into this issue, @rohityadavcloud !

Please find the setup information below:

ACS 4.19 on RHEL9
4 KVM (RHEL8) hosts (Dell PowerEdge R740)
All KVM hosts have OOBM enabled
All the 3 VMs sitting in the cluster are HA-Enabled
Using NFS as primary storage pool

To reproduce:

Power off the KVM host (hosting one or more VMs), either with init 0 or from the DRAC console
The expectation is that the VMs on that host are started on a different host in the cluster

Observation:

ACS tries to reset the KVM host via a DRAC command (Redfish API) before fencing it. But since the host is already powered off, the DRAC has Power On as the only available option
ACS seems to get into some sort of loop here

tanganellilore commented 3 months ago

Hi team, same issue in my test case. I simulated two use cases:

a disruption caused by powering off the server: the ipmitool command fails because we send a powerOff command that returns rc=1.
a disruption where the machine is powered off and the iDRAC is not reachable: the iDRAC command fails because the iDRAC cannot be reached.

In both cases the host remains in "fencing" indefinitely, until I restart the server or the iDRAC becomes reachable. Obviously, all VMs on this host stay on the failed host in all tests, until I remove it manually from the UI.

I think some edge cases are not covered (the two use cases above), and the OOBM return code is not handled.

I read some of your code and I think the error could be in these pieces of code:

https://github.com/apache/cloudstack/blob/b215abc30a22d6b11f016b8f402981445140f577/server/src/main/java/org/apache/cloudstack/ha/task/FenceTask.java#L48-L53

originating from https://github.com/apache/cloudstack/blob/b215abc30a22d6b11f016b8f402981445140f577/server/src/main/java/org/apache/cloudstack/ha/HAManagerImpl.java#L523-L529

The first function enters the fenced state (and fences all VMs) only if the result is true, which means we cover only the case where OOBM works, not the case where OOBM does not work or the host is in the powerOff state. Moreover, without a retry limit or timeout, like the one recover has, we stay in a "loop" in the second piece of code, because the function returns true every time.

My suggestions are below:

introduce a new parameter for the maximum number of retries in the fencing state (or a maximum fencing time)
better handling of powerOFF, e.g. check whether the machine's status is "ON" and only then power it off, if required

That way, if OOBM does not work, the maximum retry count will fence out the host and its VMs; if OOBM works but the host is already in powerOFF, we fence out the host immediately.
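
A rough sketch of what I mean, just to make the two suggestions concrete. None of these names exist in CloudStack: `BoundedFencePolicy`, `OobmResult`, `Decision` and the `maxRetries` setting are all made up for illustration, and the real logic would have to live in the HA provider and `FenceTask`.

```java
/** Hypothetical sketch of a bounded fencing decision; not CloudStack code. */
final class BoundedFencePolicy {

    /** Assumed outcomes of one out-of-band power-off attempt. */
    enum OobmResult { POWERED_OFF, ALREADY_OFF, COMMAND_FAILED, UNREACHABLE }

    /** What the HA state machine should do next. */
    enum Decision { FENCED, RETRY, FENCED_AFTER_MAX_RETRIES }

    private final int maxRetries;   // hypothetical new setting
    private int failedAttempts = 0;

    BoundedFencePolicy(int maxRetries) {
        if (maxRetries < 1) {
            throw new IllegalArgumentException("maxRetries must be >= 1");
        }
        this.maxRetries = maxRetries;
    }

    /** Records one attempt and returns the resulting decision. */
    Decision onAttempt(OobmResult result) {
        // A host that is off, or was already off, counts as fenced immediately.
        if (result == OobmResult.POWERED_OFF || result == OobmResult.ALREADY_OFF) {
            return Decision.FENCED;
        }
        // OOBM failed or was unreachable: retry, but only up to the limit.
        failedAttempts++;
        return failedAttempts >= maxRetries
                ? Decision.FENCED_AFTER_MAX_RETRIES
                : Decision.RETRY;
    }

    public static void main(String[] args) {
        BoundedFencePolicy policy = new BoundedFencePolicy(2);
        System.out.println(policy.onAttempt(OobmResult.UNREACHABLE)); // RETRY
        System.out.println(policy.onAttempt(OobmResult.UNREACHABLE)); // FENCED_AFTER_MAX_RETRIES
    }
}
```

One caveat with the retry limit: when OOBM is unreachable there is no confirmation that the host is really off, so restarting its VMs elsewhere on shared storage could leave two copies of a VM writing to the same disk. That is why the sketch keeps `FENCED_AFTER_MAX_RETRIES` separate from `FENCED`, so an operator or an extra storage-level check can treat the two differently.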
