apache / cloudstack

Apache CloudStack is an opensource Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

OOBM Sends the Reset Command to DRAC if Hypervisor is in PowerOff State #8918

Open rahultolearn59 opened 6 months ago

rahultolearn59 commented 6 months ago
ISSUE TYPE
COMPONENT NAME
CLOUDSTACK VERSION
CONFIGURATION
OS / ENVIRONMENT
SUMMARY

When a hypervisor running RHEL 8/9 on a Dell PowerEdge server enters the power-off state, the ACS management server cannot fence it and recover its VMs on another hypervisor, because it keeps sending a system reset request to the DRAC. However, the DRAC offers power-on as the only option while the server is powered off. I waited 1-2 hours before drawing this conclusion. As soon as I power on the server manually, the VMs are restored on different hypervisors immediately.

I think the server's power status should be checked first, and a power-on/reset command should be sent accordingly.
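The suggestion above could look roughly like the following. This is a minimal sketch, not CloudStack code: the class `PowerActionChooser` and its methods are hypothetical names introduced for illustration. The only assumption taken from outside this thread is that Redfish reports `PowerState` values such as `On` and `Off` and accepts `ResetType` values such as `On` and `ForceRestart`.

```java
import java.util.Locale;

/** Hypothetical helper, not part of CloudStack: picks a Redfish ResetType from the reported PowerState. */
final class PowerActionChooser {

    /** Simplified view of the power state reported by the BMC. */
    enum PowerState { ON, OFF, UNKNOWN }

    /** Maps the raw PowerState string from the BMC to the simplified enum. */
    static PowerState parse(String raw) {
        if (raw == null) {
            return PowerState.UNKNOWN;
        }
        switch (raw.trim().toLowerCase(Locale.ROOT)) {
            case "on":
                return PowerState.ON;
            case "off":
                return PowerState.OFF;
            default:
                // Transitional or unrecognised states are not acted on here.
                return PowerState.UNKNOWN;
        }
    }

    /** Chooses the reset action: power on a host that is off, restart a host that is on. */
    static String resetTypeFor(PowerState state) {
        switch (state) {
            case OFF:
                return "On";
            case ON:
                return "ForceRestart";
            default:
                throw new IllegalStateException("Power state unknown; refusing to pick an action");
        }
    }
}
```

With this shape, a host reported as `Off` would receive `On` instead of a reset request that the DRAC rejects, while an unknown state is surfaced as an error rather than guessed at.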

STEPS TO REPRODUCE
Power off the hypervisor.
EXPECTED RESULTS
VMs should be recovered on an active hypervisor.
ACTUAL RESULTS
The VMs were not recovered.
boring-cyborg[bot] commented 6 months ago

Thanks for opening your first issue here! Be sure to follow the issue template!

DaanHoogland commented 6 months ago

@rahultolearn59, I think it would pay to send a mail to users@cloudstack.apache.org about this. I have no experience with DRAC, but I am sure plenty of users do. I am not sure whether what you describe is a matter of configuration, or whether others have a workaround for it.

rahultolearn59 commented 6 months ago

Thanks, @DaanHoogland! Just FYI, this is the exception message I see in the logs:

2024-04-16 22:43:30,146 WARN [o.a.c.h.t.BaseHATask] (pool-4-thread-1:null) (logid:3e5b7727) Exception occurred while running RecoveryTask on a resource: org.apache.cloudstack.ha.provider.HARecoveryException: OOBM service is not configured or enabled for this host
org.apache.cloudstack.ha.provider.HARecoveryException: OOBM service is not configured or enabled for this host
    at org.apache.cloudstack.kvm.ha.KVMHAProvider.recover(KVMHAProvider.java:83)
    at org.apache.cloudstack.kvm.ha.KVMHAProvider.recover(KVMHAProvider.java:42)
    at org.apache.cloudstack.ha.task.RecoveryTask.performAction(RecoveryTask.java:43)
    at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:86)
    at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:83)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.cloudstack.utils.redfish.RedfishException: Failed to get System power state for host 'GET' with request 'https:///redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset: '. The expected HTTP status code is '2XX' but it got '409'.

rohityadavcloud commented 6 months ago

Were the VMs using an HA-enabled service offering, @rahultolearn59?

rahultolearn59 commented 6 months ago

Yes, @rohityadavcloud; as soon as we manually power on the host, VMs are moved to another KVM host.

rohityadavcloud commented 6 months ago

Is the hypervisor host enabled with OOBM and the Host HA feature, @rahultolearn59? It's possible the configuration puts Host HA before VM HA. In case your env isn't configured for Host HA, could you disable it and see if VM HA still works?

Could you also share the steps to reproduce this? At a high level, how have you configured the env, and what is enabled/configured to reproduce this behaviour?

rahultolearn59 commented 6 months ago

Thanks again for looking into this issue, @rohityadavcloud !

Please find the setup information below:

To reproduce:

Observation:

tanganellilore commented 2 months ago

Hi team, I see the same issue in my tests when I simulate two use cases:

In both cases the host remains in "fencing" indefinitely, until I restart the server or the iDRAC becomes reachable again. Obviously, all VMs on this host stay on the failed host in all tests, until I remove it manually from the UI.

I think some edge cases (the two use cases above) are not considered, and the OOBM return code is not handled.

I read some of your code and I think the error may be in these pieces of code:

https://github.com/apache/cloudstack/blob/b215abc30a22d6b11f016b8f402981445140f577/server/src/main/java/org/apache/cloudstack/ha/task/FenceTask.java#L48-L53

originating from https://github.com/apache/cloudstack/blob/b215abc30a22d6b11f016b8f402981445140f577/server/src/main/java/org/apache/cloudstack/ha/HAManagerImpl.java#L523-L529

The first function enters the fenced state (and fences all VMs) only if the result is true, which means we only cover the case where OOBM works, not the case where OOBM does not work or the host is in the power-off state. Moreover, without a retry limit or timeout, as used for recovery, we stay in a "loop" on the second piece of code, because the function returns true every time.

My suggestions are below:

That way, if OOBM does not work, a maximum retry count ensures the host and its VMs are eventually fenced; if OOBM works but the host is powered off, the host is fenced immediately.
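The decision logic described here can be sketched as below. This is only an illustration under stated assumptions, not the actual `FenceTask` or `HAManagerImpl` code: `FenceDecisionSketch`, `OobmResult`, `Decision` and the `maxRetries` parameter are hypothetical names. The handling of a host that is still powered on (power it off first, then fence) is an assumption added to make the sketch complete.

```java
/** Hypothetical sketch, not CloudStack code: decides what a single fence attempt should do. */
final class FenceDecisionSketch {

    /** What the out-of-band management check reported (illustrative names). */
    enum OobmResult { POWERED_OFF, POWERED_ON, UNREACHABLE }

    /** Possible outcomes of one fence attempt. */
    enum Decision { FENCE_NOW, POWER_OFF_THEN_FENCE, RETRY_LATER }

    /**
     * @param result         outcome of the OOBM power-state query
     * @param failedAttempts number of earlier attempts where OOBM was unreachable
     * @param maxRetries     retry budget before the host is fenced anyway
     */
    static Decision decide(OobmResult result, int failedAttempts, int maxRetries) {
        switch (result) {
            case POWERED_OFF:
                // OOBM works and the host is already off: fence immediately.
                return Decision.FENCE_NOW;
            case POWERED_ON:
                // Assumption: the host must be powered off before it is fenced.
                return Decision.POWER_OFF_THEN_FENCE;
            default:
                // OOBM unreachable: count this attempt and give up once the budget is spent.
                return (failedAttempts + 1 >= maxRetries)
                        ? Decision.FENCE_NOW
                        : Decision.RETRY_LATER;
        }
    }
}
```

For example, with `maxRetries` set to 3, an unreachable OOBM interface would yield `RETRY_LATER` on the first two attempts and `FENCE_NOW` on the third, so the host could no longer stay in "fencing" forever. Whether fencing without a confirmed power-off is safe for shared storage is a separate question that this sketch does not address.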