rahultolearn59 opened this issue 6 months ago (status: Open)
@rahultolearn59, I think it would pay to send a mail to users@cloudstack.apache.org about this. I have no experience with DRAC, but I am sure there are plenty of users who do. I'm not sure whether what you describe is a matter of configuration, or whether others have a workaround for it.
Thanks, @DaanHoogland! Just FYI, this is the exception message I see in the logs:
2024-04-16 22:43:30,146 WARN [o.a.c.h.t.BaseHATask] (pool-4-thread-1:null) (logid:3e5b7727) Exception occurred while running RecoveryTask on a resource: org.apache.cloudstack.ha.provider.HARecoveryException: OOBM service is not configured or enabled for this host
Were the VMs using an HA-enabled service offering, @rahultolearn59?
Yes, @rohityadavcloud; as soon as we manually power on the host, VMs are moved to another KVM host.
Is the hypervisor host enabled with OOBM and the Host HA feature, @rahultolearn59? It's possible the configuration puts Host HA before VM HA. If your env isn't configured for Host HA, could you disable it and see whether VM HA still works?
Could you also share the steps to reproduce this? At a high level, how have you configured the env, and what is enabled/configured to reproduce this behaviour?
Thanks again for looking into this issue, @rohityadavcloud !
Please find the setup information below:
To reproduce:
Observation:
Hi team, I see the same issue in my test case. I simulated two use cases:
The ipmitool command fails because we send a powerOff command that returns rc=1. In both cases the host remains in the "fencing" state indefinitely, until I restart the server or the iDRAC becomes reachable again. As a result, in all tests every VM on this host stays on the failed host until I remove it manually from the UI.
I think some edge cases are not handled (the two use cases above), and the OOBM return code is not handled either.
I read some of your code and I think the error may be in these pieces of code:
The first function enters the fenced state (and fences all VMs) only if the result is true,
but this means we cover only the case where OOBM works, not the cases where OOBM does not work or the host is powered off.
Moreover, without a retry limit or timeout like the one recovery has, we stay in a "loop" in the second piece of code, because the function returns true every time.
My suggestions are below:
That way, if OOBM does not work, a maximum retry count eventually fences the host and its VMs; if OOBM works but the host is powered off, we fence the host immediately.
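To make the suggestion concrete, here is a minimal Java sketch of the decision it describes, taken after a fence attempt has failed. Every name in it (`FenceDecisionSketch`, `PowerState`, `Decision`, `decideAfterFailedFence`, `maxAttempts`) is hypothetical and does not come from the CloudStack code base; it only illustrates the proposed rule and is not a patch.

```java
// Hypothetical sketch of the proposed rule; none of these names exist in CloudStack.
final class FenceDecisionSketch {

    // Power state as reported by OOBM; UNREACHABLE means OOBM itself did not answer.
    enum PowerState { ON, OFF, UNREACHABLE }

    enum Decision { FENCED, RETRY }

    // Called after a fence (power-off) attempt has failed.
    static Decision decideAfterFailedFence(PowerState reported, int failedAttempts, int maxAttempts) {
        // OOBM works and reports the host is already off: nothing left to power off.
        if (reported == PowerState.OFF) {
            return Decision.FENCED;
        }
        // Retry budget exhausted (for example OOBM stays unreachable): stop looping.
        if (failedAttempts >= maxAttempts) {
            return Decision.FENCED;
        }
        // Otherwise try the fence operation again.
        return Decision.RETRY;
    }
}
```

Treating an unreachable OOBM as fenced after a fixed number of attempts is the trade-off this suggestion makes: it ends the endless "fencing" state, but it assumes the host is really down once the retries run out.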
ISSUE TYPE
COMPONENT NAME
CLOUDSTACK VERSION
CONFIGURATION
OS / ENVIRONMENT
SUMMARY
When a hypervisor with RHEL8/9 running on a Dell PowerEdge server gets into the power-off state, the ACS management server cannot fence it to recover the VMs on another hypervisor, because it keeps sending DRAC a system reset request. However, DRAC offers Power On as the only option while its server is in the power-off state. I waited 1-2 hours to confirm this. As soon as I power on the server manually, the VMs are restored on different hypervisors immediately.
I think the server's power status should be checked first, and a power-on/reset command should be sent accordingly.
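A minimal Java sketch of that idea, choosing the command from the reported power status, could look like this. The type and method names (`RecoveryOpSketch`, `PowerState`, `PowerOp`, `chooseRecoveryOp`) are assumptions made for illustration and are not CloudStack's actual API.

```java
// Hypothetical sketch; the names below are illustrative, not CloudStack classes.
final class RecoveryOpSketch {

    enum PowerState { ON, OFF, UNKNOWN }

    enum PowerOp { POWER_ON, RESET, NONE }

    // Pick the out-of-band command that matches the current power status.
    static PowerOp chooseRecoveryOp(PowerState state) {
        switch (state) {
            case OFF:
                // A powered-off server cannot be reset; it can only be powered on.
                return PowerOp.POWER_ON;
            case ON:
                // A running but unhealthy server can be reset.
                return PowerOp.RESET;
            default:
                // Status unknown: send nothing and let the caller decide.
                return PowerOp.NONE;
        }
    }
}
```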
STEPS TO REPRODUCE
EXPECTED RESULTS
ACTUAL RESULTS