apache / cloudstack

Apache CloudStack is an opensource Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Host HA not working even after configured oob-management #7543

Open yashi4engg opened 1 year ago

yashi4engg commented 1 year ago

Host HA is not working even after configuring OOB management.

Backend config: the hypervisor and management node OS is OEL9, and OOB management is configured to use IPMI for HA.

In the logs I can see that it gets the power status successfully and puts the node in Alert state, but nothing happens after that, such as putting the node in Disconnected state or moving the VMs.

kiranchavala commented 1 year ago

@yashi4engg

Could you please provide more details regarding this issue, with proper formatting,

or the steps to reproduce it?

yashi4engg commented 1 year ago

@kiranchavala I am trying to set up Host HA. I have 3 hypervisors in the cluster, and on one of them I configured OOB management and enabled Host HA. When that host is powered off or otherwise fails, the cluster is able to get the power status and marks it as Off in the UI, but it does not try to bring the host back up and does not mark it as Disconnected in the UI. As a result, the VMs on that hypervisor keep showing as Running and are not migrated, even though the VMs are marked HA enabled in the template.

Our requirement: as soon as any hypervisor goes down or crashes, the VMs running on it should migrate to another hypervisor by default.

Our config: OS: OEL9; OOB management: IPMI; hardware: Dell R series servers.

Steps to reproduce: power off the hypervisor from iDRAC.

kiranchavala commented 1 year ago

@yashi4engg which hypervisor are you using?

Is HA enabled on the compute offering (Service Offerings > Compute Offerings)?

yashi4engg commented 1 year ago

@kiranchavala The compute offering has HA enabled. Hypervisor type: KVM on OEL9.

If I destroy a VM manually from the hypervisor, it comes back up as part of HA. But if the hypervisor itself is powered off, the VM is not migrated to another node and keeps showing as Running in the UI.

weizhouapache commented 1 year ago

@yashi4engg is this same as #7520 ?

yashi4engg commented 1 year ago

No, in this issue I am looking at host HA, not VM HA. I configured OOB and am able to get the power status with it, but CloudStack is not putting the host in maintenance mode or trying to recover it. Because of that, the VMs that were running on it still show as Running in CloudStack and never migrate.

DaanHoogland commented 1 year ago

@yashi4engg, I am trying to reproduce this using a nested environment. In my case the host HA state cycled through Suspect-Checking-Degraded. This happens both when I shut the host down out of band and when I stop the agent. I see no attempt to fence or reboot. I also see no Alert state. Can you add more info on the sequence of states and events?

yashi4engg commented 1 year ago

In my case the events are as follows.

As soon as I power off the host, CloudStack keeps checking the host and VM status as per its config. It is able to get the host power status using ipmitool and marks the node's power status as Down/Off in the UI, but after that it does not try to fence or recover the node. Because of that, all VMs that were running on that host keep showing as Running in the UI, and the host also keeps showing as Enabled.

DaanHoogland commented 1 year ago

What I mean is the HA statuses in the log. Do you have a record of those? The UI shows less.

DaanHoogland commented 1 year ago

@yashi4engg I think I figured out what the problem is. With CentOS 7 this works unless you install qemu-kvm-ev. The problem is that the activity check does not give conclusive answers, so it can't know whether the host is still active and won't move to the Recovering or Fencing states. Look for messages containing ActivityCheckFailureUnderThresholdRatio. Do you see these as well? The short version: Host HA is not supported for EL versions above 7 (or with qemu upgraded to a higher qemu-kvm-ev version).

works:

qemu-img.x86_64                          10:1.5.3-175.el7_9.6           @updates
qemu-kvm.x86_64                          10:1.5.3-175.el7_9.6           @updates
qemu-kvm-common.x86_64                   10:1.5.3-175.el7_9.6           @updates

doesn't work:

qemu-img-ev.x86_64                       10:2.12.0-44.1.el7_8.1         @centos-qemu-ev
qemu-kvm-common-ev.x86_64                10:2.12.0-44.1.el7_8.1         @centos-qemu-ev
qemu-kvm-ev.x86_64                       10:2.12.0-44.1.el7_8.1         @centos-qemu-ev
yashi4engg commented 1 year ago

@DaanHoogland We are using OEL9 and haven't installed any of these packages:

qemu-img-ev.x86_64                       10:2.12.0-44.1.el7_8.1         @centos-qemu-ev
qemu-kvm-common-ev.x86_64                10:2.12.0-44.1.el7_8.1         @centos-qemu-ev
qemu-kvm-ev.x86_64                       10:2.12.0-44.1.el7_8.1         @centos-qemu-ev

We have the packages below installed:

qemu-kvm-7.2.0-14.el9_2.x86_64
qemu-img-7.2.0-14.el9_2.x86_64
qemu-kvm-common-7.2.0-14.el9_2.x86_64

DaanHoogland commented 1 year ago

@DaanHoogland We are using OEL9 and haven't installed any of package

I understand, @yashi4engg. I did my testing on both OL9 and CentOS 7; the first didn't work, the latter did. The reason is that activity checking failed (where ACS asks a host next to the suspect to check for activity).

This is not implemented for the qemu implementations on newer systems. I do not understand the versioning of qemu and it is confusing that 7.2 is on a newer system than 10:2.12 but I think you must ignore the 10: in this.
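On the versioning confusion: in the el7 listings above, the `10:` prefix is the RPM epoch, a separate field that precedes the upstream version. So `10:2.12.0` is QEMU 2.12.0, which is older upstream than 7.2.0. A minimal Python sketch that splits such strings; `split_evr` and `upstream_tuple` are hypothetical helper names for illustration, not part of any RPM tooling:

```python
from typing import Tuple


def split_evr(evr: str) -> Tuple[int, str, str]:
    """Split an RPM 'epoch:version-release' string into its three parts."""
    epoch, sep, rest = evr.partition(":")
    if not sep:
        # No epoch present in the string; RPM treats a missing epoch as 0.
        epoch, rest = "0", evr
    version, _, release = rest.partition("-")
    return int(epoch), version, release


def upstream_tuple(version: str) -> Tuple[int, ...]:
    """Turn a dotted upstream version such as '2.12.0' into (2, 12, 0)."""
    return tuple(int(part) for part in version.split("."))


el7_ev = split_evr("10:2.12.0-44.1.el7_8.1")  # from the CentOS 7 listing
el9 = split_evr("7.2.0-14.el9_2")             # from the OEL9 package names

print(el7_ev)
print(el9)
# Compare only the upstream QEMU versions, ignoring the epoch.
print(upstream_tuple(el9[1]) > upstream_tuple(el7_ev[1]))
```

The comparison deliberately ignores the epoch, since epochs from different repositories are not meaningful to compare with each other.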

This really needs a new implementation that is compatible with newer qemu systems.

Btw, I haven't tried any Ubuntu systems.

weizhouapache commented 1 year ago

@DaanHoogland is it because of different qemu versions ?

yashi4engg commented 1 year ago

@DaanHoogland @weizhouapache In our infra, all hosts run OEL9 with the same qemu version.

DaanHoogland commented 1 year ago

It is because "neighbour discovery" doesn't work, it seems.

weizhouapache commented 1 year ago

moved to 4.18.2.0

yashi4engg commented 6 months ago

@DaanHoogland We tested it with different storage backends. In one setup we have Ceph as primary storage and NFS as secondary storage. In another setup we have OCFS2 as primary storage and NFS as secondary storage. It is still not working in either setup.

As OS we tested with OEL8 and OEL9; neither works.

slavkap commented 6 months ago

Hi @yashi4engg, before CS version 4.19, KVM host HA requires NFS primary storage. I was able to reproduce your problem when I removed the NFS primary storage from my dev environment: the host HA state never becomes Available. Can you share whether you have similar messages in your agent.log file?

2024-02-28 11:35:39,920 DEBUG [kvm.resource.KVMHAChecker] (pool-624-thread-1:null) (logid:5815f313) Checking heart beat with KVMHAChecker for host IP [10.2.26.1] in pools []
2024-02-28 11:35:39,921 WARN [kvm.resource.KVMHAChecker] (pool-624-thread-1:null) (logid:5815f313) All checks with KVMHAChecker for host IP [10.2.26.1] in pools [] considered it as dead. It may cause a shutdown of the host.
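In the messages above, `in pools []` shows an empty pool list, which matches the reproduction without NFS primary storage. A small Python sketch for scanning an agent.log for that condition; the regular expression is derived only from the KVMHAChecker lines quoted in this thread, and `heartbeat_pools` is a hypothetical helper name:

```python
import re

# Pattern based only on the KVMHAChecker messages quoted in this thread.
POOLS_RE = re.compile(
    r"KVMHAChecker for host IP \[(?P<ip>[^\]]+)\] in pools \[(?P<pools>[^\]]*)\]"
)


def heartbeat_pools(line):
    """Return (host_ip, [pool ids]) for a KVMHAChecker line, or None."""
    match = POOLS_RE.search(line)
    if match is None:
        return None
    pools = [p.strip() for p in match.group("pools").split(",") if p.strip()]
    return match.group("ip"), pools


line = (
    "2024-02-28 11:35:39,920 DEBUG [kvm.resource.KVMHAChecker] "
    "(pool-624-thread-1:null) (logid:5815f313) Checking heart beat with "
    "KVMHAChecker for host IP [10.2.26.1] in pools []"
)
host_ip, pools = heartbeat_pools(line)
if not pools:
    print(f"no heartbeat pools were checked for {host_ip}")
```

An empty list here only says that no pools were checked; whether that is the cause of a given Host HA failure still has to be confirmed against the storage configuration.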

spdinis commented 4 months ago

Hi,

We are preparing a transition from VMware to KVM and we are struggling to get HA to work, with the same symptoms.

We are using CloudStack 4.19.0. We have a few test clusters that have an NFS mount for heartbeat, and we will be using shared mount. For this case we tried with simple NFS and it made no difference.

We have out-of-band management enabled, and when we power off the physical host via iDRAC, the host moves to Fencing after a while and stays in that state, and all VMs that were running on it keep showing as Running.

Once we manually declare the host as degraded, the VMs jump straight to another host.

The agent log on one of the surviving nodes shows that it detects the host is down, as slavkap showed:

2024-04-25 15:34:11,406 WARN [kvm.resource.KVMHAChecker] (pool-636-thread-1:null) (logid:29cd972b) All checks with KVMHAChecker for host IP [10.250.9.154] in pools [e355eb3b-58eb-3ce2-890a-6a7b7263d896, b7f5ff6f-7233-3d79-aa1a-1c1bc233e1c8] considered it as dead. It may cause a shutdown of the host.

So I presume the issue is related to the transition to another state after fencing. We will perform some additional tests, using Redfish for example, or by trying to force a failure other than power off, to see whether the issue is the agent detecting via IPMI that the host is actually off and assuming it was a voluntary action.

rahultolearn59 commented 4 months ago

This seems to be the same issue I reported here https://github.com/apache/cloudstack/issues/8918

It looks like KVM HA keeps sending a power reset command to the iDRAC, but since the host is already powered off, the only command available is power on.

2024-04-16 22:43:30,146 WARN [o.a.c.h.t.BaseHATask] (pool-4-thread-1:null) (logid:3e5b7727) Exception occurred while running RecoveryTask on a resource: org.apache.cloudstack.ha.provider.HARecoveryException: OOBM service is not configured or enabled for this host
org.apache.cloudstack.ha.provider.HARecoveryException: OOBM service is not configured or enabled for this host
    at org.apache.cloudstack.kvm.ha.KVMHAProvider.recover(KVMHAProvider.java:83)
    at org.apache.cloudstack.kvm.ha.KVMHAProvider.recover(KVMHAProvider.java:42)
    at org.apache.cloudstack.ha.task.RecoveryTask.performAction(RecoveryTask.java:43)
    at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:86)
    at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:83)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.cloudstack.utils.redfish.RedfishException: Failed to get System power state for host 'GET' with request 'https:///redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset: '. The expected HTTP status code is '2XX' but it got '409'.

spdinis commented 4 months ago

After hours of testing I came up with several different issues.

So yes, #8918 is definitely a thing and confirmed. I tried both with IPMI and Redfish, with the same result. If I, for example, power the server up and send it to the BIOS, ACS immediately picks up the power-on state and fences the host, and the VMs immediately bounce to another host. I'm just not sure whether it is a reset or a power off, because it seems that in my case the default fencing action is power off. Either way, PowerEdge servers don't support reset or power off when the server is off. Later I will test this with some HP DL380 G10s and see how that goes.

This definitely makes little sense: if ACS already knows the server is powered off, why try to power it off instead of simply putting it in maintenance to bounce the VMs?
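The behaviour asked for here can be sketched as a small decision function. This is a hypothetical illustration of the suggestion, not CloudStack's actual fencing logic; `fence_action` is an invented name, and `ForceOff` is the standard Redfish `ResetType` value for an immediate power off:

```python
from typing import Optional


def fence_action(power_state: str) -> Optional[str]:
    """Pick an out-of-band action for fencing, or None if none is needed.

    Hypothetical sketch: a host that already reports Off is treated as
    fenced, so no power command is sent to the BMC at all.
    """
    state = power_state.strip().lower()
    if state == "off":
        return None          # already off: nothing to send
    if state == "on":
        return "ForceOff"    # Redfish ResetType for an immediate power off
    # Anything else (for example an unreachable BMC) cannot be decided
    # from the out-of-band status alone.
    raise ValueError(f"cannot decide fencing action for state {power_state!r}")


print(fence_action("Off"))
print(fence_action("On"))
```

The unknown case is left as an error on purpose: when the BMC itself has no power, the out-of-band channel gives no evidence either way, so another signal would have to decide.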

Now the other issue, which is kind of related but ends up defeating the purpose, is that if there is a power outage in the DC, an accidental cable pull, or anything like that, you are basically doomed. What happens is that the server goes to unknown,

and then you get these errors:

2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard error output command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status]: [Get Auth Capabilities error
2024-04-27 19:48:13,626 DEBUG [o.a.c.o.d.i.IpmitoolOutOfBandManagementDriver] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) The command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status] failed and got the result []. Error: [Get Auth Capabilities error

It seems that the NFS mechanism is for nothing: the cluster reaches consensus that the host is dead, and then nothing happens.

The other thing I have noticed is that if you have more than one NFS primary storage, all of them have the heartbeat files and the cluster members reach consensus, yet the host doesn't even move to Fencing, but to Degraded instead.

Here is the collection of logs from everything I could find in that case:

2024-04-27 19:36:16,119 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-2:ctx-9b0931e0) (logid:d80ab369) KVMInvestigator was able to determine host 38 is in Disconnected

2024-04-27 19:42:41,901 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (pool-1-thread-12:null) (logid:82c6107b) Investigating Host {"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"} via neighbouring Host {"id":37,"name":"ukslo-csl-kvm01.slo.gt-t.net","type":"Routing","uuid":"b1be18a1-6097-427c-bb04-3278ae5a1a33"}.
2024-04-27 19:42:42,138 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (pool-1-thread-12:null) (logid:82c6107b) Neighbouring Host {"id":37,"name":"ukslo-csl-kvm01.slo.gt-t.net","type":"Routing","uuid":"b1be18a1-6097-427c-bb04-3278ae5a1a33"} returned status [Down] for the investigated Host {"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}.
2024-04-27 19:42:42,138 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (pool-1-thread-12:null) (logid:82c6107b) Investigating Host {"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"} via neighbouring Host {"id":39,"name":"ukslo-csl-kvm03.slo.gt-t.net","type":"Routing","uuid":"316b97d6-ffca-44c5-874f-e5967b6035e3"}.
2024-04-27 19:42:42,488 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (pool-1-thread-12:null) (logid:82c6107b) Neighbouring Host {"id":39,"name":"ukslo-csl-kvm03.slo.gt-t.net","type":"Routing","uuid":"316b97d6-ffca-44c5-874f-e5967b6035e3"} returned status [Down] for the investigated Host {"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}.

2024-04-27 19:48:11,604 DEBUG [o.a.c.u.p.ProcessRunner] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Preparing command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status] to execute.
2024-04-27 19:48:11,605 DEBUG [o.a.c.u.p.ProcessRunner] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Submitting command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status].
2024-04-27 19:48:11,605 DEBUG [o.a.c.u.p.ProcessRunner] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Waiting for a response from command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status]. Defined timeout: [60].
2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard output for command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status]: [].
2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard error output command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status]: [Get Auth Capabilities error
2024-04-27 19:48:13,626 DEBUG [o.a.c.o.d.i.IpmitoolOutOfBandManagementDriver] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) The command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status] failed and got the result []. Error: [Get Auth Capabilities error

2024-04-27 19:50:23,841 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-2:ctx-148f4c19) (logid:5038a4c2) Transitioned host HA state from:Degraded to:Suspect due to event:PeriodicRecheckResourceActivity for the host id:38
2024-04-27 19:50:27,951 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-6:ctx-e859e219) (logid:ab18a786) Transitioned host HA state from:Suspect to:Checking due to event:PerformActivityCheck for the host id:38
2024-04-27 19:50:28,194 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-1:null) (logid:82c6107b) Transitioned host HA state from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:51:00,659 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-2:ctx-5cee651c) (logid:51a40d65) Transitioned host HA state from:Suspect to:Checking due to event:PerformActivityCheck for the host id:38
2024-04-27 19:51:00,905 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-25:null) (logid:7e1ca968) Transitioned host HA state from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:51:33,394 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-5:ctx-2fce5985) (logid:3fde664f) Transitioned host HA state from:Suspect to:Checking due to event:PerformActivityCheck for the host id:38
2024-04-27 19:51:33,634 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-5:null) (logid:66de58d8) Transitioned host HA state from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:52:06,140 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-6:ctx-dde87f00) (logid:10de929d) Transitioned host HA state from:Suspect to:Checking due to event:PerformActivityCheck for the host id:38
2024-04-27 19:52:06,382 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-7:null) (logid:dd936499) Transitioned host HA state from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:54:17,004 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-2:ctx-173a7199) (logid:bee908f9) Transitioned host HA state from:Suspect to:Checking due to event:PerformActivityCheck for the host id:38
2024-04-27 19:54:17,246 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-19:null) (logid:3df2c5a8) Transitioned host HA state from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:54:49,699 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-3:ctx-658f80b7) (logid:2ec0cbbe) Transitioned host HA state from:Suspect to:Checking due to event:PerformActivityCheck for the host id:38
2024-04-27 19:54:49,949 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-15:null) (logid:47634402) Transitioned host HA state from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:55:22,483 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-4:ctx-948895aa) (logid:a1bb5643) Transitioned host HA state from:Suspect to:Checking due to event:PerformActivityCheck for the host id:38
2024-04-27 19:55:22,723 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-21:null) (logid:432c0642) Transitioned host HA state from:Checking to:Degraded due to event:ActivityCheckFailureUnderThresholdRatio for the host id:38
2024-04-27 20:00:24,278 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-3:ctx-c4b31d57) (logid:bd7094fb) Transitioned host HA state from:Degraded to:Suspect due to event:PeriodicRecheckResourceActivity for the host id:38

You can see that the host loops between Degraded and Suspect and Checking, but never moves into fencing.
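To check a longer management-server log for the same loop, the transitions can be extracted with a few lines of Python. This is a rough sketch: the pattern is derived from the HAManagerImpl lines quoted above, the helper names are hypothetical, and `Fencing` as a state name is assumed from the wording used in this thread:

```python
import re

# Pattern derived from the HAManagerImpl lines quoted above.
TRANSITION_RE = re.compile(
    r"Transitioned host HA state from:(?P<src>\w+) to:(?P<dst>\w+) "
    r"due to event:(?P<event>\w+) for the host id:(?P<host>\d+)"
)


def ha_transitions(lines, host_id):
    """Return [(from_state, to_state, event), ...] for one host, in log order."""
    found = []
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match and int(match.group("host")) == host_id:
            found.append(
                (match.group("src"), match.group("dst"), match.group("event"))
            )
    return found


def ever_reached(lines, host_id, state):
    """True if the host entered `state` at any point in the given lines."""
    return any(dst == state for _, dst, _ in ha_transitions(lines, host_id))


# Three lines copied from the log excerpt above.
SAMPLE = [
    "2024-04-27 19:55:22,483 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-4:ctx-948895aa) (logid:a1bb5643) Transitioned host HA state from:Suspect to:Checking due to event:PerformActivityCheck for the host id:38",
    "2024-04-27 19:55:22,723 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-21:null) (logid:432c0642) Transitioned host HA state from:Checking to:Degraded due to event:ActivityCheckFailureUnderThresholdRatio for the host id:38",
    "2024-04-27 20:00:24,278 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-3:ctx-c4b31d57) (logid:bd7094fb) Transitioned host HA state from:Degraded to:Suspect due to event:PeriodicRecheckResourceActivity for the host id:38",
]

for src, dst, event in ha_transitions(SAMPLE, 38):
    print(f"{src} -> {dst} ({event})")
print("reached Fencing:", ever_reached(SAMPLE, 38, "Fencing"))
```

Passing an open log file instead of `SAMPLE` works the same way, since the function only iterates over lines.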

I will still do some more testing around this, now that I have a better grip on what is happening. At some point I basically threw all the toys out of the pram and tried so many things that I need to do a more segmented investigation of some details. I will play around with the fencing option that exists in the global settings to fence the host if only 1 witness is lost.

I have a call with ShapeBlue on Monday; I will review these findings with them as well.

But one thing is for sure: this mechanism needs some improvement, especially coming from VMware, where HA just works when it has to and deals very well with host isolation. Currently I don't think we will have HA on environmental power loss, due to the nature of the mechanism, which relies on understanding the OOB status; if the OOB chip is unresponsive, it seems to have no action other than power off when it decides to fence.

rohityadavcloud commented 4 months ago

Isn't this documented? For EL8/9, ipmitool has issues on those distros.

spdinis commented 4 months ago

So after a lot of testing I think it is expected behavior due to the design.

I have the issue regardless of distro; I'm using Ubuntu 22.04, for example.

Long story short, the HA mechanism won't work at all when the server is powered off or when the IPMI status is unknown, which happens when the server has no power at all.

The odd thing is that better coordination between Host HA and VM HA would overcome the problem. I ended up ignoring Host HA altogether and keep using VM HA. That works, with the caveat that it takes around 15 minutes to trigger, but it eventually does, after breaching the acceptable threshold for lost NFS heartbeats.

I have no idea where to adjust that timer. I have been trying to look into it, but I have bigger fish to fry at this point, so I'm accepting that in the rare circumstance where a host is powered off or loses environmental power, it will take +/- 15 minutes for the VMs to bounce elsewhere.

So the workaround for me is simply to disable Host HA in the cluster/zone/host and be patient when HA is required.

There are enough things in place to make the mechanism robust; it is just that the coordination between the 2 mechanisms requires some work. But a feature request needs to be raised. I don't consider this a bug, rather a design flaw.

DaanHoogland commented 4 months ago

The odd thing is that better coordination between Host HA and VM HA would overcome the problem.

Ack, this has been an issue since the inception of host HA and we are open to suggestions on that part. Once we have a feasible design, the next issue is the availability of implementation and testing resources (i.e. people pouncing at keyboards).

rohityadavcloud commented 2 months ago

There is a known issue with ipmitool version on EL8 and EL9, it's worth checking if OOBM is working in the first place.
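One way to check whether OOBM works at all is to run the same power-status query by hand from the management server. Below is a minimal Python sketch of that check. The ipmitool arguments mirror the command shown in the logs earlier in this thread (without `-v`), the function names are hypothetical, and the parser assumes ipmitool's usual `Chassis Power is on` / `Chassis Power is off` wording:

```python
import subprocess


def classify_power_output(stdout: str) -> str:
    """Map ipmitool 'chassis power status' output to On, Off or Unknown."""
    text = stdout.strip().lower()
    if "chassis power is on" in text:
        return "On"
    if "chassis power is off" in text:
        return "Off"
    return "Unknown"


def chassis_power_status(host, user, password, port=623, timeout=60):
    """Query the BMC once; any failure is reported as Unknown."""
    cmd = [
        "/usr/bin/ipmitool", "-I", "lanplus", "-R", "1",
        "-H", host, "-p", str(port), "-U", user, "-P", password,
        "chassis", "power", "status",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return "Unknown"     # ipmitool missing or BMC not answering in time
    if result.returncode != 0:
        return "Unknown"     # e.g. "Get Auth Capabilities error" on stderr
    return classify_power_output(result.stdout)
```

Calling `chassis_power_status("<bmc-address>", "<user>", "<password>")` against a reachable BMC should return `On` or `Off`; a result of `Unknown` for a host that is known to be up points at the OOBM path rather than at Host HA itself.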