SUSE / SAPHanaSR

SAP HANA System Replication Resource Agent for Pacemaker Cluster
GNU General Public License v2.0

indexserver event capture and RA integrated action logic #219

Closed. ja9fuchs closed this 6 months ago

ja9fuchs commented 9 months ago

The following is an integrated solution for handling an indexserver service crash or restart. It takes the secondary node's health into consideration and thus enhances the resiliency of the HANA scale-up environment in such a situation.

The solution consists of 2 parts:

1) A new, simple hook script captures events from the indexserver service and updates a cluster node attribute on each node individually, making the service state information available to the resource agent. The hook is a generic event filter and could later be extended to cover additional services if desired (a minimal sketch follows below).

2) The SAPHana resource agent has been enhanced with logic that processes the service state information and, depending on the secondary node's health, can decide to fail the primary SAPHana resource, which triggers a standard failover. When the secondary is not in a healthy state, the function only logs the situation and lets the primary keep recovering locally.
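
For illustration only, a minimal sketch of such an event-filter hook could look like the one below. It uses the documented SAP HA/DR provider interface (`hdb_ha_dr.client.HADRBase` and its `srServiceStateChanged` callback); the class name, the node attribute name, and the status handling are assumptions made for this sketch and are not the actual code proposed in this PR.

```python
# srvEventFilter.py - illustrative sketch only, not the PR's actual hook.
# The file name must match the class name, and the class is registered in
# the instance's global.ini as an HA/DR provider.
from hdb_ha_dr.client import HADRBase
import os

class srvEventFilter(HADRBase):

    def __init__(self, *args, **kwargs):
        super(srvEventFilter, self).__init__(*args, **kwargs)

    def about(self):
        return {"provider_company": "Example",
                "provider_name": "srvEventFilter",
                "provider_description": "Forward indexserver state to a cluster node attribute",
                "provider_version": "0.1"}

    def srServiceStateChanged(self, ParamDict, **kwargs):
        # React only to indexserver events; ignore every other service.
        if ParamDict.get("service_name") != "indexserver":
            return 0
        status = ParamDict.get("service_status", "unknown")
        # Attribute name is an assumption for this sketch; SAPSYSTEMNAME is
        # assumed to be set in the <sid>adm environment the hook runs in.
        attr = "hana_%s_indexserver" % os.environ.get("SAPSYSTEMNAME", "").lower()
        # Publish the state as a transient (reboot-scoped) attribute on the local
        # node, so the SAPHana RA can read it during its monitor operation.
        cmd = "sudo /usr/sbin/crm_attribute -n %s -v %s -l reboot" % (attr, status)
        rc = os.system(cmd)
        self.tracer.info("srvEventFilter: %s rc=%s" % (cmd, rc))
        return 0
```

Such a hook would be registered in the HANA global.ini via an `[ha_dr_provider_<ClassName>]` section with `provider`, `path` and `execution_order` keys, and the `<sid>adm` user would need a sudoers rule for `crm_attribute`, as with the existing SAPHanaSR hooks.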

ja9fuchs commented 9 months ago

Hi @fmherschel, I understand that the existing solution is working and is used successfully by customers.

My objective is to include the health of the secondary node in the decision-making process, so that no action is taken on the primary DB if the secondary is not in a position to take over. By passing the indexserver status as a cluster node attribute and integrating the processing logic into the resource agent, the check runs as part of the regular monitor operation, and an action is only triggered when the cluster health requirements are met.
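
As a rough illustration of that decision flow: the actual SAPHana RA is a bash script, so the Python sketch below only mirrors the logic; the `hana_<sid>_indexserver` attribute is the hypothetical one from the hook sketch above, while `hana_<sid>_sync_state` (SOK/SFAIL) is the system replication state already tracked by SAPHanaSR.

```python
# Illustrative sketch of the monitor-time decision, not the RA's actual code.
import subprocess

def node_attr(node, name, default=""):
    """Read a transient node attribute via crm_attribute; return default if unset."""
    try:
        out = subprocess.run(
            ["crm_attribute", "-N", node, "-n", name, "-G", "-q", "-l", "reboot"],
            capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except subprocess.CalledProcessError:
        return default

def should_fail_primary(primary, secondary, sid):
    idx_state = node_attr(primary, "hana_%s_indexserver" % sid)
    sync_state = node_attr(secondary, "hana_%s_sync_state" % sid)
    # Fail the primary resource (and thereby trigger a takeover) only when the
    # indexserver is gone on the primary AND the secondary is fully in sync.
    return idx_state in ("stopping", "stopped") and sync_state == "SOK"
```

If `should_fail_primary()` is true, the monitor would report a failure for the primary resource and let Pacemaker perform the usual takeover; otherwise it would only log the degraded indexserver state and leave the primary to recover locally.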

This has been a request raised by several customers and partners of Red Hat.

fmherschel commented 9 months ago

@ja9fuchs But that's wrong! If you have a long-dying indexserver, it is a good option to get active on the primary. E.g. a fencing with a node restart can recover faster than waiting for the long-dying indexserver. At least this is what we have seen at customer sites. The no-go of the new code is that it intervenes without (again) relying on the SOK status. And as our solution already covers what customers are asking for, we do not need a new parallel solution which changes the RA in a non-critical way. Please read our hook script and documentation. If something is missing, you could open an issue and then we could discuss whether we could (also) solve that. Thanks!

fmherschel commented 9 months ago

Our solution also works on the secondary. This is indeed important! A blocking indexserver on the secondary also affects the primary: a) the log area can reach a critical usage level, because all logs need to be kept until they can be shipped to the secondary; b) a secondary blocked in an unsound memory recovery directly harmed the primary, so the primary side also got stuck.

ja9fuchs commented 9 months ago

Converted the PR to a draft for the time being. More research and testing is ongoing, and we will decide later how best to move forward with the matter.

ja9fuchs commented 6 months ago

I'm closing this PR after re-evaluating our customers' requirements and testing the existing solution via susChkSrv.py. Thanks again for the discussion and the valuable information shared outside of this PR.