Thanks for your issue report and suggested changes. In the past we had a reason why we did NOT fail if the takeover failed or the sr_state was (still) reporting to be a secondary. Give me some time to think about what happened in the past, because that might be a reason to reject your code change.
Thanks for the quick response. From the code it looks to me that there is a note just after the takeover suggesting that there might be more states this can detect, but the actual assignment (line 2469) sits at the very end of the function. So no matter what the `rc` variable is set to during `saphana_promote_clone`, it will end up as `0 == $OCF_SUCCESS`.
        HANA_CALL --timeout inf --use-su --cmd "hdbnsutil -sr_takeover"
        #
        # now gain check, if we are primary NOW
        #
        # TODO: PRIO3: check, if we need to destinguish between HANA_STATE_PRIMARY, HANA_STATE_SECONDARY, HANA_STATE_DEFECT
        #
        if check_for_primary; then
            rc=$OCF_SUCCESS;
        else
            rc=$OCF_FAILED_MASTER
        fi
        ;;
    * )
        super_ocf_log err "ACT: HANA SYNC STATUS IS NOT 'SOK' SO THIS HANA SITE COULD NOT BE PROMOTED"
        rc=$OCF_ERR_GENERIC
        ;;
    esac
else
    #
    # neither MASTER nor SLAVE - This clone instance seams to be broken!!
    # bsc#1027098 - do not stop SAP HANA if "only" HANA state is not correct
    # Let next monitor find, if that HANA instance is available or not
    rc=$OCF_SUCCESS;
fi
fi
rc=$?
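To see in isolation why the trailing `rc=$?` always ends up as `0` regardless of the branches above it, here is a minimal standalone sketch (not taken from the agent; the OCF constants are replaced by literal values):

```bash
#!/bin/bash
# Minimal illustration: a variable assignment exits with status 0, so a
# trailing 'rc=$?' discards whatever failure code the branches set before it.
promote_sketch() {
    local rc=0
    if false; then       # stand-in for a successful check_for_primary
        rc=0             # would be $OCF_SUCCESS
    else
        rc=9             # would be $OCF_FAILED_MASTER
    fi
    rc=$?                # exit status of the assignment above, i.e. always 0
    return $rc
}
promote_sketch; echo "returned $?"   # prints "returned 0"
```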
The 'monitor' operation just after this code would detect that the node on which the Master should be running reports Slave, and the cluster will consider this a 'FAILED Master' scenario. So this change is mostly about reporting sooner that we failed to promote, instead of "promoting" and then detecting that the Master is not a Master. But if there is a reason for it to be this way, it would be good to document it.
If any help is needed with this, feel free to let me know.
One new example that I have seen for this is an attempt where we do `promote`, then we see the message ACT: HANA SYNC STATUS IS NOT 'SOK' SO THIS HANA SITE COULD NOT BE PROMOTED, and then `rc=0` is reported for the `promote` operation. In this situation I think it doesn't make sense to return `0` as it is returned now, or it is at least very confusing to see both error and success reports for `promote`.
SAPHana(SAPHana_SID)[zzzz]: INFO: RA ==== begin action promote_clone (0.152.21) ====
[xxxx] yyy attrd: info: attrd_peer_update: Setting hana_sid_clone_state[yyy]: DEMOTED -> PROMOTED from yyy
SAPHana(SAPHana_SID)[zzzz]: ERROR: ACT: HANA SYNC STATUS IS NOT 'SOK' SO THIS HANA SITE COULD NOT BE PROMOTED
SAPHana(SAPHana_SID)[zzzz]: INFO: RA ==== end action promote_clone with rc=0 (0.152.21) (3s)====
@OndrejHome, do you know what the default behaviour of pacemaker is when it detects that a promote has failed? If pacemaker tries to fence the node by default when a promote fails, could that be the reason why the promote_clone function always returns 0?
Fencing the node when the promote fails might lead to data corruption, because the HANA instance on the node is forcibly killed and this might be the only HANA instance in the cluster that still has the correct set of data at that point in time.
Does always returning 0 from the promote_clone function cause issues with correct cluster behaviour, or is this more of a 'cosmetic' issue because 'rc=0' is written to the logs even though an error occurred?
@fdanapfel, the default behaviour when a promote fails is to 'demote'->'stop'->'start' the resource. If all these operations succeed there is no fencing; if 'stop' fails then fencing occurs by default.
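(For context, this recovery policy is driven by each operation's on-fail setting. The crmsh fragment below only illustrates the defaults just described; the resource name, SID, instance number and timeouts are placeholders, and in a real setup the primitive would additionally be wrapped in a master/slave resource.)

```bash
# illustration only: these on-fail values match the pacemaker defaults
# described above (restart on failed promote, fence on failed stop when
# STONITH is enabled); all names and timeouts are placeholders
crm configure primitive rsc_SAPHana_HA1_HDB00 ocf:suse:SAPHana \
    params SID=HA1 InstanceNumber=00 \
    op promote timeout=320 on-fail=restart \
    op stop timeout=300 on-fail=fence \
    op monitor interval=60 role=Master timeout=700
```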
Always returning 0 in promote makes log analysis harder: the DC node periodically reports the state of the cluster in the logs, including the states of resources, so for a while SAPHana would look like a Master resource even when it is not, and it becomes harder to find the point in the logs where the 'promote' failed.
In any case it also makes a lot of the code that tries to determine the value for 'rc' useless if we just end with 0.
So while mostly cosmetic, it makes it easier to find a failed promote by exit code rather than by analyzing error messages output by the resource agents, which may change over time, while exit codes are fairly stable.
Does the above make sense?
@OndrejHome, it makes sense, but consider that stopping and starting the HANA instance when a promote fails could lead to data corruption or data loss inside the database (as you can see in your example, the error reported was that the sync status was not OK, so HANA System Replication might still have been in the process of syncing data between the sites, and stopping one of the sites in that situation wouldn't be a good idea). I would therefore consider not being able to find a failed promote by checking the exit code the lesser of two evils.
@OndrejHome : Changing the rc for promote might make log analysis easier in very seldom situations. However, changing the rc could cause regressions, as it could force the cluster to stop a SAP HANA where in the past the cluster would not have stopped it. Stopping a SAP HANA only to make life easier for supporters is not really a good argument for an enterprise solution. Restarting a SAP HANA without the need to do so could cause very long downtimes; even if only the secondary is restarted without need, this could harm the primary because of the stopped log shipping.
Oops - I wrote my last comment about 2-3 hours ago, but it has just been added to the thread :/
@fdanapfel , @fmherschel - If something in 'promote' failed and HANA is not Primary, then the first 'monitor' operation will detect this and in most cases will send HANA through the 'demote'->'stop'->'start' procedure. Having a proper exit code when an operation fails is not only for supporters but also for people reading the code of the resource agent - it is a bit strange to run several checks, then just throw their results away and return a hard-coded success without explanation. The 'monitor' operation of the Master resource is scheduled by the cluster once the 'promote' operation reports success, so there is very little hope that anything will magically recover from the failure in between. Reporting success where a failure should have been reported is in most cases confusing, and to me it is comparable to always reporting '0' for the 'start' operation and hoping that the next 'monitor' will detect that 'start' was not successful. Also, since you mention that this should be an enterprise solution, I think that clear reporting of errors leads to faster identification of problems. Yet this change might not be as easy as I initially thought, and I will try to explain why below.
Based on your feedback it seems there is some dependency between the 'promote' and 'monitor' operations that might get broken. Below are the scenarios I could think of for what can happen to the 'promote' operation.
=== A - promote was a real success
The 'promote' operation would return '0', so the cluster reports Master. The cluster issues a 'monitor' operation on the Master instance.

- `saphana_monitor_clone` function is called
- `check_for_primary` will try to determine if we are Primary
- `saphana_monitor_primary` continues checking
- `get_hana_landscape_status` and then we are a proper running Master
- `saphana_monitor_secondary`, where nothing can return $OCF_RUNNING_MASTER, so we will fail the monitor operation and 'demote'->'stop'->'start' will occur

=== B - promote failed but it was not a timeout
The 'promote' operation would return '0', so the cluster reports Master. The cluster issues a 'monitor' operation on the Master instance.

- `saphana_monitor_clone` function is called
- `check_for_primary` will try to determine if we are Primary
- `saphana_monitor_primary` continues checking
- **`get_hana_landscape_status` and we are a proper running Master**
- **`saphana_monitor_secondary`, where nothing can return $OCF_RUNNING_MASTER, so we will fail the monitor and 'demote'->'stop'->'start' will occur**

`saphana_promote_clone` has only 2 states in which we fail:

- `check_for_primary` didn't return PRIMARY - so we never took over the PRIMARY role
- `$hana_sync` is not SOK - here we were not suitable for becoming PRIMARY

=== C - promote failed with timeout

=== D - promote failed with proper failure code (this is my proposal here)
Looking at scenario B, it seems that just removing the line with `rc=$?` does not cover all the use cases, as the 'monitor' operation may end up in some branch that changes node attributes. That is, there is code in the resource agent that changes something before we 'demote'->'stop'->'start' the resource. Therefore it would be needed to ensure that when 'promote' fails, the relevant parts of `saphana_monitor_clone` are called the same way as happens now when the first 'monitor' runs. I have some preliminary (untested) code to demonstrate this idea of "decoupling" the 'promote' and 'monitor' operations, which would allow 'promote' to return errors while still running the functions that 'monitor' would run in the current state where 'promote' cannot fail even if some issue appears. This adds to `saphana_promote_clone` the code that currently happens before the 'monitor' detects that we will do 'demote'->'stop'->'start'.
https://github.com/OndrejHome/SAPHanaSR/tree/promote_clone_fix
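For illustration only, a rough shape of that idea might look like the sketch below (the helper names come from the existing agent, but the exact call sites and attribute handling are my assumptions; the actual draft lives in the branch linked above):

```bash
# Sketch of a "decoupled" promote: report the real failure code, but first run
# the same bookkeeping that the next 'monitor' would otherwise do (assumption:
# this is handled by the existing saphana_monitor_clone / attribute helpers).
saphana_promote_clone_sketch() {
    local rc=$OCF_ERR_GENERIC
    # ... existing sync-state checks as in the current agent ...
    HANA_CALL --timeout inf --use-su --cmd "hdbnsutil -sr_takeover"
    if check_for_primary; then
        rc=$OCF_SUCCESS
    else
        # keep node attributes consistent before the cluster reacts,
        # instead of relying on the follow-up 'monitor' to do it
        saphana_monitor_clone
        rc=$OCF_FAILED_MASTER
    fi
    return $rc   # no trailing 'rc=$?' that would reset this to 0
}
```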
Please let me know what you think about the above approach of "decoupling" the 'monitor' and 'promote' operations. Also feel free to comment on the code, but note that at present it is a draft idea that has not been tested.
@OndrejHome Sorry, your argumentation with the following monitor is UNTRUE and INVALID. If a takeover had not been processed, the SAP HANA would stay as primary but running. This would result in a valid monitor result and the HANA would be kept up and running.
WE WILL NOT INTRODUCE A REGRESSION BECAUSE ONLY MAKING CODE-READERS HAPPY. CUSTOMER DATA AND USER EXPERIENCE IS IMPORTANT HERE.
The current code is valid and, besides some improvements and fixes, very, very stable. Your change would need a lot of tests with all current and previous SAP HANA versions to figure out whether there might be a regression.
I would like to STOP the discussion here and move such thoughts back to the people in the SAP LinuxLab. If Frank and I really think that this is an important change, we will of course come back to you. For now I declare this change as rejected, to limit potential customer data loss.
Line 2469 will always result in exit code `0`, because the last command in any of the branches before it is a variable assignment, which returns `0` on success.
https://github.com/fmherschel/SAPHanaSR/blob/15e17acc741748dff4f8bf96b410be8e07c298b3/SAPHana/ra/SAPHana#L2469
Removing the above line results in proper detection of a failed promote. To verify that this works after removing line 2469, it is enough to comment out the takeover command on line 2444 and check that in such a case the 'promote' of SAPHana fails.
https://github.com/fmherschel/SAPHanaSR/blob/15e17acc741748dff4f8bf96b410be8e07c298b3/SAPHana/ra/SAPHana#L2444
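As a hedged example of how such a check could be run by hand outside the cluster (the SID, instance number and agent path below are placeholders for a real installation, and depending on the agent version additional OCF_RESKEY_CRM_meta_* variables may be required):

```bash
# Invoke the agent's 'promote' action directly, the way the cluster would,
# and inspect the exit code; all values below are placeholders.
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_SID=HA1
export OCF_RESKEY_InstanceNumber=00
/usr/lib/ocf/resource.d/suse/SAPHana promote
echo "promote returned $?"   # non-zero expected once the takeover call is commented out and the rc=$? line removed
```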
Or is there any other intention behind the assignment on line 2469 that I don't see?