SUSE / SAPHanaSR

SAP HANA System Replication Resource Agent for Pacemaker Cluster
GNU General Public License v2.0

Why is there a hardcoded timeout of 5s for calls of systemReplicationStatus.py? #169

Closed: fdanapfel closed this issue 1 year ago

fdanapfel commented 1 year ago

With https://github.com/SUSE/SAPHanaSR/commit/7c66a3bd15e114570dbf2076f301041fb57a5259 the option to change the default value for HANA_CALL_TIMEOUT was (re-)introduced.

But for calls of systemReplicationStatus.py a hardcoded value of 5s is still used: https://github.com/SUSE/SAPHanaSR/blob/master/ra/SAPHana#L1312

Is there a specific reason for this?

The reason I ask is that we received the following from a customer who ran into issues with their cluster setup due to this hardcoded timeout (the analysis actually seems to have been done by SAP):

"Mar 11 06:00:48 SAPHana(SAPHana_SPR_00)[3828054]: INFO: RA ==== begin action monitor_clone (0.154.0) ====
Mar 11 06:00:48 SAPHana(SAPHana_SPR_00)[3828054]: INFO: RA: SRHOOK1=PRIM
Mar 11 06:00:48 SAPHana(SAPHana_SPR_00)[3828054]: INFO: RA: SRHOOK3=PRIM
Mar 11 06:00:57 SAPHana(SAPHana_SPR_00)[3828054]: INFO: RA: SRHOOK1=PRIM
Mar 11 06:00:57 SAPHana(SAPHana_SPR_00)[3828054]: INFO: RA: SRHOOK3=PRIM
Mar 11 06:01:03 SAPHana(SAPHana_SPR_00)[3828054]: WARNING: HANA_CALL timed out after 5 seconds running command 'systemReplicationStatus.py --site=Site2'
Mar 11 06:01:03 SAPHana(SAPHana_SPR_00)[3828054]: INFO: DEC analyze_hana_sync_statusSRS systemReplicationStatus.py (to site 'Site2')-> 124
Mar 11 06:01:03 SAPHana(SAPHana_SPR_00)[3828054]: INFO: ACT site=Site1, setting SFAIL for secondary (2) - srRc=124
Mar 11 06:01:03 SAPHana(SAPHana_SPR_00)[3828054]: INFO: RA ==== end action monitor_clone with rc=8 (0.154.0) (18s)====

When executed manually directly in the OS, this command typically takes around 3–4 seconds to return, depending on HANA load. Under high HANA load it is possible that this command takes slightly longer to execute. For example, when I executed this command manually in SPR today at 13:00 AEDT it responded in 6.7 seconds, which would be enough to cause the cluster to indicate a failure, as it is “overrun” by 1.7 seconds.

This then caused the cluster to failover.

It is possible that the HANA system cannot respond within the hardcoded 5 seconds in all load situations, which falsely causes the cluster to raise a failure because HANA has not responded as expected. Setting the parameter HANA_CALL_TIMEOUT has no effect here, as the timeout is hardcoded to 5 seconds."
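
For reference on the srRc=124 seen above: 124 is the conventional exit status of a timeout wrapper such as GNU coreutils timeout when the wrapped command is killed for exceeding its limit, which matches the "HANA_CALL timed out after 5 seconds" warning in the log. A minimal, generic shell sketch of the effect (not the RA's actual HANA_CALL code):

# Sketch only: a command needing 7 seconds against a fixed 5-second limit
# is killed by the wrapper, which reports exit code 124.
timeout 5s sleep 7
echo "exit code: $?"    # prints 124; in the log above the RA then sets SFAIL for the secondary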

Edit: I found https://github.com/SUSE/SAPHanaSR/commit/2cc0fba04a59b23ee0a5f2ec44172bf5d7a06ff0 where the hardcoded timeout was lowered from 60s to 5s, but there is no clear explanation of the reasoning behind this change.

fmherschel commented 1 year ago

@fdanapfel The limit of 5s instead of HANA_CALL_TIMEOUT is intentional. While 5s is just a very short time << 30s, HANA_CALL_TIMEOUT could be longer than the 30s for which a commit is held back on the primary. A longer timeout would therefore open a window for losing data. In any case, the better method is always to use the system replication hook, which informs the cluster in time. In such a situation it is uncritical if the srPoll mechanism times out, because a valid answer from the hook script (srHook) always takes precedence for the values "SOK" and "SFAIL". The polling attribute (srPoll) is only used as a fallback if the hook attribute is not present or has a value like "SNA", "SWAIT" or something else not in the set of "SOK" and "SFAIL".
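
To illustrate that precedence, here is a minimal sketch (not the RA's actual code; the function and variable names are made up for illustration):

# Sketch: a valid hook value always wins; the polled value is only a fallback.
combine_sr_state() {
    local srHook="$1" srPoll="$2"
    case "$srHook" in
        SOK|SFAIL) echo "$srHook" ;;   # valid srHook attribute takes precedence
        *)         echo "$srPoll" ;;   # missing hook or SNA/SWAIT/... -> fall back to srPoll
    esac
}

So even if the polled side runs into the 5s timeout and is marked SFAIL, combine_sr_state "SOK" "SFAIL" would still yield SOK.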

fmherschel commented 1 year ago

@fdanapfel The reduction from 60s to 5s was a first step to lower the risk of losing data. Unfortunately the SAP HANA API lies for some time and still reports that everything is ok when asked for the systemReplicationStatus. The time between the correct answer (failure of SR) and the release of the held commit is very short. So in addition to the shortened poll timeout we also introduced a hook script covering the monitoring of SR HA/DR events.

fdanapfel commented 1 year ago

@fmherschel The customer is actually using the srConnectionStateChanged() hook, but we have now found out that they are using Multitarget Replication in combination with HANA Scale-Up System Replication, which is currently not supported with the resource agents we ship.

But a colleague from SAP pointed out something interesting with regards to the hardcoded timeout in relation to support for Multitarget Replication while reviewing the resource agents for managing HANA Scale-Out System Replication:

"If I look at the SUSE SAPHanaSR-ScaleOut SAPHanaController resource agent in GitHub, there is something interesting about the system replication check in this scenario

https://github.com/SUSE/SAPHanaSR-ScaleOut/blob/master/SAPHana/ra/SAPHanaController

function analyze_hana_sync_statusSRS()
{
    super_ocf_log info "FLOW ${FUNCNAME[0]} ($*)"
    local rc=-1 srRc=0 all_nodes_other_side="" n="" siteParam=""
    if [ -n "$remSR_name" ]; then
       siteParam="--site=$remSR_name"
    fi
    FULL_SR_STATUS=$(HANA_CALL --timeout $HANA_CALL_TIMEOUT --cmd "python systemReplicationStatus.py $siteParam" 2>/dev/null); srRc=$?
    super_ocf_log info "FLOW ${FUNCNAME[0]} systemReplicationStatus.py (to site '$remSR_name')-> $srRc"
    #

Note that the 5-second hardcoding has actually been removed in the SAPHanaController resource agent used for HANA Scale-Out System Replication HA, and the $HANA_CALL_TIMEOUT value is used instead.

This does make sense as there are more servers in a HANA Multi-Target System Replication topology to query."
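
To make the contrast concrete, the two call shapes under discussion look roughly like this (paraphrased for illustration, not verbatim source lines):

# Scale-Up SAPHana (paraphrased): the limit is fixed at 5 seconds
FULL_SR_STATUS=$(HANA_CALL --timeout 5 --cmd "systemReplicationStatus.py $siteParam" 2>/dev/null); srRc=$?
# Scale-Out SAPHanaController (excerpt above): the limit comes from HANA_CALL_TIMEOUT
FULL_SR_STATUS=$(HANA_CALL --timeout $HANA_CALL_TIMEOUT --cmd "python systemReplicationStatus.py $siteParam" 2>/dev/null); srRc=$?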

I'm not sure whether Multitarget Replication is actually supported with the current versions of the resource agents for HANA Scale-Up System Replication, but if it is on the roadmap then it might make sense to reconsider the hardcoded timeout for the systemReplicationStatus.py call in the SAPHana resource agent. Based on the statement from the SAP colleague, a too low timeout could actually increase the risk of losing data in Multitarget Replication setups (due to accidental failovers if the systemReplicationStatus.py script takes longer to respond).

fmherschel commented 1 year ago

@fdanapfel We develop our solution together with customers. They are very happy with the solution. Our solution works perfectly if you are using the described hooks (see man pages). This is all documented in the man pages here in the source tree. There is no(!) data loss (tested intensively with a fast-running application together with customers). For Scale-Up you need the hook anyway, because the sync state is not multi-target aware. For Scale-Out the hook has been needed from the start. For Multi-Target you need the newer hook. This is needed because the HA/DR API from SAP initially was not able to distinguish between secondary sites. In sum: an HA/DR hook which checks the SR status is always needed. The SRS function is only a fallback to provide a second indicator. With any timeout, the SRS result always arrives later than the HA/DR hook result. If you check your cluster using SAPHanaSR-showAttr you can see in real time what I have described above.

MultiTarget SR is supported with our current (and upstream) versions. Requirements are described in the man pages.

fmherschel commented 1 year ago

@fdanapfel Could we close this issue?

lpinne commented 1 year ago

Hi @fdanapfel ,

the manual pages SAPHanaSR.py(7) in project SAPHanaSR as well as SAPHanaSR.py(7) and SAPHanaSrMultiTarget.py(7) in project SAPHanaSR-ScaleOut describe the requirements and use cases. The same information on the available HA/DR provider hook scripts and supported scenarios can be found as a table at https://documentation.suse.com/sles-sap/sap-ha-support/html/sap-ha-support/index.html#sap-ha-solutions-sap-hana-hook-scripts

Regards, Lars

fdanapfel commented 1 year ago

@fmherschel @lpinne Thanks for all the replies; however, they unfortunately don't answer the actual question: why is there a hardcoded timeout of 5 seconds in the SAPHana resource agent for the call to systemReplicationStatus.py, while in the SAPHanaController resource agent this same timeout is configurable?

lpinne commented 1 year ago

Hi @fdanapfel

This is described in the mentioned manual pages.

Regards, Lars

fmherschel commented 1 year ago

@fdanapfel 5s for Scale-Up, because there was a time when we supported setups without a hook. Scale-Out is allowed to be more sloppy, because Scale-Out has always had the hook implemented and has never been supported without it. It is allowed to be sloppy because the SRS poll in Scale-Out is only needed if a hook event "SOK" has been missed and polling would then be the only way to figure out what the current status is. Even this last SRS usage is not needed that intensively any more, since the hook now also supports a fallback communication if the hook script cannot reach the cluster. So the difference completely makes sense due to the support rules of the past.

fdanapfel commented 1 year ago

@fmherschel @lpinne Thanks a lot for the detailed explanations. With this the issue can be closed.

fmherschel commented 1 year ago

@fdanapfel You are always welcome!