fmherschel / SAPHanaSR-old-private


Show detailed sync state in crm_mon? #40

Closed jh23453 closed 6 years ago

jh23453 commented 7 years ago

We often use "crm_mon -rRA" (--inactive --show-detail --show-node-attributes) to view the cluster status (master/slave on which host, sync status, HANA status). Until now we have used HANA Studio to view the detailed SR status, because SFAIL can have two meanings:

  1. Sync is still running, but will finish eventually
  2. Sync has finally failed (needs full-sync etc.)

We might use systemReplicationStatus.py on the command line as well, but will need both crm_mon and systemReplicationStatus.py for a complete status.

I've tried to get a more detailed status in one line and am thinking about adding it to "crm_mon -A". Right now we have the attribute `hana${sid}_syncstate` (SOK, SFAIL, UNKNOWN) for the secondary and PRIM for the primary HANA DB. What would be a useful, more detailed status (an attribute like `hana${sid}_sync_detail`)?

Right now I check the detailed state with a standalone awk-script (around 10 lines...), but if you think that would be a useful addition I'll try to add it to the cluster attributes and provide a patch for SAPHana.

fmherschel commented 7 years ago

Why not use SAPHanaSR-showAttr?

jh23453 commented 7 years ago

SAPHanaSR-showAttr shows exactly the attributes that crm_mon has (but in less screen space). Still, the sync state for the secondary is either SOK or SFAIL. But SFAIL has two possible reasons:

  1. Sync has failed (and intervention by the HANA admin is required)
  2. Secondary is still syncing, but will eventually succeed.

Right now the admin must look at HANA Studio or landscape.py to distinguish the two.

When we first implemented our cluster, the Linux admins saw SFAIL and always thought that HANA S/R had failed, even when it was still syncing. So the HANA admins looked the state up and provided updates to the Linux admins. I think it would be nice to see what the reason for SFAIL is.

Another possibility could be to split the SFAIL into two states - SFAIL for 1. and STILL-SYNCING for 2. But that would require some more extensive changes to the SAPHana logic - instead of simply adding another attribute to display.
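The distinction can also be made in a display-only helper, without touching the cluster attribute. The following is a minimal sketch, assuming the return codes of systemReplicationStatus.py are the ones handled by the script posted later in this thread (10 to 15). The function name `sfail_reason` and the labels it prints are hypothetical and not part of SAPHanaSR; treating return code 11 as "needs intervention" is likewise an assumption.

```shell
# Hypothetical helper (not part of SAPHanaSR): map a return code of
# systemReplicationStatus.py to a reason that can be shown next to SFAIL.
sfail_reason() {
  case "$1" in
    13|14) echo "STILL-SYNCING" ;;  # initializing or syncing, expected to finish
    15)    echo "IN-SYNC" ;;        # all services active
    11)    echo "FAILED" ;;         # assumption: error means admin intervention
    10)    echo "NO-SR" ;;          # no system replication configured
    *)     echo "UNKNOWN" ;;        # 12 or anything unexpected
  esac
}

# Example: a secondary that is still catching up.
sfail_reason 14
```

Because the helper only translates a number into a label, it leaves the SOK/SFAIL values used by the resource agent unchanged.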

Does that clarify what I think?

fmherschel commented 7 years ago

I do not think that we can show 'all' possible status situations/information around HANA just using crm_mon. Maybe if you write an enhanced wrapper for systemReplicationStatus.py and tell your admins to call that tool, if the attribute is set to "SFAIL" that could be a solution.
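A rough sketch of that idea: a small function that runs a detail command only when the attribute value is "SFAIL". The function name `sr_detail_if_sfail` is hypothetical, and the attribute value is passed in as an argument here, since how it is read from the cluster depends on the installation.

```shell
# Hypothetical wrapper: run a detail command only when the sync attribute
# is SFAIL. $1 is the attribute value as displayed by crm_mon; the remaining
# arguments are the command to run for the detailed status.
sr_detail_if_sfail() {
  sync_state=$1
  shift
  if [ "$sync_state" = "SFAIL" ]; then
    "$@"
  else
    echo "sync state is $sync_state, no detail needed"
  fi
}

# Example with a stand-in command; in practice this would be the wrapper
# around systemReplicationStatus.py.
sr_detail_if_sfail SFAIL echo "detail output goes here"
sr_detail_if_sfail SOK echo "detail output goes here"
```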

Changing from "SOK/SFAIL" to "SOK/SFAIL/STILL-SYNCING" is not an option, because this would break backward compatibility and it would also not be compatible with Scale-Out where we need to use a different interface to be informed (HA/DR provider). In this latter case we do not have a return code but get called, if the SR is in sync again.

jh23453 commented 7 years ago

fmherschel notifications@github.com writes:

> I do not think that we can show 'all' possible status situations/information around HANA just using crm_mon. Maybe if you write an enhanced wrapper for systemReplicationStatus.py and tell your admins to call that tool, if the attribute is set to "SFAIL" that could be a solution.

I guess so - I'll append the script I did for us next week to the issue and close it.

> Changing from "SOK/SFAIL" to "SOK/SFAIL/STILL-SYNCING" is not an option, because this would break backward compatibility and it would also not be compatible with Scale-Out where we need to use a different interface to be informed (HA/DR provider). In this latter case we do not have a return code but get called, if the SR is in sync again.

Yes, that's also what I saw - adding another state would be too complex even in our (simple) scenario.

Thanks for your feedback and your work - we're quite happy with our current installation.


fmherschel commented 7 years ago

> I guess so - I'll append the script I did for us next week to the issue and close it.

That would be great. We would review it and add it to the 'tools' of the SAPHanaSR package. Alternatively you could create a pull request to this project. Add your tool to subdirectory "tools".

jh23453 commented 7 years ago

Here's the script. We have it running on the (new) master to monitor sync progress/state as <sid>adm. We take the output from systemReplicationStatus.py and display only the services that are not OK, together with their detailed status. We run it with "watch ./sr_status_short.sh" to get the current state every two seconds.

If you have questions or I should add comments, feel free to ask.

#!/bin/bash
# Print a one-line summary of the HANA system replication state.

# Run the SAP status tool; keep its table output and its return code.
FULL_SR_STATUS=$(python "/hana/shared/$SAPSYSTEMNAME/exe/linuxppc64/hdb/python_support/systemReplicationStatus.py" 2>/dev/null); srRc=$?

# Map the return code to a readable state. Per-service detail is only
# shown for states where the tool prints a replication table.
case $srRc in
  10) sr_state="No HANA System Replication";show_detail=0;;
  11) sr_state="Error"                     ;show_detail=0;;
  12) sr_state="Unknown"                   ;show_detail=0;;
  13) sr_state="Initializing"              ;show_detail=1;;
  14) sr_state="Syncing"                   ;show_detail=1;;
  15) sr_state="Active (all services in sync)"  ;show_detail=1;;
  *)  sr_state="Unknown Status (rc=$srRc)" ;show_detail=0;;
esac

if [ "$show_detail" = "1" ]; then
  # Skip the three header lines, then list every service whose replication
  # status (field 14) is not ACTIVE as service:status(details).
  sr_state_detail=$(gawk -F '|' \
  'function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
  function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
  function trim(s)  { return rtrim(ltrim(s)); }

  /^\|/ && NR>3 {
    state=trim($14)
    if ( state != "ACTIVE" ) {
        if ( out != "" ) { out=out "," }
        out=out trim($4) ":"
        out=out state "(" trim($15) ")" }}
  END { if ( out == "" ) {
    print "all services in sync"
  } else {
    print out }}' <<< "$FULL_SR_STATUS")

  echo "$sr_state: $sr_state_detail"
else
  echo "$sr_state"
fi

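
To try the table filter without a running HANA system, it can be fed a hand-written table. The following sketch wraps the same idea in a function that reads from stdin. It uses plain `awk` instead of `gawk` and adds the comma separator only when a non-ACTIVE service is actually reported. The function name `summarize_sr` and the sample rows are made up for illustration and are not real HANA output; the column positions (4, 14 and 15) are taken from the script above.

```shell
# Summarise non-ACTIVE services from a systemReplicationStatus.py-style table.
# Reads the table on stdin; fields 4, 14 and 15 are service name,
# replication status and status details, as in the script above.
summarize_sr() {
  awk -F '|' '
    function trim(s) { gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", s); return s }
    /^\|/ && NR > 3 {
      state = trim($14)
      if (state != "ACTIVE") {
        if (out != "") out = out ","
        out = out trim($4) ":" state "(" trim($15) ")"
      }
    }
    END { print (out == "" ? "all services in sync" : out) }'
}

# Made-up sample: three header lines followed by two services.
sample='| Host | Port | Service Name | Volume ID | Site ID | Site Name | Sec Host | Sec Port | Sec Site ID | Sec Site Name | Sec Active | Mode | Status | Status Details |
| | | | | | | | | | | | | | |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| hana1 | 30001 | nameserver | 1 | 1 | A | hana2 | 30001 | 2 | B | YES | SYNC | ACTIVE | |
| hana1 | 30003 | indexserver | 2 | 1 | A | hana2 | 30003 | 2 | B | YES | SYNC | SYNCING | Full Replica: 42 % |'

printf '%s\n' "$sample" | summarize_sr
# prints: indexserver:SYNCING(Full Replica: 42 %)
```

Reading from stdin keeps the parsing separate from the call to systemReplicationStatus.py, so the filter can be checked against saved output.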
fmherschel commented 6 years ago

Thank you for providing the code, closing the issue.