Seagate / halon

High availability solution
Apache License 2.0
1 stars 0 forks source link

HALON-865: Add ST to verify satellite restart. #1505

Closed 1468ca0b-2a64-4fb4-8e52-ea5806644b4c closed 5 years ago

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

This ST verify if mero process restart on satellite restart. Currently ST work on vagrant cluster for multi-node setup and skips for single node setup.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

Introduce cluster_status function.

cluster_status() {
    hctl mero status | head -1 | awk '{print $3}'
}

if [[ $(cluster_status) == 'OFFLINE' ]]; then
    hctl mero start
fi
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

[style] Use local within functions to avoid polluting the namespace.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

JSON output of Halon is not trustworthy.

In this snippet of hctl mero status --json output

              "crpServices": [
                {
                  "crsDevices": [],
                  "crsService": {
                    "s_endpoints": [
                      "192.168.195.141@tcp:12345:34:101"
                    ],
                    "s_type": {
                      "tag": "CST_HA"
                    },
                    "s_fid": {
                      "f_key": 43,
                      "f_container": 8286623314361713000
                    }
                  },

we see a fid with f_container part equal to 8286623314361713000 = 0x7300000000000168. The proper value would be 0x7300000000000001.

0x7300000000000001 -- good
0x7300000000000168 -- bad

I have opened HALON-882 for this.

There is also HALON-872 bug.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Need to increase time to check state change. on my vagrant setup it always takes time more than 60 secs.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Agreed, since st script itself providing cluster config now.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

@rajanikantchirmade I've updated the pull request. Please check if there are any mistakes.

Note, that I have removed

sudo sed -i 's/data_units: 8/data_units: 4/' /etc/halon/halon_facts.yaml

line from the script, because I am using /dev/sd[b-g] pattern for disks. If this is a problem, let me know and I will modify the script.

I think we can add N,K,P values in cluster.yaml

pdclust-args: [N, K, P]

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

@rajanikantchirmade I've updated the pull request. Please check if there are any mistakes.

Note, that I have removed

sudo sed -i 's/data_units: 8/data_units: 4/' /etc/halon/halon_facts.yaml

line from the script, because I am using /dev/sd[b-g] pattern for disks. If this is a problem, let me know and I will modify the script.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

@chumakd Would you look at my commits please?

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

abort with error message would be much more useful.

    abort "m0d has not been restarted on $SAT"
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

This check is pretty useless. It doesn't care if halond processes are ‘online’ or not, it just... counts the entries in the resource graph? What for? We've specified four (4) nodes in cluster configuration. There will be just as many ‘halon’ lines in hctl mero status output.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

  1. There is no point in using report_and_exit when it is known in advance that the exit code will be non-zero.
  2. Failure of cluster_bootstrap will terminate the script due to set -e in st/common. Lines 70–72 are not needed.
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

Use a helper function:

abort() {
    [[ $# == 0 ]] || {
        echo "$@" >&2
        echo 'Aborting...' >&2
    }
    hctl mero stop
    $H0 fini
    exit 1
}

report_and_exit with non-zero 2nd argument only adds noise; die or exit 1 should be used instead.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

[nit] < is simpler to read. Please replace with

if ((nr_nodes < 2)); then

(Note that $ is not necessary in ((...)) expressions.)

See http://www.tldp.org/LDP/abs/html/arithexp.html .

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

s/g//

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

Hard-coded fids are too fragile. A minor change to Halon RG generation procedure may shift the numbers, and we'll have hard time debugging this script.

You should retrieve fids from hctl mero status output.

SAT='ssu1'

init_fids() {
    hctl mero status | awk -v HOST=$SAT '
function print_or_die(s) {
    if (!s) {
        print "Parsing error" > "/dev/stderr"
        exit 1
    }
    print s
}

$3 ~ /^0x6e/ {
    collect_p = ($4 == HOST);
    if (collect_p)
        node = $3
}

collect_p && $3 ~ /^0x72/ {
    if ($5 == "halon")
        proc_halond = $3
    else if ($5 == "ioservice")
        proc_m0d = $3
}

END {
    print_or_die(node)
    print_or_die(proc_halond)
    print_or_die(proc_m0d)
}
' | {
    read FID_NODE_SAT
    read FID_PROC_SAT
    read FID_PROC_M0D
    }
}
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

[optional] You can use an emoji instead of “Done.” if you like. E.g., ✔️ (:heavy_check_mark:).

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Resolved issue. Add multi-node

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Removed.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

done

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

cleanup_and_exit() calls report_and_exit ${0##*/} $?

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

(I was wrong by suggesting cluster_status name in the first place.)

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

What's the point of comparing PIDs? There is no chance for old PID to survive sudo kill -9.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

No, this is silly. You should periodically check the status of those processes in hctl mero status output. See how _wait_for_m0t1fs function, defined in st/common, does periodic checking.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

Rename to SAT for brevity and check that the host is accessible.

SAT='ssu1.local'  # Halon satellite node
pdsh -w $SAT true >/dev/null || die "$SAT is not accessible"
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

[nit] Don't put period at the end of error message unless it consists of several sentences. Rationale: Unix tradition (example).

BTW, titles don't end with period either. This goes for git commit subject lines, email subjects, pull request titles (kindly fix this one), Jira ticket titles, section headers in articles/books, etc.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

Don't worry about single-node case just yet. Pretend that this script will always be executed on a multi-node cluster.

#!/usr/bin/env bash
. $H0_SRC_DIR/st/common

M0_CLUSTER=$(mktemp)
trap "rm $M0_CLUSTER" 0

cat >$M0_CLUSTER <<'EOF'
confds: [ cmu.local ]
ssus:
  - host:  ssu1.local
    disks: /dev/sd[b-g]
  - host:  ssu2.local
    disks: /dev/sd[b-g]
clients:     [ client1.local ]
clovis-apps: [ client1.local ]
EOF

export M0_CLUSTER
$H0 init
cluster_bootstrap

[...rest of the code...]

We will ensure that this script doesn't break single-node cluster, but we'll do that later. Right now let us focus on the actual scenario (which is multi-node).

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

A system test should start with $H0 init. Only then can hctl be used.

See how other system tests are implemented.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

Let us not change the naming used in hctl mero status output without good reason.

s/status/disposition/
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

For what reason is hctl mero status failure tolerated? This doesn't look good.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

Use file name instead of "Passed.".

$ tail -1 st/t_*
==> st/t_sns-rep-reb <==
report_and_exit ${0##*/} $?

==> st/t_stop-reverts-sdev-states <==
report_and_exit ${0##*/} $?
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Fixed issues. Thank you.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Yes, Agree. Thank you.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done. Thank you.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

Don't use test && action pattern in shell scripts that have set -e enabled. The script will terminate in case test or action return non-zero code. (test && action || true would work.)

if [[ $sat_pid == $new_sat_pid ]]; then
    cleanup_and_exit 'Satellite not restarted' 1
fi

or

[[ $sat_pid != $new_sat_pid ]] || cleanup_and_exit 'Satellite not restarted' 1
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

I see no reason in using pdsh to execute command on a single node.

sat_pid=$(ssh $SAT_NODE.local pgrep halond)
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

s/Not not/Not/

[ $node_status = 'online' ] || cleanup_and_exit 'Not started' 1
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

[nit] s/\|head/ | head/