Closed 1468ca0b-2a64-4fb4-8e52-ea5806644b4c closed 5 years ago
Created by: vvv
Introduce cluster_status
function.
cluster_status() {
hctl mero status | head -1 | awk '{print $3}'
}
if [[ $(cluster_status) == 'OFFLINE' ]]; then
hctl mero start
fi
Created by: vvv
[style] Use local
within functions to avoid polluting the namespace.
Created by: vvv
JSON output of Halon is not trustworthy.
In this snippet of hctl mero status --json
output
"crpServices": [
{
"crsDevices": [],
"crsService": {
"s_endpoints": [
"192.168.195.141@tcp:12345:34:101"
],
"s_type": {
"tag": "CST_HA"
},
"s_fid": {
"f_key": 43,
"f_container": 8286623314361713000
}
},
we see a fid with f_container
part equal to 8286623314361713000 = 0x7300000000000168. The proper value would be 0x7300000000000001.
0x7300000000000001 -- good
0x7300000000000168 -- bad
I have opened HALON-882 for this.
There is also HALON-872 bug.
Created by: rajanikantchirmade
Need to increase time to check state change. on my vagrant setup it always takes time more than 60 secs.
Created by: rajanikantchirmade
Agreed, since st script itself providing cluster config now.
Created by: rajanikantchirmade
@rajanikantchirmade I've updated the pull request. Please check if there are any mistakes.
Note, that I have removed
sudo sed -i 's/data_units: 8/data_units: 4/' /etc/halon/halon_facts.yaml
line from the script, because I am using
/dev/sd[b-g]
pattern for disks. If this is a problem, let me know and I will modify the script.
I think we can add N,K,P values in cluster.yaml
pdclust-args: [N, K, P]
Created by: vvv
@rajanikantchirmade I've updated the pull request. Please check if there are any mistakes.
Note, that I have removed
sudo sed -i 's/data_units: 8/data_units: 4/' /etc/halon/halon_facts.yaml
line from the script, because I am using /dev/sd[b-g]
pattern for disks. If this is a problem, let me know and I will modify the script.
Created by: vvv
@chumakd Would you look at my commits please?
Created by: vvv
abort
with error message would be much more useful.
abort "m0d has not been restarted on $SAT"
Created by: vvv
This check is pretty useless. It doesn't care if halond processes are ‘online’ or not, it just... counts the entries in the resource graph? What for? We've specified four (4) nodes in cluster configuration. There will be just as many ‘halon’ lines in hctl mero status
output.
Created by: vvv
report_and_exit
when it is known in advance that the exit code will be non-zero.cluster_bootstrap
will terminate the script due to set -e
in st/common
. Lines 70–72 are not needed.Created by: vvv
Use a helper function:
abort() {
[[ $# == 0 ]] || {
echo "$@" >&2
echo 'Aborting...' >&2
}
hctl mero stop
$H0 fini
exit 1
}
report_and_exit
with non-zero 2nd argument only adds noise; die
or exit 1
should be used instead.
Created by: vvv
[nit] <
is simpler to read.
Please replace with
if ((nr_nodes < 2)); then
(Note that $
is not necessary in ((...))
expressions.)
Created by: vvv
s/g//
Created by: vvv
Hard-coded fids are too fragile. A minor change to Halon RG generation procedure may shift the numbers, and we'll have hard time debugging this script.
You should retrieve fids from hctl mero status
output.
SAT='ssu1'
init_fids() {
hctl mero status | awk -v HOST=$SAT '
function print_or_die(s) {
if (!s) {
print "Parsing error" > "/dev/stderr"
exit 1
}
print s
}
$3 ~ /^0x6e/ {
collect_p = ($4 == HOST);
if (collect_p)
node = $3
}
collect_p && $3 ~ /^0x72/ {
if ($5 == "halon")
proc_halond = $3
else if ($5 == "ioservice")
proc_m0d = $3
}
END {
print_or_die(node)
print_or_die(proc_halond)
print_or_die(proc_m0d)
}
' | {
read FID_NODE_SAT
read FID_PROC_SAT
read FID_PROC_M0D
}
}
Created by: vvv
[optional] You can use an emoji instead of “Done.” if you like.
E.g., ✔️ (:heavy_check_mark:
).
Created by: rajanikantchirmade
Resolved issue. Add multi-node
Created by: rajanikantchirmade
Removed.
Created by: rajanikantchirmade
Done
Created by: rajanikantchirmade
Done.
Created by: rajanikantchirmade
Done.
Created by: rajanikantchirmade
Done
Created by: rajanikantchirmade
done
Created by: rajanikantchirmade
cleanup_and_exit() calls report_and_exit ${0##*/} $?
Created by: vvv
(I was wrong by suggesting cluster_status
name in the first place.)
Created by: vvv
What's the point of comparing PIDs? There is no chance for old PID to survive sudo kill -9
.
Created by: vvv
No, this is silly.
You should periodically check the status of those processes in hctl mero status
output. See how _wait_for_m0t1fs
function, defined in st/common
, does periodic checking.
Created by: vvv
Rename to SAT
for brevity and check that the host is accessible.
SAT='ssu1.local' # Halon satellite node
pdsh -w $SAT true >/dev/null || die "$SAT is not accessible"
Created by: vvv
[nit] Don't put period at the end of error message unless it consists of several sentences. Rationale: Unix tradition (example).
BTW, titles don't end with period either. This goes for git commit subject lines, email subjects, pull request titles (kindly fix this one), Jira ticket titles, section headers in articles/books, etc.
Created by: vvv
Don't worry about single-node case just yet. Pretend that this script will always be executed on a multi-node cluster.
#!/usr/bin/env bash
. $H0_SRC_DIR/st/common
M0_CLUSTER=$(mktemp)
trap "rm $M0_CLUSTER" 0
cat >$M0_CLUSTER <<'EOF'
confds: [ cmu.local ]
ssus:
- host: ssu1.local
disks: /dev/sd[b-g]
- host: ssu2.local
disks: /dev/sd[b-g]
clients: [ client1.local ]
clovis-apps: [ client1.local ]
EOF
export M0_CLUSTER
$H0 init
cluster_bootstrap
[...rest of the code...]
We will ensure that this script doesn't break single-node cluster, but we'll do that later. Right now let us focus on the actual scenario (which is multi-node).
Created by: vvv
A system test should start with $H0 init
. Only then can hctl
be used.
See how other system tests are implemented.
Created by: vvv
Let us not change the naming used in hctl mero status
output without good reason.
s/status/disposition/
Created by: vvv
For what reason is hctl mero status
failure tolerated? This doesn't look good.
Created by: vvv
Use file name instead of "Passed."
.
$ tail -1 st/t_*
==> st/t_sns-rep-reb <==
report_and_exit ${0##*/} $?
==> st/t_stop-reverts-sdev-states <==
report_and_exit ${0##*/} $?
Created by: rajanikantchirmade
Fixed issues. Thank you.
Created by: rajanikantchirmade
Done.
Created by: rajanikantchirmade
Yes, Agree. Thank you.
Created by: rajanikantchirmade
Done.
Created by: rajanikantchirmade
Done.
Created by: rajanikantchirmade
Done.
Created by: rajanikantchirmade
Done.
Created by: rajanikantchirmade
Done.
Created by: rajanikantchirmade
Done. Thank you.
Created by: rajanikantchirmade
Done.
Created by: rajanikantchirmade
Done.
Created by: rajanikantchirmade
Done.
Created by: vvv
Don't use test && action
pattern in shell scripts that have set -e
enabled. The script will terminate in case test
or action
return non-zero code. (test && action || true
would work.)
if [[ $sat_pid == $new_sat_pid ]]; then
cleanup_and_exit 'Satellite not restarted' 1
fi
or
[[ $sat_pid != $new_sat_pid ]] || cleanup_and_exit 'Satellite not restarted' 1
Created by: vvv
I see no reason in using pdsh
to execute command on a single node.
sat_pid=$(ssh $SAT_NODE.local pgrep halond)
Created by: vvv
s/Not not/Not/
[ $node_status = 'online' ] || cleanup_and_exit 'Not started' 1
Created by: vvv
[nit] s/\|head/ | head/
Created by: rajanikantchirmade
This ST verify if mero process restart on satellite restart. Currently ST work on vagrant cluster for multi-node setup and skips for single node setup.