CORTX HA (High Availability) is responsible for keeping the CORTX solution available when a hardware component or software service fails. It controls the failover/failback flow for affected services and stabilizes them across the CORTX cluster.
Problem Statement
HA cannot detect events for server nodes. The monitor service requires a list of machine-ids to monitor, and that list no longer contains the server node machine-ids because of a change in cluster.conf. Previously, cluster.conf listed the ha-proxy service, which was used to identify server pods. Now that s3 has been replaced by rgw, the rgw component does not list any services.
https://jts.seagate.com/browse/CORTX-27961
Design
Use the COMPONENT_RGW constant defined in cortx-utils to identify server pods. A server pod's config looks like this:
9f4033f2f0304859b175ae9680c28a8d:
  cluster_id: 1a3acac8233d425c8dbfad522a99feca
  components:
  - name: utils
  - name: hare
  - name: rgw
  hostname: cortx-server-headless-svc-ssc-vm-g4-rhev4-0582
  name: cortx-server-headless-svc-ssc-vm-g4-rhev4-0582
  num_components: 3
  storage_set: storage-set-1
  type: server_node
So, instead of "services", "name" is passed in the recursive search pattern of ConfStoreSearch.
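The idea can be sketched in plain Python, independent of the actual ConfStoreSearch API (which is not reproduced here). This is only an illustration: the function name `find_server_node_ids` is hypothetical, the value "rgw" for COMPONENT_RGW is an assumption, and the component list of the second node is made up for contrast.

```python
from typing import Any, Dict, List

# Assumption: stands in for the COMPONENT_RGW constant from cortx-utils.
COMPONENT_RGW = "rgw"


def find_server_node_ids(nodes: Dict[str, Dict[str, Any]],
                         component: str = COMPONENT_RGW) -> List[str]:
    """Return machine-ids of nodes whose components list contains `component`."""
    matched = []
    for machine_id, node in nodes.items():
        # Match on the component "name" key rather than on "services".
        names = [entry.get("name") for entry in node.get("components", [])]
        if component in names:
            matched.append(machine_id)
    return matched


nodes = {
    # Taken from the server pod config shown above.
    "9f4033f2f0304859b175ae9680c28a8d": {
        "components": [{"name": "utils"}, {"name": "hare"}, {"name": "rgw"}],
        "type": "server_node",
    },
    # Hypothetical entry: the components of this node are assumed.
    "26a6220d9c094d829532c53be69d0b28": {
        "components": [{"name": "utils"}, {"name": "hare"}, {"name": "motr"}],
        "type": "data_node",
    },
}

server_node_ids = find_server_node_ids(nodes)
print(server_node_ids)
```

Only the node that lists rgw among its components is returned, so its machine-id can be handed to the monitor service.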
Coding
[x] Coding conventions are followed and code is consistent
Testing
[ ] Unit and System Tests are added
[ ] Test Cases cover Happy Path, Non-Happy Path and Scalability
[x] Testing was performed with RPM
Review Checklist
[x] PR is self reviewed
[x] JIRA number/GitHub Issue added to PR
[x] Jira and state/status is updated and JIRA is updated with PR link
[x] Check if the description is clear and explained
[ ] If yes for above point, is a notification sent to all other cortx components? [Y/N] N
[x] Side effects on other features (deployment/upgrade)? [Y/N] N
[x] Dependencies on other component(s)? [Y/N] N
If yes for above point, post link to the corresponding PR.
Review Checklist
[ ] Is perfline test run and the report with and without the changes updated in the PR? [Y/N]:
Documentation
Checklist for Author
[ ] Changes done to WIKI / Confluence page / Quick Start Guide
Testing details:
Testing was performed on Kanchan's setup, where the latest cluster.conf changes are present.
After my changes, the system health keys are:
[root@cortx-ha-headless-svc /]# consul kv get -http-addr=consul-server.default.svc.cluster.local:8500 --recurse cortx/ha/v1
cortx/ha/v1:
cortx/ha/v1/action/disk/failed:["publish"]
cortx/ha/v1/action/disk/online:["publish"]
cortx/ha/v1/action/node/failed:["publish"]
cortx/ha/v1/action/node/online:["publish"]
cortx/ha/v1/cluster_cardinality:{"num_nodes": 6, "node_list": ["efbc2702ad0240fb82e33c9ddd08715c", "776751796bc54a499a13b32810246184", "26a6220d9c094d829532c53be69d0b28", "af87c984ca2949ef9eae8ecfbdba8b46", "9f4033f2f0304859b175ae9680c28a8d", "eaaa21eb42ce4ba1ada69b577cf9a3b5"]}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/health:{"events": [{"event_timestamp": "1647422756.3273737", "created_timestamp": "1647422758", "status": "online", "specific_info": null}, {"event_timestamp": "1647422427", "created_timestamp": "1647422428", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647422758", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/health:{"events": [{"event_timestamp": "1647422756.3273737", "created_timestamp": "1647422758", "status": "online", "specific_info": null}, {"event_timestamp": "1647422427", "created_timestamp": "1647422427", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647422758", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/health:{"events": [{"event_timestamp": "1647422756.3273737", "created_timestamp": "1647422758", "status": "online", "specific_info": null}, {"event_timestamp": "1647422427", "created_timestamp": "1647422427", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647422758", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/26a6220d9c094d829532c53be69d0b28/health:{"events": [{"event_timestamp": "1647422756.2646506", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0582-85dbdb6d74-vx6pk", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422428", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/776751796bc54a499a13b32810246184/health:{"events": [{"event_timestamp": "1647422755.7650583", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0579-755556858d-9tbzq", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422428", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/9f4033f2f0304859b175ae9680c28a8d/health:{"events": [{"event_timestamp": "1647422756.1444752", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0582-6d47fdb4bd-v5ftr", "is_status": "True"}}, {"event_timestamp": "1647422429", "created_timestamp": "1647422429", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/af87c984ca2949ef9eae8ecfbdba8b46/health:{"events": [{"event_timestamp": "1647422755.9636948", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0534-5f965b98c9-mjxzz", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422429", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/eaaa21eb42ce4ba1ada69b577cf9a3b5/health:{"events": [{"event_timestamp": "1647422754.7373736", "created_timestamp": "1647422755", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0579-5cdb86b46b-qxhhf", "is_status": "True"}}, {"event_timestamp": "1647422430", "created_timestamp": "1647422430", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422755", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/efbc2702ad0240fb82e33c9ddd08715c/health:{"events": [{"event_timestamp": "1647422756.3273737", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0534-5659f44fc8-dj5hm", "is_status": "True"}}, {"event_timestamp": "1647422427", "created_timestamp": "1647422427", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/events/disk/failed:["hare"]
cortx/ha/v1/events/disk/online:["hare"]
cortx/ha/v1/events/node/failed:["hare"]
cortx/ha/v1/events/node/online:["hare"]
cortx/ha/v1/events/subscribe/hare:["node/online", "node/failed", "disk/online", "disk/failed"]
cortx/ha/v1/message_type/hare:ha_event_hare
Received alert in k8s monitor:
2022-03-16 09:31:55 k8s_resource_monitor [7]: INFO [publish_alert] pod_monitor sending alert on message bus {"header": {"version": "1.0", "timestamp": "1647423115.8818486", "event_id": "1647423115.8818486cafc5e422fab4b63bb6a1a29b030285c"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "9f4033f2f0304859b175ae9680c28a8d", "resource_type": "node", "resource_id": "9f4033f2f0304859b175ae9680c28a8d", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0582-6d47fdb4bd-v5ftr", "is_status": "True"}}}
2022-03-16 09:31:56 k8s_resource_monitor [7]: INFO [publish_alert] pod_monitor sending alert on message bus {"header": {"version": "1.0", "timestamp": "1647423116.2022028", "event_id": "1647423116.2022028b635c52e4cab4c6480ac93a11a633d02"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "efbc2702ad0240fb82e33c9ddd08715c", "resource_type": "node", "resource_id": "efbc2702ad0240fb82e33c9ddd08715c", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0534-5659f44fc8-dj5hm", "is_status": "True"}}}
2022-03-16 09:31:55 k8s_resource_monitor [7]: INFO [publish_alert] pod_monitor sending alert on message bus {"header": {"version": "1.0", "timestamp": "1647423115.4111571", "event_id": "1647423115.4111571c0d5e88c3b9a4a26a9bdf4484b6e9cd5"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "eaaa21eb42ce4ba1ada69b577cf9a3b5", "resource_type": "node", "resource_id": "eaaa21eb42ce4ba1ada69b577cf9a3b5", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0579-5cdb86b46b-qxhhf", "is_status": "True"}}}
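For anyone checking these alerts by hand, the JSON document that follows the log prefix can be pulled apart with a few lines of Python. This is a rough sketch, not cortx-ha code: the helper name `parse_alert` is hypothetical, and it assumes the JSON starts at the first `{` on the line, which holds for the log lines shown here.

```python
import json
from typing import Tuple


def parse_alert(log_line: str) -> Tuple[str, str, str]:
    """Return (node_id, resource_status, generation_id) from a pod_monitor log line."""
    # Assumption: everything from the first "{" to the end of the line is the alert JSON.
    alert = json.loads(log_line[log_line.index("{"):])
    payload = alert["payload"]
    return (
        payload["node_id"],
        payload["resource_status"],
        payload["specific_info"]["generation_id"],
    )


# First alert from the k8s monitor log above, split across literals for readability.
line = (
    '2022-03-16 09:31:55 k8s_resource_monitor [7]: INFO [publish_alert] '
    'pod_monitor sending alert on message bus '
    '{"header": {"version": "1.0", "timestamp": "1647423115.8818486", '
    '"event_id": "1647423115.8818486cafc5e422fab4b63bb6a1a29b030285c"}, '
    '"payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", '
    '"storageset_id": "", "node_id": "9f4033f2f0304859b175ae9680c28a8d", '
    '"resource_type": "node", "resource_id": "9f4033f2f0304859b175ae9680c28a8d", '
    '"resource_status": "online", "specific_info": {"generation_id": '
    '"cortx-server-ssc-vm-g4-rhev4-0582-6d47fdb4bd-v5ftr", "is_status": "True"}}}'
)

node_id, status, generation_id = parse_alert(line)
print(node_id, status, generation_id)
```

The node_id in the alert is the server pod's machine-id, which is what confirms that server nodes are now in the monitored list.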
Tried replica-set down and up for server pod:
[root@cortx-ha-headless-svc /]# consul kv get -http-addr=consul-server.default.svc.cluster.local:8500 --recurse cortx/ha/v1
cortx/ha/v1:
cortx/ha/v1/action/disk/failed:["publish"]
cortx/ha/v1/action/disk/online:["publish"]
cortx/ha/v1/action/node/failed:["publish"]
cortx/ha/v1/action/node/online:["publish"]
cortx/ha/v1/cluster_cardinality:{"num_nodes": 6, "node_list": ["efbc2702ad0240fb82e33c9ddd08715c", "776751796bc54a499a13b32810246184", "26a6220d9c094d829532c53be69d0b28", "af87c984ca2949ef9eae8ecfbdba8b46", "9f4033f2f0304859b175ae9680c28a8d", "eaaa21eb42ce4ba1ada69b577cf9a3b5"]}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/health:{"events": [{"event_timestamp": "1647423426.8650668", "created_timestamp": "1647423428", "status": "online", "specific_info": null}, {"event_timestamp": "1647423232.185856", "created_timestamp": "1647423232", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647423428", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/health:{"events": [{"event_timestamp": "1647423426.8650668", "created_timestamp": "1647423428", "status": "online", "specific_info": null}, {"event_timestamp": "1647423232.185856", "created_timestamp": "1647423232", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647423428", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/health:{"events": [{"event_timestamp": "1647423426.8650668", "created_timestamp": "1647423427", "status": "online", "specific_info": null}, {"event_timestamp": "1647423232.185856", "created_timestamp": "1647423232", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647423427", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/26a6220d9c094d829532c53be69d0b28/health:{"events": [{"event_timestamp": "1647422756.2646506", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0582-85dbdb6d74-vx6pk", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422428", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/776751796bc54a499a13b32810246184/health:{"events": [{"event_timestamp": "1647422755.7650583", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0579-755556858d-9tbzq", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422428", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/9f4033f2f0304859b175ae9680c28a8d/health:{"events": [{"event_timestamp": "1647422756.1444752", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0582-6d47fdb4bd-v5ftr", "is_status": "True"}}, {"event_timestamp": "1647422429", "created_timestamp": "1647422429", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/af87c984ca2949ef9eae8ecfbdba8b46/health:{"events": [{"event_timestamp": "1647422755.9636948", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0534-5f965b98c9-mjxzz", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422429", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/eaaa21eb42ce4ba1ada69b577cf9a3b5/health:{"events": [{"event_timestamp": "1647422754.7373736", "created_timestamp": "1647422755", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0579-5cdb86b46b-qxhhf", "is_status": "True"}}, {"event_timestamp": "1647422430", "created_timestamp": "1647422430", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422755", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/efbc2702ad0240fb82e33c9ddd08715c/health:{"events": [{"event_timestamp": "1647423426.8650668", "created_timestamp": "1647423426", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0534-5659f44fc8-nntj6"}}, {"event_timestamp": "1647423232.185856", "created_timestamp": "1647423232", "status": "failed", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0534-5659f44fc8-dj5hm"}}], "action": {"modified_timestamp": "1647423426", "status": "pending"}, "attributes": {}}
cortx/ha/v1/events/disk/failed:["hare"]
cortx/ha/v1/events/disk/online:["hare"]
cortx/ha/v1/events/node/failed:["hare"]
cortx/ha/v1/events/node/online:["hare"]
cortx/ha/v1/events/subscribe/hare:["node/online", "node/failed", "disk/online", "disk/failed"]
cortx/ha/v1/message_type/hare:ha_event_hare
[root@cortx-ha-headless-svc /]#