Seagate / cortx-ha

CORTX HA (High Availability) is responsible for ensuring that the CORTX solution remains available in case of hardware component or software service failures. It handles the failover/failback control flow for affected services and stabilizes them across the CORTX cluster.
https://github.com/Seagate/cortx
GNU Affero General Public License v3.0

Cortx-27961: [ConfStore search API] Use COMPONENT_RGW instead of SERVICE_S3_HAPROXY to identify server pods #668

Closed Madhura-08 closed 2 years ago

Madhura-08 commented 2 years ago

Problem Statement

HA cannot detect events for server nodes because the monitor service requires a list of ids (machine-ids) that it can monitor, and that list no longer contains the server node machine-ids. The cause is a change in cluster.conf: previously, cluster.conf listed the ha-proxy service, which was used to identify server pods. Now that s3 has been replaced by rgw, there are no services listed under rgw for the time being. https://jts.seagate.com/browse/CORTX-27961

Design

Use the COMPONENT_RGW constant defined in cortx-utils to identify server pods. The server pod config will look like this:
9f4033f2f0304859b175ae9680c28a8d:
  cluster_id: 1a3acac8233d425c8dbfad522a99feca
  components:
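The lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the value `"rgw"` stands in for the cortx-utils constant, the entries under `components` are assumed to be a list of mappings with a `name` key (the config above is cut off after `components:`), and `find_server_node_ids` plus the `utils`/`motr` component names are hypothetical.

```python
from typing import Any, Dict, List

# Assumption: stand-in for the COMPONENT_RGW constant from cortx-utils;
# its real value is not shown in this PR.
COMPONENT_RGW = "rgw"


def find_server_node_ids(nodes: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the machine-ids of nodes whose component list includes RGW."""
    server_ids = []
    for machine_id, node_conf in nodes.items():
        components = node_conf.get("components") or []
        # Assumed shape: each component entry is a mapping with a "name" key.
        names = {c.get("name") for c in components if isinstance(c, dict)}
        if COMPONENT_RGW in names:
            server_ids.append(machine_id)
    return server_ids


# Hypothetical node section. Only the machine-ids and cluster_id come from
# this PR; the component entries are made up for illustration.
example_nodes = {
    "9f4033f2f0304859b175ae9680c28a8d": {
        "cluster_id": "1a3acac8233d425c8dbfad522a99feca",
        "components": [{"name": "utils"}, {"name": "rgw"}],
    },
    "26a6220d9c094d829532c53be69d0b28": {
        "cluster_id": "1a3acac8233d425c8dbfad522a99feca",
        "components": [{"name": "utils"}, {"name": "motr"}],
    },
}

print(find_server_node_ids(example_nodes))
```

Matching on the component rather than on a service name keeps the lookup working while rgw has no services listed in cluster.conf.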

Coding

Testing

Review Checklist

Documentation

Checklist for Author

Testing details:

Testing was performed on Kanchan's setup, where the latest cluster.conf changes are present.

After my changes, the system health keys are:

[root@cortx-ha-headless-svc /]# consul kv get -http-addr=consul-server.default.svc.cluster.local:8500 --recurse cortx/ha/v1
cortx/ha/v1:
cortx/ha/v1/action/disk/failed:["publish"]
cortx/ha/v1/action/disk/online:["publish"]
cortx/ha/v1/action/node/failed:["publish"]
cortx/ha/v1/action/node/online:["publish"]
cortx/ha/v1/cluster_cardinality:{"num_nodes": 6, "node_list": ["efbc2702ad0240fb82e33c9ddd08715c", "776751796bc54a499a13b32810246184", "26a6220d9c094d829532c53be69d0b28", "af87c984ca2949ef9eae8ecfbdba8b46", "9f4033f2f0304859b175ae9680c28a8d", "eaaa21eb42ce4ba1ada69b577cf9a3b5"]}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/health:{"events": [{"event_timestamp": "1647422756.3273737", "created_timestamp": "1647422758", "status": "online", "specific_info": null}, {"event_timestamp": "1647422427", "created_timestamp": "1647422428", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647422758", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/health:{"events": [{"event_timestamp": "1647422756.3273737", "created_timestamp": "1647422758", "status": "online", "specific_info": null}, {"event_timestamp": "1647422427", "created_timestamp": "1647422427", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647422758", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/health:{"events": [{"event_timestamp": "1647422756.3273737", "created_timestamp": "1647422758", "status": "online", "specific_info": null}, {"event_timestamp": "1647422427", "created_timestamp": "1647422427", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647422758", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/26a6220d9c094d829532c53be69d0b28/health:{"events": [{"event_timestamp": "1647422756.2646506", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0582-85dbdb6d74-vx6pk", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422428", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/776751796bc54a499a13b32810246184/health:{"events": [{"event_timestamp": "1647422755.7650583", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0579-755556858d-9tbzq", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422428", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/9f4033f2f0304859b175ae9680c28a8d/health:{"events": [{"event_timestamp": "1647422756.1444752", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0582-6d47fdb4bd-v5ftr", "is_status": "True"}}, {"event_timestamp": "1647422429", "created_timestamp": "1647422429", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/af87c984ca2949ef9eae8ecfbdba8b46/health:{"events": [{"event_timestamp": "1647422755.9636948", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0534-5f965b98c9-mjxzz", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422429", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/eaaa21eb42ce4ba1ada69b577cf9a3b5/health:{"events": [{"event_timestamp": "1647422754.7373736", "created_timestamp": "1647422755", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0579-5cdb86b46b-qxhhf", "is_status": "True"}}, {"event_timestamp": "1647422430", "created_timestamp": "1647422430", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422755", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/efbc2702ad0240fb82e33c9ddd08715c/health:{"events": [{"event_timestamp": "1647422756.3273737", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0534-5659f44fc8-dj5hm", "is_status": "True"}}, {"event_timestamp": "1647422427", "created_timestamp": "1647422427", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/events/disk/failed:["hare"]
cortx/ha/v1/events/disk/online:["hare"]
cortx/ha/v1/events/node/failed:["hare"]
cortx/ha/v1/events/node/online:["hare"]
cortx/ha/v1/events/subscribe/hare:["node/online", "node/failed", "disk/online", "disk/failed"]
cortx/ha/v1/message_type/hare:ha_event_hare

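To check a dump like this without reading every key, a small helper along the following lines may help. It is a sketch, not part of cortx-ha: `latest_node_status` is a hypothetical name, it assumes the output is read one key per line (as `consul kv get` prints it), and it assumes the first entry in `events` is the most recent one, which matches the timestamps shown here.

```python
import json
import re
from typing import Dict, Iterable

# Matches node health keys such as .../node/<32-hex machine-id>/health
NODE_HEALTH_KEY = re.compile(r"/node/([0-9a-f]{32})/health$")


def latest_node_status(kv_lines: Iterable[str]) -> Dict[str, str]:
    """Map each node machine-id to the status of its first listed event."""
    statuses = {}
    for line in kv_lines:
        # Keys contain no ':' so the first ':' separates key from value.
        key, sep, value = line.strip().partition(":")
        match = NODE_HEALTH_KEY.search(key)
        if not sep or not match:
            continue  # not a node health key
        events = json.loads(value)["events"]
        if events:
            # Assumption: events[0] is the newest event.
            statuses[match.group(1)] = events[0]["status"]
    return statuses


# Two lines taken from the output in this PR.
sample = [
    'cortx/ha/v1/action/node/online:["publish"]',
    'cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca'
    '/site/0xBAADFOOD/rack/0xBAADFOOD/node/9f4033f2f0304859b175ae9680c28a8d/health:'
    '{"events": [{"event_timestamp": "1647422756.1444752", '
    '"created_timestamp": "1647422757", "status": "online", '
    '"specific_info": {"generation_id": '
    '"cortx-server-ssc-vm-g4-rhev4-0582-6d47fdb4bd-v5ftr", "is_status": "True"}}, '
    '{"event_timestamp": "1647422429", "created_timestamp": "1647422429", '
    '"status": "unknown", "specific_info": null}], '
    '"action": {"modified_timestamp": "1647422757", "status": "pending"}, '
    '"attributes": {}}',
]

print(latest_node_status(sample))
```
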
Received alert in k8s monitor:
2022-03-16 09:31:55 k8s_resource_monitor [7]: INFO [publish_alert] pod_monitor sending alert on message bus {"header": {"version": "1.0", "timestamp": "1647423115.8818486", "event_id": "1647423115.8818486cafc5e422fab4b63bb6a1a29b030285c"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "9f4033f2f0304859b175ae9680c28a8d", "resource_type": "node", "resource_id": "9f4033f2f0304859b175ae9680c28a8d", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0582-6d47fdb4bd-v5ftr", "is_status": "True"}}}
2022-03-16 09:31:56 k8s_resource_monitor [7]: INFO [publish_alert] pod_monitor sending alert on message bus {"header": {"version": "1.0", "timestamp": "1647423116.2022028", "event_id": "1647423116.2022028b635c52e4cab4c6480ac93a11a633d02"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "efbc2702ad0240fb82e33c9ddd08715c", "resource_type": "node", "resource_id": "efbc2702ad0240fb82e33c9ddd08715c", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0534-5659f44fc8-dj5hm", "is_status": "True"}}}
2022-03-16 09:31:55 k8s_resource_monitor [7]: INFO [publish_alert] pod_monitor sending alert on message bus {"header": {"version": "1.0", "timestamp": "1647423115.4111571", "event_id": "1647423115.4111571c0d5e88c3b9a4a26a9bdf4484b6e9cd5"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "eaaa21eb42ce4ba1ada69b577cf9a3b5", "resource_type": "node", "resource_id": "eaaa21eb42ce4ba1ada69b577cf9a3b5", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0579-5cdb86b46b-qxhhf", "is_status": "True"}}}
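For reviewers who want to confirm which pods the alerts refer to, the JSON part of each log line can be pulled out as sketched below. `parse_alert` is a hypothetical helper written for this note, not a cortx-ha function, and it assumes the JSON document begins at the first `{` on the line, as it does in the lines above.

```python
import json
from typing import Dict


def parse_alert(log_line: str) -> Dict[str, str]:
    """Extract node id, status and generation id from a publish_alert log line."""
    # Assumption: everything from the first '{' onward is one JSON document.
    payload = json.loads(log_line[log_line.index("{"):])["payload"]
    info = payload.get("specific_info") or {}
    return {
        "node_id": payload["node_id"],
        "status": payload["resource_status"],
        "generation_id": info.get("generation_id", ""),
    }


# First alert from the log output in this PR.
line = (
    '2022-03-16 09:31:55 k8s_resource_monitor [7]: INFO [publish_alert] '
    'pod_monitor sending alert on message bus '
    '{"header": {"version": "1.0", "timestamp": "1647423115.8818486", '
    '"event_id": "1647423115.8818486cafc5e422fab4b63bb6a1a29b030285c"}, '
    '"payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", '
    '"storageset_id": "", "node_id": "9f4033f2f0304859b175ae9680c28a8d", '
    '"resource_type": "node", "resource_id": "9f4033f2f0304859b175ae9680c28a8d", '
    '"resource_status": "online", "specific_info": {"generation_id": '
    '"cortx-server-ssc-vm-g4-rhev4-0582-6d47fdb4bd-v5ftr", "is_status": "True"}}}'
)

alert = parse_alert(line)
print(alert["node_id"], alert["status"], alert["generation_id"])
```

A `generation_id` starting with `cortx-server` indicates that the alert came from a server pod, which is the case this PR is meant to restore.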

Tried replica-set down and up for server pod:
[root@cortx-ha-headless-svc /]# consul kv get -http-addr=consul-server.default.svc.cluster.local:8500 --recurse cortx/ha/v1
cortx/ha/v1:
cortx/ha/v1/action/disk/failed:["publish"]
cortx/ha/v1/action/disk/online:["publish"]
cortx/ha/v1/action/node/failed:["publish"]
cortx/ha/v1/action/node/online:["publish"]
cortx/ha/v1/cluster_cardinality:{"num_nodes": 6, "node_list": ["efbc2702ad0240fb82e33c9ddd08715c", "776751796bc54a499a13b32810246184", "26a6220d9c094d829532c53be69d0b28", "af87c984ca2949ef9eae8ecfbdba8b46", "9f4033f2f0304859b175ae9680c28a8d", "eaaa21eb42ce4ba1ada69b577cf9a3b5"]}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/health:{"events": [{"event_timestamp": "1647423426.8650668", "created_timestamp": "1647423428", "status": "online", "specific_info": null}, {"event_timestamp": "1647423232.185856", "created_timestamp": "1647423232", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647423428", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/health:{"events": [{"event_timestamp": "1647423426.8650668", "created_timestamp": "1647423428", "status": "online", "specific_info": null}, {"event_timestamp": "1647423232.185856", "created_timestamp": "1647423232", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647423428", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/health:{"events": [{"event_timestamp": "1647423426.8650668", "created_timestamp": "1647423427", "status": "online", "specific_info": null}, {"event_timestamp": "1647423232.185856", "created_timestamp": "1647423232", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647423427", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/26a6220d9c094d829532c53be69d0b28/health:{"events": [{"event_timestamp": "1647422756.2646506", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0582-85dbdb6d74-vx6pk", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422428", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/776751796bc54a499a13b32810246184/health:{"events": [{"event_timestamp": "1647422755.7650583", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0579-755556858d-9tbzq", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422428", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/9f4033f2f0304859b175ae9680c28a8d/health:{"events": [{"event_timestamp": "1647422756.1444752", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0582-6d47fdb4bd-v5ftr", "is_status": "True"}}, {"event_timestamp": "1647422429", "created_timestamp": "1647422429", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/af87c984ca2949ef9eae8ecfbdba8b46/health:{"events": [{"event_timestamp": "1647422755.9636948", "created_timestamp": "1647422757", "status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0534-5f965b98c9-mjxzz", "is_status": "True"}}, {"event_timestamp": "1647422428", "created_timestamp": "1647422429", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422757", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/eaaa21eb42ce4ba1ada69b577cf9a3b5/health:{"events": [{"event_timestamp": "1647422754.7373736", "created_timestamp": "1647422755", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0579-5cdb86b46b-qxhhf", "is_status": "True"}}, {"event_timestamp": "1647422430", "created_timestamp": "1647422430", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647422755", "status": "pending"}, "attributes": {}}
cortx/ha/v1/cortx/ha/system/cluster/1a3acac8233d425c8dbfad522a99feca/site/0xBAADFOOD/rack/0xBAADFOOD/node/efbc2702ad0240fb82e33c9ddd08715c/health:{"events": [{"event_timestamp": "1647423426.8650668", "created_timestamp": "1647423426", "status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0534-5659f44fc8-nntj6"}}, {"event_timestamp": "1647423232.185856", "created_timestamp": "1647423232", "status": "failed", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0534-5659f44fc8-dj5hm"}}], "action": {"modified_timestamp": "1647423426", "status": "pending"}, "attributes": {}}
cortx/ha/v1/events/disk/failed:["hare"]
cortx/ha/v1/events/disk/online:["hare"]
cortx/ha/v1/events/node/failed:["hare"]
cortx/ha/v1/events/node/online:["hare"]
cortx/ha/v1/events/subscribe/hare:["node/online", "node/failed", "disk/online", "disk/failed"]
cortx/ha/v1/message_type/hare:ha_event_hare
[root@cortx-ha-headless-svc /]#