Seagate / cortx-ha

CORTX HA (High Availability) ensures that the CORTX solution remains available when a hardware component or software service fails. It handles the failover/failback control flow for affected services and stabilizes them across the CORTX cluster.
https://github.com/Seagate/cortx
GNU Affero General Public License v3.0

CORTX-29282 : source, cluster_id value getting empty string #664

Closed gauri-bhosale closed 2 years ago

gauri-bhosale commented 2 years ago

Problem Statement

  1. Set the payload when the HealthEvent object is initialized.
  2. Fetch cluster_id from the conf file and pass it in the event object payload.

https://jts.seagate.com/browse/CORTX-28915
https://jts.seagate.com/browse/CORTX-29282
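
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the cortx-ha implementation: the constructor signature, the plain-dict `conf`, the `NOT_DEFINED` fallback for the optional ids, and the `event_id` format are all assumptions made for the example.

```python
import time
import uuid

NOT_DEFINED = "NOT_DEFINED"  # placeholder for ids that are not configured


class HealthEvent:
    """Hypothetical event class; the payload is built once, at initialization."""

    def __init__(self, conf, node_id, resource_type, resource_id,
                 resource_status, source="monitor", specific_info=None):
        # `conf` is assumed to be a plain dict standing in for the conf file.
        cluster_id = conf.get("cluster_id")
        if not cluster_id:
            # Fail early instead of publishing an event with cluster_id "".
            raise ValueError("cluster_id is missing or empty in the configuration")

        timestamp = str(time.time())
        self.header = {
            "version": "1.0",
            "timestamp": timestamp,
            # Assumed id format: timestamp followed by a random hex string.
            "event_id": timestamp + uuid.uuid4().hex,
        }
        self.payload = {
            "source": source,
            "cluster_id": cluster_id,
            "site_id": conf.get("site_id") or NOT_DEFINED,
            "rack_id": conf.get("rack_id") or NOT_DEFINED,
            "storageset_id": conf.get("storageset_id") or NOT_DEFINED,
            "node_id": node_id,
            "resource_type": resource_type,
            "resource_id": resource_id,
            "resource_status": resource_status,
            "specific_info": specific_info or {},
        }

    def to_dict(self):
        """Return the event in the header/payload layout seen in the logs."""
        return {"header": self.header, "payload": self.payload}
```

Raising on an empty cluster_id is one possible design choice; falling back to a placeholder string would keep the monitor running but would hide the misconfiguration.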

Design

Coding

Testing

Review Checklist

Documentation

Checklist for Author

Testing Result :

python3 test_no_duplicate_alerts.py
**** Testing K8s Not Publishing Duplicate Alerts ****
2022-03-14 18:46:26 test_k8s_resource_monitor [138]: INFO [init] Started logging for service test_k8s_resource_monitor
Producer id: mock_producer, message_type: mock_message, partition: 1
2022-03-14 18:46:26 test_k8s_resource_monitor [138]: INFO [init] Initialization done for pod monitor
Publishing alert..
{'event': {'header': {'version': '1.0', 'timestamp': '112255', 'event_id': '11225568919c5ba8f88ad24cc18f632a75de115aca'}, 'payload': {'source': 'monitor', 'cluster_id': 'None', 'site_id': 'None', 'rack_id': 'None', 'storageset_id': 'None', 'node_id': 'dummy.colo.seagate.com', 'resource_type': 'host', 'resource_id': 'dummy.colo.seagate.com', 'resource_status': 'online', 'specific_info': {}}}}
Publishing alert..
{'event': {'header': {'version': '1.0', 'timestamp': '112233', 'event_id': '11223368911c9b81b8d3c84305a49c191e6a8c3bb5'}, 'payload': {'source': 'monitor', 'cluster_id': 'None', 'site_id': 'None', 'rack_id': 'None', 'storageset_id': 'None', 'node_id': 'dummy0fff6f947e8aecc6a27a25b7329', 'resource_type': 'node', 'resource_id': 'dummy02fff6f947e8aecc6a27a25b7329', 'resource_status': 'online', 'specific_info': {'generation_id': 'cortx-data-dummy0'}}}}
Publishing alert..
{'event': {'header': {'version': '1.0', 'timestamp': '112233', 'event_id': '11223368911c9b81b8d3c84305a49c191e6a8c3bb5'}, 'payload': {'source': 'monitor', 'cluster_id': 'None', 'site_id': 'None', 'rack_id': 'None', 'storageset_id': 'None', 'node_id': 'dummy0fff6f947e8aecc6a27a25b7329', 'resource_type': 'node', 'resource_id': 'dummy02fff6f947e8aecc6a27a25b7329', 'resource_status': 'offline', 'specific_info': {'generation_id': 'cortx-data-dummy0'}}}}

python3 test_publisher.py
****Event Publisher****
2022-03-14 19:19:46 event_manager [282]: INFO [init] Started logging for service event_manager
2022-03-14 19:19:46 event_manager [282]: INFO [init] MessageBus initialized as kafka
2022-03-14 19:19:46 event_manager [282]: INFO [subscribe] Received a subscribe request from csm
2022-03-14 19:19:46 event_manager [282]: INFO [add_rule] Adding rule for key: action/node:fru:disk/failed ,value: publish
2022-03-14 19:19:46 event_manager [282]: WARNING [add_rule] key value already exists for action/node:fru:disk/failed , publish
2022-03-14 19:19:46 event_manager [282]: INFO [subscribe] Successfully Subscribed component csm with message_type ha_event_csm
Subscribed csm, message type is ha_event_csm
Consuming the action event
2022-03-14 19:19:46 event_manager [282]: INFO [start] Starting the daemon for ha_event_csm-consumer-thread...
2022-03-14 19:19:46 event_manager [282]: INFO [start] The daemon ha_event_csm-consumer-thread started successfully.

Logs :

fault_tolerance.log
2022-03-10 18:39:15 fault_tolerance [7]: INFO [init] ClusterResource Parser is initialized ...
2022-03-10 18:39:15 fault_tolerance [7]: INFO [filter_event] This alert needs an attention: resource_type: node
2022-03-10 18:39:15 fault_tolerance [7]: INFO [process_event] SystemHealth: Processing node:node:23c332ba0b454f8e870e80b003bfb167 with status starting
2022-03-10 18:39:15 fault_tolerance [7]: INFO [process_message] Received the message from message bus: b'{"header": {"version": "1.0", "timestamp": "1646937555.878105", "event_id": "1646937555.87810536db70ee97894beb93c823ebe6c9ce6f"}, "payload": {"source": "monitor", "cluster_id": "", "site_id": "NOT_DEFINED", "rack_id": "NOT_DEFINED", "storageset_id": "NOT_DEFINED", "node_id": "053b6c480c454a789b9f8730159d1b43", "resource_type": "node", "resource_id": "053b6c480c454a789b9f8730159d1b43", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2222-685d69c6df-hpb65", "is_status": "True"}}}'

k8s monitor log :
2022-03-15 02:36:43 k8s_resource_monitor [8]: INFO [publish_alert] pod_monitor sending alert on message bus {"header": {"version": "1.0", "timestamp": "1647311803.5148876", "event_id": "1647311803.5148876db8a9ff771e8430dad4fa2b9d098c2b1"}, "payload": {"source": "monitor", "cluster_id": "", "site_id": "NOT_DEFINED", "rack_id": "NOT_DEFINED", "storageset_id": "NOT_DEFINED", "node_id": "373e44137a484616b541494e98ddfd4e", "resource_type": "node", "resource_id": "373e44137a484616b541494e98ddfd4e", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-5fc6f8c957-7jjs9"}}}
2022-03-15 02:36:43 k8s_resource_monitor [8]: INFO [publish_alert] pod_monitor sending alert on message bus {"header": {"version": "1.0", "timestamp": "1647311803.5997305", "event_id": "1647311803.5997305a56de202465944c5b4133cbda20b086e"}, "payload": {"source": "monitor", "cluster_id": "", "site_id": "NOT_DEFINED", "rack_id": "NOT_DEFINED", "storageset_id": "NOT_DEFINED", "node_id": "bb64fb98daaf4e0d813751a53d0da1e0", "resource_type": "node", "resource_id": "bb64fb98daaf4e0d813751a53d0da1e0", "resource_status": "online", "specific_info": {"generation_id": "cortx-data-ssc-vm-rhev4-2221-6897946bc6-tlnmd"}}}

health monitor :
2022-03-10 18:38:47 health_monitor [7]: INFO [evaluate] Evaluated action ['publish'] for key action/node/online
2022-03-10 18:38:47 health_monitor [7]: INFO [process_event] Evaluated {"source": "monitor", "event_id": "1646937526.69200923cfed457fb7b4e4b97f55c1e0a1a5517", "event_type": "online", "severity": "informational", "site_id": "NOT_DEFINED", "rack_id": "NOT_DEFINED", "cluster_id": "37671003fda5437fab71fe203a128abd", "storageset_id": "373e44137a484616b541494e98ddfd4e", "node_id": "373e44137a484616b541494e98ddfd4e", "host_id": "373e44137a484616b541494e98ddfd4e", "resource_type": "node", "timestamp": "1646937526.6920092", "resource_id": "373e44137a484616b541494e98ddfd4e", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-5fc6f8c957-7jjs9", "pod_restart": 0}} with action ['publish']
2022-03-10 18:38:47 health_monitor [7]: INFO [act] Default action handler with Event: 1646937526.69200923cfed457fb7b4e4b97f55c1e0a1a5517 and actions : ['publish']
2022-03-10 18:38:47 health_monitor [7]: INFO [publish] Sending action event {"header": {"version": "1.0", "timestamp": "1646937527.7339172", "event_id": "1646937527.7339172dca95bc9d8574b5b8a853690344b3e87"}, "payload": {"source": "monitor", "cluster_id": "37671003fda5437fab71fe203a128abd", "site_id": "NOT_DEFINED", "rack_id": "NOT_DEFINED", "storageset_id": "373e44137a484616b541494e98ddfd4e", "node_id": "373e44137a484616b541494e98ddfd4e", "resource_type": "node", "resource_id": "373e44137a484616b541494e98ddfd4e", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-5fc6f8c957-7jjs9", "pod_restart": 0}}} to component hare

Node fail health_monitor.log :
INFO [evaluate] Evaluated action ['publish'] for key action/node/failed
2022-03-10 18:43:41 health_monitor [7]: INFO [process_event] Evaluated {"source": "monitor", "event_id": "1646937821.17777013fed4229dae4469f9ce39b4ea8087a34", "event_type": "failed", "severity": "error", "site_id": "0xBAADFOOD", "rack_id": "0xBAADFOOD", "cluster_id": "37671003fda5437fab71fe203a128abd", "storageset_id": "23c332ba0b454f8e870e80b003bfb167", "node_id": "23c332ba0b454f8e870e80b003bfb167", "host_id": "23c332ba0b454f8e870e80b003bfb167", "resource_type": "node", "timestamp": "1646937821.1777701", "resource_id": "23c332ba0b454f8e870e80b003bfb167", "specific_info": {"generation_id": "cortx-data-ssc-vm-rhev4-2222-677555f8b9-ndvwd", "pod_restart": 0}} with action ['publish']
2022-03-10 18:43:41 health_monitor [7]: INFO [act] Default action handler with Event: 1646937821.17777013fed4229dae4469f9ce39b4ea8087a34 and actions : ['publish']
2022-03-10 18:43:41 health_monitor [7]: INFO [publish] Sending action event {"header": {"version": "1.0", "timestamp": "1646937821.3018024", "event_id": "1646937821.3018024c150dff3c79546a5b08bfb39aa89ae22"}, "payload": {"source": "monitor", "cluster_id": "37671003fda5437fab71fe203a128abd", "site_id": "0xBAADFOOD", "rack_id": "0xBAADFOOD", "storageset_id": "23c332ba0b454f8e870e80b003bfb167", "node_id": "23c332ba0b454f8e870e80b003bfb167", "resource_type": "node", "resource_id": "23c332ba0b454f8e870e80b003bfb167", "resource_status": "failed", "specific_info": {"generation_id": "cortx-data-ssc-vm-rhev4-2222-677555f8b9-ndvwd", "pod_restart": 0}}} to component hare
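
To spot events like the ones logged above where cluster_id is an empty string, a small checker along these lines could be run over the JSON messages. It is only a sketch: the function name `unset_id_fields`, the list of id fields, and the set of values treated as "unset" are assumptions for illustration and are not part of cortx-ha. It expects JSON text, so the Python dict output printed by the test script would need converting first.

```python
import json

# Id fields that appear in the event payloads in the logs.
ID_FIELDS = ("cluster_id", "site_id", "rack_id", "storageset_id", "node_id")

# Values treated as "not set" in this sketch (an assumption).
PLACEHOLDERS = {"", "None", "NOT_DEFINED"}


def unset_id_fields(raw_event):
    """Return the id fields of a JSON event that are empty or placeholders."""
    event = json.loads(raw_event)
    # Some messages wrap header/payload in a top-level "event" key.
    event = event.get("event", event)
    payload = event.get("payload", {})
    return [name for name in ID_FIELDS
            if str(payload.get(name, "")) in PLACEHOLDERS]
```

For the fault_tolerance.log message shown earlier, this would report cluster_id, site_id, rack_id and storageset_id as unset, while node_id passes.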

ArchanaLimaye commented 2 years ago

In the system health keys, the value for site_id and rack_id is currently 0xBAADFOOD. @shaileshsvc, is this OK, or should we use some other value here, e.g. NOT_DEFINED?

ArchanaLimaye commented 2 years ago

In the test case for "Testing K8s Not Publishing Duplicate Alerts", why are 'site_id', 'rack_id' and 'storageset_id' shown as 'None'? They are initialized to 0xBAADFOOD, right?

gauri-bhosale commented 2 years ago

In the test case for "Testing K8s Not Publishing Duplicate Alerts", why are 'site_id', 'rack_id' and 'storageset_id' shown as 'None'? They are initialized to 0xBAADFOOD, right?

The test case sets the mock value "None" for these three fields: https://github.com/Seagate/cortx-ha/blob/07468caec9563f6f9758145fa7e7b510d7c63af3/ha/test/integration/k8s_monitor/test_no_duplicate_alerts.py#L38

class MockAlert:

    @staticmethod
    def get_host_alert():
        # Alert without generation_id
        return {
            "event": {
                "header": {
                    "version": "1.0",
                    "timestamp": "112255",
                    "event_id": "11225568919c5ba8f88ad24cc18f632a75de115aca"
                },
                "payload": {
                    "source": "monitor",
                    "cluster_id": "None",
                    "site_id": "None",
                    "rack_id": "None",
                    "storageset_id": "None",
                    "node_id": "dummy.colo.seagate.com",
                    "resource_type": "host",
                    "resource_id": "dummy.colo.seagate.com",
                    "resource_status": "online",
                    "specific_info": {}
                }
            }
        }

gauri-bhosale commented 2 years ago

In the system health keys, the value for site_id and rack_id is currently 0xBAADFOOD. @shaileshsvc, is this OK, or should we use some other value here, e.g. NOT_DEFINED?

Changed to the string 'NOT_DEFINED'.