Seagate / cortx-ha

CORTX HA (High Availability) is responsible for ensuring that the CORTX solution remains available when a hardware component or software service fails. It handles the failover/failback control flow for affected services and stabilizes them across the CORTX cluster.
https://github.com/Seagate/cortx
GNU Affero General Public License v3.0

CORTX-28915: Modify HA code to use changes from cortx-utils for Health Event creation #662

Closed gauri-bhosale closed 2 years ago

gauri-bhosale commented 2 years ago

Log Result : Fault_tolerance.log

2022-03-08 03:37:00 fault_tolerance [8]: INFO [filter_event] This alert needs an attention: resource_type: node
2022-03-08 03:37:00 fault_tolerance [8]: INFO [process_event] SystemHealth: Processing node:node:89e0d99c15a549ddbdffbe018e2ab928 with status online
2022-03-08 03:37:00 fault_tolerance [8]: INFO [process_message] Received the message from message bus: b'{"header": {"version": "1.0", "timestamp": "1646710620.26266", "event_id": "1646710620.26266cf04cbd032a8486097409128510a836c"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "97773057d6444bbebf731af3b017d1d6", "resource_type": "node", "resource_id": "97773057d6444bbebf731af3b017d1d6", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-8486d9f4dc-gp2n5", "is_status": "True"}}}'

2022-03-07 12:17:40 fault_tolerance [8]: INFO [process_event] SystemHealth: Processing node:node:89e0d99c15a549ddbdffbe018e2ab928 with status online
2022-03-07 12:17:40 fault_tolerance [8]: INFO [process_message] Received the message from message bus: b'{"header": {"version": "1.0", "timestamp": "1646655460.5213675", "event_id": "1646655460.52136755de50fa0f1ed4bb9bdaff352b43c59e5"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "97773057d6444bbebf731af3b017d1d6", "resource_type": "node", "resource_id": "97773057d6444bbebf731af3b017d1d6", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-8486d9f4dc-gp2n5"}}}'
2022-03-07 12:17:40 fault_tolerance [8]: INFO [init_evaluators] Initialize all the health evaluator elements
2022-03-07 12:17:40 fault_tolerance [8]: INFO [init_evaluators] HealthEvaluator ha.core.system_health.health_evaluators.cluster_health_evaluator.ClusterHealthEvaluator is initalized...
2022-03-07 12:17:40 fault_tolerance [8]: INFO [init_evaluators] HealthEvaluator ha.core.system_health.health_evaluators.site_health_evaluator.SiteHealthEvaluator is initalized...

k8s_monitor.log:
2022-03-07 12:18:35 k8s_resource_monitor [7]: INFO [publish_alert] pod_monitor sending alert on message bus {"header": {"version": "1.0", "timestamp": "1646655515.607267", "event_id": "1646655515.6072679368b55131cb4423b184597381d44b75"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "97773057d6444bbebf731af3b017d1d6", "resource_type": "node", "resource_id": "97773057d6444bbebf731af3b017d1d6", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-8486d9f4dc-gp2n5"}}}

health_monitor.log:
INFO [evaluate] Evaluated action ['publish'] for key action/node/online
2022-03-08 07:29:18 health_monitor [7]: INFO [process_event] Evaluated {"source": "", "event_id": "1646724555.937822626f326bf03ec41c7a60234b0a4ed46a9", "event_type": "online", "severity": "informational", "site_id": "0xBAADFOOD", "rack_id": "0xBAADFOOD", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "storageset_id": "97773057d6444bbebf731af3b017d1d6", "node_id": "97773057d6444bbebf731af3b017d1d6", "host_id": "97773057d6444bbebf731af3b017d1d6", "resource_type": "node", "timestamp": "1646724555.9378226", "resource_id": "97773057d6444bbebf731af3b017d1d6", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-8486d9f4dc-gp2n5", "is_status": "True"}} with action ['publish']
2022-03-08 07:29:18 health_monitor [7]: INFO [act] Default action handler with Event: 1646724555.937822626f326bf03ec41c7a60234b0a4ed46a9 and actions : ['publish']
2022-03-08 07:29:18 health_monitor [7]: INFO [publish] Sending action event {"header": {"version": "1.0", "timestamp": "1646724558.199793", "event_id": "1646724558.1997939d710554c1c54a14ba96195d5e330e81"}, "payload": {"source": "", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "site_id": "0xBAADFOOD", "rack_id": "0xBAADFOOD", "storageset_id": "97773057d6444bbebf731af3b017d1d6", "node_id": "97773057d6444bbebf731af3b017d1d6", "resource_type": "node", "resource_id": "97773057d6444bbebf731af3b017d1d6", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-8486d9f4dc-gp2n5", "is_status": "True"}}} to component hare
2022-03-08 07:29:18 health_monitor [7]: INFO [evaluate] Evaluated action [] for key action/rack/fault_resolved
2022-03-08 07:29:18 health_monitor [7]: INFO [evaluate] Evaluate

Node failed:
2022-03-08 07:33:12 health_monitor [7]: INFO [evaluate] Evaluated action ['publish'] for key action/node/failed
2022-03-08 07:33:12 health_monitor [7]: INFO [process_event] Evaluated {"source": "", "event_id": "1646724792.254001e8b2f99b1bed4ad28089825a5e2838e0", "event_type": "failed", "severity": "error", "site_id": "0xBAADFOOD", "rack_id": "0xBAADFOOD", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "storageset_id": "9b3dbeb08a444dc1a98c14025c4776d2", "node_id": "9b3dbeb08a444dc1a98c14025c4776d2", "host_id": "9b3dbeb08a444dc1a98c14025c4776d2", "resource_type": "node", "timestamp": "1646724792.254001", "resource_id": "9b3dbeb08a444dc1a98c14025c4776d2", "specific_info": {"generation_id": "cortx-data-ssc-vm-rhev4-2222-7d55d95c47-84lvr"}} with action ['publish']
2022-03-08 07:33:12 health_monitor [7]: INFO [act] Default action handler with Event: 1646724792.254001e8b2f99b1bed4ad28089825a5e2838e0 and actions : ['publish']
2022-03-08 07:33:12 health_monitor [7]: INFO [publish] Sending action event {"header": {"version": "1.0", "timestamp": "1646724792.5692775", "event_id": "1646724792.569277513d14148928648f4a8a32d616ce9e39b"}, "payload": {"source": "", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "site_id": "0xBAADFOOD", "rack_id": "0xBAADFOOD", "storageset_id": "9b3dbeb08a444dc1a98c14025c4776d2", "node_id": "9b3dbeb08a444dc1a98c14025c4776d2", "resource_type": "node", "resource_id": "9b3dbeb08a444dc1a98c14025c4776d2", "resource_status": "failed", "specific_info": {"generation_id": "cortx-data-ssc-vm-rhev4-2222-7d55d95c47-84lvr"}}} to component hare
2022-03-08 07:33:13 health_monitor [7]: INFO [evaluate] Evaluated action [] for key action/rack/threshold_breached:low
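For anyone who wants to inspect the events in these logs programmatically, here is a minimal sketch that pulls the embedded JSON objects out of a log line. The helper name `extract_json_payloads` is hypothetical and not part of cortx-ha, and the sample line is a trimmed copy of the "Sending action event" entry above. It relies only on the standard library `json.JSONDecoder.raw_decode`.

```python
import json


def extract_json_payloads(log_line):
    """Return every JSON object embedded in a log line, in order of appearance."""
    decoder = json.JSONDecoder()
    found = []
    pos = log_line.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and
            # returns it together with the index where it ended.
            obj, end = decoder.raw_decode(log_line, pos)
        except json.JSONDecodeError:
            # This brace does not start valid JSON; try the next one.
            pos = log_line.find("{", pos + 1)
            continue
        found.append(obj)
        pos = log_line.find("{", end)
    return found


# Trimmed sample based on the node-failed log entry (most fields omitted).
line = (
    "2022-03-08 07:33:12 health_monitor [7]: INFO [publish] Sending action event "
    '{"header": {"version": "1.0"}, "payload": {"node_id": '
    '"9b3dbeb08a444dc1a98c14025c4776d2", "resource_status": "failed"}} '
    "to component hare"
)
events = extract_json_payloads(line)
print(events[0]["payload"]["resource_status"])  # prints: failed
```

The bytes-literal form seen in fault_tolerance.log (`b'{...}'`) is handled the same way, since the scan simply starts at the first opening brace.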

cortx-admin commented 2 years ago

Can one of the admins verify this patch?

mariyappanp commented 2 years ago

ha/fault_tolerance/event.py is removed. Is there no impact on the HA test directory or unit tests? @gauri-bhosale, please also add the unit test results.

ArchanaLimaye commented 2 years ago

Please share testing details

ArchanaLimaye commented 2 years ago

PR is showing that ha/core/event_analyzer/parser/parser.py has conflicts with the base branch

indrajitzagade commented 2 years ago

@gauri-bhosale Please add the testing details to this PR.

ArchanaLimaye commented 2 years ago

Please confirm in testing that, after this change, events from both the monitor and Hare are created with all expected fields correctly populated.
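One way to make that check repeatable is sketched below. It assumes the expected fields are exactly the header and payload keys visible in the logged messages in this PR; that list is inferred from the logs, not taken from the cortx-utils schema, and `find_unpopulated_fields` is a hypothetical helper. The sample message is the k8s_monitor alert from the PR description.

```python
# Field names as they appear in the logged messages in this PR (an assumption,
# not the authoritative cortx-utils definition).
HEADER_FIELDS = ("version", "timestamp", "event_id")
PAYLOAD_FIELDS = (
    "source", "cluster_id", "site_id", "rack_id", "storageset_id",
    "node_id", "resource_type", "resource_id", "resource_status",
    "specific_info",
)


def find_unpopulated_fields(message):
    """List expected fields that are missing or hold an empty value."""
    problems = []
    for section, names in (("header", HEADER_FIELDS), ("payload", PAYLOAD_FIELDS)):
        body = message.get(section, {})
        for name in names:
            if name not in body:
                problems.append(f"{section}.{name}: missing")
            elif body[name] in ("", None, {}):
                problems.append(f"{section}.{name}: empty")
    return problems


# Alert published by pod_monitor, copied from k8s_monitor.log.
k8s_alert = {
    "header": {
        "version": "1.0",
        "timestamp": "1646655515.607267",
        "event_id": "1646655515.6072679368b55131cb4423b184597381d44b75",
    },
    "payload": {
        "source": "", "cluster_id": "", "site_id": "", "rack_id": "",
        "storageset_id": "",
        "node_id": "97773057d6444bbebf731af3b017d1d6",
        "resource_type": "node",
        "resource_id": "97773057d6444bbebf731af3b017d1d6",
        "resource_status": "online",
        "specific_info": {
            "generation_id": "cortx-server-ssc-vm-rhev4-2221-8486d9f4dc-gp2n5"
        },
    },
}

for problem in find_unpopulated_fields(k8s_alert):
    print(problem)
```

For this message the sketch reports `source`, `cluster_id`, `site_id`, `rack_id`, and `storageset_id` as empty. Whether an empty value is acceptable for a given producer is a design question the check itself cannot answer.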

gauri-bhosale commented 2 years ago

@gauri-bhosale Please add the testing details to this PR.

Added log details.

gauri-bhosale commented 2 years ago

PR is showing that ha/core/event_analyzer/parser/parser.py has conflicts with the base branch

Conflict resolved.

indrajitzagade commented 2 years ago

@gauri-bhosale As discussed, please implement the pending LLD comment: https://seagate-systems.atlassian.net/wiki/spaces/PUB/pages/891781277?focusedCommentId=916553923#comment-916553923

akash2144 commented 2 years ago

In the health_monitor logs, I can see different values for "storageset_id" and "generation_id" between the [process_event] log entry and the "Sending action event" log entry.

INFO [process_event] Evaluated {"source": "", "event_id": "1646653845.45331745eac0f31cfe84f31abbf107e050a74b4", "event_type": "online", "severity": "informational", "site_id": "1", "rack_id": "1", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "storageset_id": "89e0d99c15a549ddbdffbe018e2ab928", "node_id": "89e0d99c15a549ddbdffbe018e2ab928", "host_id": "89e0d99c15a549ddbdffbe018e2ab928", "resource_type": "node", "timestamp": "1646653845.4533174", "resource_id": "89e0d99c15a549ddbdffbe018e2ab928", "specific_info": {"generation_id": "cortx-data-ssc-vm-rhev4-2221-6f5dcf55f4-lqs6s"}} with action ['publish']
2022-03-07 11:50:45 health_monitor [8]: INFO [act] Default action handler with Event: 1646653845.45331745eac0f31cfe84f31abbf107e050a74b4 and actions : ['publish']
Sending action event {"header": {"version": "1.0", "timestamp": "1646678416.3081498", "event_id": "1646678416.308149871abae53283445ee88897f040793b3d0"}, "payload": {"source": "", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "site_id": "1", "rack_id": "1", "storageset_id": "eb4dd0e836454816a2b44961d3c9ef72", "node_id": "eb4dd0e836454816a2b44961d3c9ef72", "resource_type": "node", "resource_id": "eb4dd0e836454816a2b44961d3c9ef72", "resource_status": "online", "specific_info": {"specific_info": {"generation_id": null}}}} to component hare

Is this expected?
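To make the mismatch concrete, here is a small sketch that compares the fields the two log entries have in common. `diff_shared_fields` is a hypothetical helper, and the dictionaries are trimmed copies of the two entries quoted above.

```python
def diff_shared_fields(evaluated, published):
    """Map each key present in both dicts to (evaluated, published) when the values differ."""
    return {
        key: (evaluated[key], published[key])
        for key in evaluated.keys() & published.keys()
        if evaluated[key] != published[key]
    }


# Trimmed from the [process_event] entry.
evaluated = {
    "site_id": "1",
    "storageset_id": "89e0d99c15a549ddbdffbe018e2ab928",
    "node_id": "89e0d99c15a549ddbdffbe018e2ab928",
    "specific_info": {
        "generation_id": "cortx-data-ssc-vm-rhev4-2221-6f5dcf55f4-lqs6s"
    },
}

# Trimmed from the payload of the "Sending action event" entry.
published = {
    "site_id": "1",
    "storageset_id": "eb4dd0e836454816a2b44961d3c9ef72",
    "node_id": "eb4dd0e836454816a2b44961d3c9ef72",
    # Note the doubly nested specific_info and the null generation_id.
    "specific_info": {"specific_info": {"generation_id": None}},
}

mismatches = diff_shared_fields(evaluated, published)
print(sorted(mismatches))  # prints: ['node_id', 'specific_info', 'storageset_id']
```

Besides the differing IDs, the comparison also flags `specific_info`, which is nested one level deeper in the published payload and carries a null `generation_id`.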

gauri-bhosale commented 2 years ago

The logs were copied from different locations, which is why they mismatched. I have now changed this, retested node failure, and added the logs; please check. @akash2144

ArchanaLimaye commented 2 years ago

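@gauri-bhosale

  • In the existing logs, why is the "source" field empty in all the outputs?
  • storageset_id is not coming as NOT_DEFINED

Can you please check?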

gauri-bhosale commented 2 years ago

@gauri-bhosale

  • In the existing logs, why is the "source" field empty in all the outputs?
  • storageset_id is not coming as NOT_DEFINED

Can you please check?

We need to check why source is empty and why storageset_id has this value. I have created a separate ticket for that: https://jts.seagate.com/browse/CORTX-29282