Closed gauri-bhosale closed 2 years ago
Can one of the admins verify this patch?
ha/fault_tolerance/event.py is removed. Is there any impact on the HA test dir or the unit tests? @gauri-bhosale, please also add the unit test results.
Please share testing details
PR is showing that ha/core/event_analyzer/parser/parser.py has conflicts with the base branch
@gauri-bhosale Please add the testing details to this PR.
Please confirm in testing that events from both the monitor and Hare are created with all expected fields correctly populated after this change.
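A minimal sketch of how that confirmation could be automated. The field names are taken from the log excerpts in this PR; the helper `find_unpopulated_fields` is hypothetical and not part of the HA code base.

```python
import json

# Field names as they appear in the logged events in this PR.
# The helper below is a hypothetical illustration, not an HA API.
HEADER_FIELDS = ("version", "timestamp", "event_id")
PAYLOAD_FIELDS = (
    "source", "cluster_id", "site_id", "rack_id", "storageset_id",
    "node_id", "resource_type", "resource_id", "resource_status",
    "specific_info",
)


def find_unpopulated_fields(raw_event):
    """Return dotted names of expected fields that are missing or empty."""
    event = json.loads(raw_event)
    problems = []
    for section, fields in (("header", HEADER_FIELDS),
                            ("payload", PAYLOAD_FIELDS)):
        body = event.get(section, {})
        for name in fields:
            # Treat a missing key, null, empty string and empty dict alike.
            if body.get(name) in (None, "", {}):
                problems.append(f"{section}.{name}")
    return problems
```

An empty list means every expected field is present and non-empty. Whether an empty value is acceptable for a given field (for example `source`) is a design decision this sketch does not settle.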
@gauri-bhosale Please add the testing details to this PR.
Added log details.
PR is showing that ha/core/event_analyzer/parser/parser.py has conflicts with the base branch
conflict resolved
@gauri-bhosale As discussed, please implement the pending LLD comment: https://seagate-systems.atlassian.net/wiki/spaces/PUB/pages/891781277?focusedCommentId=916553923#comment-916553923
In the health_monitor logs, I can see different values for "storageset_id" and "generation_id" between the [process_event] log entry and the "Sending action event" log entry.
INFO [process_event] Evaluated {"source": "", "event_id": "1646653845.45331745eac0f31cfe84f31abbf107e050a74b4", "event_type": "online", "severity": "informational", "site_id": "1", "rack_id": "1", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "storageset_id": "89e0d99c15a549ddbdffbe018e2ab928", "node_id": "89e0d99c15a549ddbdffbe018e2ab928", "host_id": "89e0d99c15a549ddbdffbe018e2ab928", "resource_type": "node", "timestamp": "1646653845.4533174", "resource_id": "89e0d99c15a549ddbdffbe018e2ab928", "specific_info": {"generation_id": "cortx-data-ssc-vm-rhev4-2221-6f5dcf55f4-lqs6s"}} with action ['publish']
2022-03-07 11:50:45 health_monitor [8]: INFO [act] Default action handler with Event: 1646653845.45331745eac0f31cfe84f31abbf107e050a74b4 and actions : ['publish']
Sending action event {"header": {"version": "1.0", "timestamp": "1646678416.3081498", "event_id": "1646678416.308149871abae53283445ee88897f040793b3d0"}, "payload": {"source": "", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "site_id": "1", "rack_id": "1", "storageset_id": "eb4dd0e836454816a2b44961d3c9ef72", "node_id": "eb4dd0e836454816a2b44961d3c9ef72", "resource_type": "node", "resource_id": "eb4dd0e836454816a2b44961d3c9ef72", "resource_status": "online", "specific_info": {"specific_info": {"generation_id": null}}}} to component hare
Is this expected?
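For reference, a minimal sketch of how such a mismatch could be flagged automatically. The helper `diff_event_fields` is hypothetical; it assumes the evaluated event and the published payload have already been parsed into dicts, and it compares only fields that appear in both log entries.

```python
# Hypothetical helper: compares fields shared by the evaluated event
# and the payload of the published action event.
SHARED_FIELDS = (
    "cluster_id", "site_id", "rack_id", "storageset_id",
    "node_id", "resource_type", "resource_id", "specific_info",
)


def diff_event_fields(evaluated, published_payload, fields=SHARED_FIELDS):
    """Return {field: (evaluated_value, published_value)} for mismatches."""
    return {
        name: (evaluated.get(name), published_payload.get(name))
        for name in fields
        if evaluated.get(name) != published_payload.get(name)
    }
```

Applied to the two entries quoted above, this would report storageset_id, node_id, resource_id and specific_info as mismatched, while cluster_id, site_id, rack_id and resource_type agree.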
I had copied the logs from different locations, which is why they mismatched. I have changed that now, retested node failure, and added the logs; please check. @akash2144
@gauri-bhosale
- In the existing logs, why is the "source" field empty in all the outputs?
- storageset_id is not coming as NOT_DEFINED
Can you please check
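A rough sketch for surveying empty fields directly from the log files. Both helpers are hypothetical; the sketch assumes each log entry embeds at most one JSON object of interest, as in the excerpts in this PR.

```python
import json


def extract_json(log_line):
    """Return the first JSON object embedded in a log line, or None."""
    decoder = json.JSONDecoder()
    start = log_line.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value and ignores trailing text.
            obj, _end = decoder.raw_decode(log_line, start)
            return obj
        except json.JSONDecodeError:
            start = log_line.find("{", start + 1)
    return None


def empty_payload_fields(log_line):
    """List top-level payload fields whose value is empty or null."""
    event = extract_json(log_line)
    if not isinstance(event, dict):
        return []
    # Published events nest fields under "payload"; evaluated events do not.
    payload = event.get("payload", event)
    return [key for key, value in payload.items() if value in ("", None)]
```

Running `empty_payload_fields` over each line of health_monitor.log, fault_tolerance.log and k8s_monitor.log would show where "source" and the other identifiers first appear empty.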
We need to check why source is empty and what the storageset_id value should be. I created a separate ticket for that: https://jts.seagate.com/browse/CORTX-29282
Problem Statement
Design
Coding
Testing
Review Checklist
Documentation
Checklist for Author
Log result: fault_tolerance.log
2022-03-08 03:37:00 fault_tolerance [8]: INFO [filter_event] This alert needs an attention: resource_type: node
2022-03-08 03:37:00 fault_tolerance [8]: INFO [process_event] SystemHealth: Processing node:node:89e0d99c15a549ddbdffbe018e2ab928 with status online
2022-03-08 03:37:00 fault_tolerance [8]: INFO [process_message] Received the message from message bus: b'{"header": {"version": "1.0", "timestamp": "1646710620.26266", "event_id": "1646710620.26266cf04cbd032a8486097409128510a836c"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "97773057d6444bbebf731af3b017d1d6", "resource_type": "node", "resource_id": "97773057d6444bbebf731af3b017d1d6", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-8486d9f4dc-gp2n5", "is_status": "True"}}}'
2022-03-07 12:17:40 fault_tolerance [8]: INFO [process_event] SystemHealth: Processing node:node:89e0d99c15a549ddbdffbe018e2ab928 with status online
2022-03-07 12:17:40 fault_tolerance [8]: INFO [process_message] Received the message from message bus: b'{"header": {"version": "1.0", "timestamp": "1646655460.5213675", "event_id": "1646655460.52136755de50fa0f1ed4bb9bdaff352b43c59e5"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "97773057d6444bbebf731af3b017d1d6", "resource_type": "node", "resource_id": "97773057d6444bbebf731af3b017d1d6", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-8486d9f4dc-gp2n5"}}}'
2022-03-07 12:17:40 fault_tolerance [8]: INFO [init_evaluators] Initialize all the health evaluator elements
2022-03-07 12:17:40 fault_tolerance [8]: INFO [init_evaluators] HealthEvaluator ha.core.system_health.health_evaluators.cluster_health_evaluator.ClusterHealthEvaluator is initalized...
2022-03-07 12:17:40 fault_tolerance [8]: INFO [init_evaluators] HealthEvaluator ha.core.system_health.health_evaluators.site_health_evaluator.SiteHealthEvaluator is initalized...
k8s_monitor.log:
2022-03-07 12:18:35 k8s_resource_monitor [7]: INFO [publish_alert] pod_monitor sending alert on message bus {"header": {"version": "1.0", "timestamp": "1646655515.607267", "event_id": "1646655515.6072679368b55131cb4423b184597381d44b75"}, "payload": {"source": "", "cluster_id": "", "site_id": "", "rack_id": "", "storageset_id": "", "node_id": "97773057d6444bbebf731af3b017d1d6", "resource_type": "node", "resource_id": "97773057d6444bbebf731af3b017d1d6", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-8486d9f4dc-gp2n5"}}}
health_monitor.log:
INFO [evaluate] Evaluated action ['publish'] for key action/node/online
2022-03-08 07:29:18 health_monitor [7]: INFO [process_event] Evaluated {"source": "", "event_id": "1646724555.937822626f326bf03ec41c7a60234b0a4ed46a9", "event_type": "online", "severity": "informational", "site_id": "0xBAADFOOD", "rack_id": "0xBAADFOOD", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "storageset_id": "97773057d6444bbebf731af3b017d1d6", "node_id": "97773057d6444bbebf731af3b017d1d6", "host_id": "97773057d6444bbebf731af3b017d1d6", "resource_type": "node", "timestamp": "1646724555.9378226", "resource_id": "97773057d6444bbebf731af3b017d1d6", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-8486d9f4dc-gp2n5", "is_status": "True"}} with action ['publish']
2022-03-08 07:29:18 health_monitor [7]: INFO [act] Default action handler with Event: 1646724555.937822626f326bf03ec41c7a60234b0a4ed46a9 and actions : ['publish']
2022-03-08 07:29:18 health_monitor [7]: INFO [publish] Sending action event {"header": {"version": "1.0", "timestamp": "1646724558.199793", "event_id": "1646724558.1997939d710554c1c54a14ba96195d5e330e81"}, "payload": {"source": "", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "site_id": "0xBAADFOOD", "rack_id": "0xBAADFOOD", "storageset_id": "97773057d6444bbebf731af3b017d1d6", "node_id": "97773057d6444bbebf731af3b017d1d6", "resource_type": "node", "resource_id": "97773057d6444bbebf731af3b017d1d6", "resource_status": "online", "specific_info": {"generation_id": "cortx-server-ssc-vm-rhev4-2221-8486d9f4dc-gp2n5", "is_status": "True"}}} to component hare
2022-03-08 07:29:18 health_monitor [7]: INFO [evaluate] Evaluated action [] for key action/rack/fault_resolved
2022-03-08 07:29:18 health_monitor [7]: INFO [evaluate] Evaluate
Node failed:
2022-03-08 07:33:12 health_monitor [7]: INFO [evaluate] Evaluated action ['publish'] for key action/node/failed
2022-03-08 07:33:12 health_monitor [7]: INFO [process_event] Evaluated {"source": "", "event_id": "1646724792.254001e8b2f99b1bed4ad28089825a5e2838e0", "event_type": "failed", "severity": "error", "site_id": "0xBAADFOOD", "rack_id": "0xBAADFOOD", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "storageset_id": "9b3dbeb08a444dc1a98c14025c4776d2", "node_id": "9b3dbeb08a444dc1a98c14025c4776d2", "host_id": "9b3dbeb08a444dc1a98c14025c4776d2", "resource_type": "node", "timestamp": "1646724792.254001", "resource_id": "9b3dbeb08a444dc1a98c14025c4776d2", "specific_info": {"generation_id": "cortx-data-ssc-vm-rhev4-2222-7d55d95c47-84lvr"}} with action ['publish']
2022-03-08 07:33:12 health_monitor [7]: INFO [act] Default action handler with Event: 1646724792.254001e8b2f99b1bed4ad28089825a5e2838e0 and actions : ['publish']
2022-03-08 07:33:12 health_monitor [7]: INFO [publish] Sending action event {"header": {"version": "1.0", "timestamp": "1646724792.5692775", "event_id": "1646724792.569277513d14148928648f4a8a32d616ce9e39b"}, "payload": {"source": "", "cluster_id": "a14e59d5c9374714a2dc97523bf03e49", "site_id": "0xBAADFOOD", "rack_id": "0xBAADFOOD", "storageset_id": "9b3dbeb08a444dc1a98c14025c4776d2", "node_id": "9b3dbeb08a444dc1a98c14025c4776d2", "resource_type": "node", "resource_id": "9b3dbeb08a444dc1a98c14025c4776d2", "resource_status": "failed", "specific_info": {"generation_id": "cortx-data-ssc-vm-rhev4-2222-7d55d95c47-84lvr"}}} to component hare
2022-03-08 07:33:13 health_monitor [7]: INFO [evaluate] Evaluated action [] for key action/rack/threshold_breached:low