Seagate / cortx-ha

CORTX ha (High-Availability) is responsible for ensuring that CORTX Solution is available in case of any hardware component or software service failures. It takes care of failover/ failback control flow for affected services and stabilizes them across CORTX cluster.
https://github.com/Seagate/cortx
GNU Affero General Public License v3.0

CORTX-28605 Subscribe to disk events and changes to ActionHandler to send disk alerts to Hare #657

Closed ArchanaLimaye closed 2 years ago

ArchanaLimaye commented 2 years ago

Problem Statement

Handling needs to be added to subscribe for disk events from Hare and to send disk alerts to Hare.

Design

Changes added for Hare subscription for resource type disk in HA mini provisioning.

Action Handlers added for disk and cvg resource types.

LLD at https://seagate-systems.atlassian.net/wiki/spaces/PUB/pages/891781277/Action+Status+Update+LLD

Coding

Testing

https://jts.seagate.com/secure/attachment/506958/testing_details_CORTX-28605.txt

Review Checklist

Documentation

Checklist for Author

ArchanaLimaye commented 2 years ago

Testing details [Detailed logs added in https://jts.seagate.com/secure/attachment/506958/testing_details_CORTX-28605.txt]

  1. Testing the event subscription:

    Deployed CORTX on a 3N setup. Created the HA RPM and installed it in the HA pod.

    a. Checked the system health keys in consul using "consul kv get -http-addr=consul-server.default.svc.cluster.local:8500 --recurse cortx/ha/v1"
    b. Confirmed that the keys for disk resource are not present
    c. Executed ha_setup explicitly: /usr/lib/python3.6/site-packages/ha/k8s_setup/ha_setup.py config --services health_monitor --config yaml:///etc/cortx/cluster.conf
    d. Verified that the system health in consul shows all keys for disk subscription

    cortx/ha/v1:
    cortx/ha/v1/action/disk/failed:["publish"]
    cortx/ha/v1/action/disk/online:["publish"]
    cortx/ha/v1/events/disk/failed:["hare"]
    cortx/ha/v1/events/disk/online:["hare"]
    cortx/ha/v1/events/subscribe/hare:["node/online", "node/failed", "disk/online", "disk/failed"]
    cortx/ha/v1/message_type/hare:ha_event_hare
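The check in step d can also be scripted. The following is a minimal sketch, not part of this PR: it assumes the plain `key:value` text output of `consul kv get --recurse` shown above, and the helper names `parse_consul_kv` and `missing_disk_keys` are hypothetical.

```python
import json

# Sample taken from the consul listing in this comment.
SAMPLE = """\
cortx/ha/v1:
cortx/ha/v1/action/disk/failed:["publish"]
cortx/ha/v1/action/disk/online:["publish"]
cortx/ha/v1/events/disk/failed:["hare"]
cortx/ha/v1/events/disk/online:["hare"]
cortx/ha/v1/events/subscribe/hare:["node/online", "node/failed", "disk/online", "disk/failed"]
cortx/ha/v1/message_type/hare:ha_event_hare
"""

# Keys expected once the disk subscription is in place.
EXPECTED_DISK_KEYS = [
    "cortx/ha/v1/action/disk/failed",
    "cortx/ha/v1/action/disk/online",
    "cortx/ha/v1/events/disk/failed",
    "cortx/ha/v1/events/disk/online",
]


def parse_consul_kv(output):
    """Parse 'key:value' lines into a dict (hypothetical helper).

    Assumes keys contain no ':' character. Values that are valid JSON
    are decoded; anything else is kept as a plain string.
    """
    entries = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, raw = line.partition(":")
        try:
            entries[key] = json.loads(raw)
        except json.JSONDecodeError:
            entries[key] = raw
    return entries


def missing_disk_keys(entries):
    """Return the expected disk keys that are absent from the listing."""
    return [key for key in EXPECTED_DISK_KEYS if key not in entries]


if __name__ == "__main__":
    missing = missing_disk_keys(parse_consul_kv(SAMPLE))
    print("missing disk keys:", missing)
```

Run against the listing captured before step c, the same function would be expected to report all four disk keys as missing.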

  2. Testing publish of disk event to Hare

    a. Using test code, created a disk online system health event and published it on the message bus for message_type "health_events"
    b. Verified by looking at the health_monitor.log file that this event is
       • Received by health monitor
       • Processed and sent out to Hare
    c. Did the above steps for disk online and disk failed and verified that for both events, an alert is sent to Hare
    d. Similarly created cvg online and cvg failed events and published them on the message bus for message_type "health_events"
    e. Verified that the event is
       • Received by health monitor
       • Not sent out to Hare, since subscription is not done for the CVG resource type
    f. Explicitly subscribed for the cvg resource in ha_setup and re-executed ha_setup
    g. Confirmed that after subscription, cvg online and failed events are sent to Hare
    h. Created and sent an event for disk with resource_status "unknown". Confirmed that the event is parsed by health monitor but not sent to Hare
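The behaviour observed in this test can be summarised as a lookup against the subscription keys in consul. The sketch below is illustrative only and is not the cortx-ha implementation: the key pattern is taken from the consul listing in this comment, while `subscribers_for`, the in-memory `kv` dict, and the event field names other than resource_status are assumptions.

```python
# Key pattern as seen in the consul listing, e.g. cortx/ha/v1/events/disk/online
EVENT_KEY = "cortx/ha/v1/events/{resource_type}/{status}"


def subscribers_for(kv, event):
    """Return the components subscribed to this event (hypothetical helper).

    kv is a dict standing in for the consul KV store. An event with no
    matching subscription key yields an empty list, i.e. it is received
    and parsed but not forwarded to anyone.
    """
    key = EVENT_KEY.format(
        resource_type=event["resource_type"],
        status=event["resource_status"],
    )
    return kv.get(key, [])


# Subscriptions present after ha_setup was run with the disk changes.
kv = {
    "cortx/ha/v1/events/disk/online": ["hare"],
    "cortx/ha/v1/events/disk/failed": ["hare"],
}

disk_online = {"resource_type": "disk", "resource_status": "online"}
cvg_online = {"resource_type": "cvg", "resource_status": "online"}
disk_unknown = {"resource_type": "disk", "resource_status": "unknown"}

if __name__ == "__main__":
    for event in (disk_online, cvg_online, disk_unknown):
        print(event, "->", subscribers_for(kv, event))
```

Under these assumptions, a disk online event resolves to Hare, while cvg events and a disk event with status "unknown" resolve to no subscriber until a matching key is added.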
ArchanaLimaye commented 2 years ago

@mukhtar-inamdar

Just 1 query: why is subscription for CVG not made, as you are committing other changes related to CVG?

As per the LLD, HA is not going to send any CVG event to Hare, so subscription should not be done for them.