Seagate / cortx-ha

CORTX HA (High Availability) is responsible for keeping the CORTX solution available when a hardware component or software service fails. It handles the failover/failback control flow for affected services and stabilizes them across the CORTX cluster.
https://github.com/Seagate/cortx
GNU Affero General Public License v3.0

CORTX-28894 : Update HA mini provisioning to create system health key… #667

Closed akash2144 closed 2 years ago

akash2144 commented 2 years ago

…s for disk resources

Signed-off-by: Akash Dudhane <akash.a.dudhane@seagate.com>

Problem Statement

Design

Coding

Testing

Review Checklist

Documentation

Checklist for Author

Test Details:

With the latest code changes, the Consul keys are:

[root@cortx-ha-headless-svc /]# consul kv get -http-addr=consul-server.default.svc.cluster.local:8500 --recurse "cortx>ha"
cortx>ha>v1:
cortx>ha>v1>action>disk>failed:["publish"]
cortx>ha>v1>action>disk>online:["publish"]
cortx>ha>v1>action>node>failed:["publish"]
cortx>ha>v1>action>node>online:["publish"]
cortx>ha>v1>cluster_cardinality:{"num_nodes": 6, "node_list": ["113037cdcd2443bda5b6b47acd312421", "9b235fdd12774ea48c71dd7835461277", "623a56a21ab840fea48fafae92771e1d", "39561d21ffc04a3398b9cbd2e26b32ff", "32c0076d1ce14e229634a5c23cdafc12", "2aab06e9b2104623b7e74da0f3fbdfd4"]}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>health:{"events": [{"event_timestamp": "1647865470.8977857", "created_timestamp": "1647865471", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647865471", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>health:{"events": [{"event_timestamp": "1647865470.8977857", "created_timestamp": "1647865471", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647865471", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>health:{"events": [{"event_timestamp": "1647865470.8977857", "created_timestamp": "1647865471", "status": "degraded", "specific_info": null}], "action": {"modified_timestamp": "1647865471", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>113037cdcd2443bda5b6b47acd312421>cvg>cvg-01>disk>/dev/sdc>health:{"events": [{"event_timestamp": "1647865519", "created_timestamp": "1647865519", "status": "unknown", "specific_info": {"cvg_id": "cvg-01"}}], "action": {"modified_timestamp": "1647865519", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>113037cdcd2443bda5b6b47acd312421>cvg>cvg-01>disk>/dev/sdd>health:{"events": [{"event_timestamp": "1647865519", "created_timestamp": "1647865519", "status": "unknown", "specific_info": {"cvg_id": "cvg-01"}}], "action": {"modified_timestamp": "1647865519", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>113037cdcd2443bda5b6b47acd312421>cvg>cvg-01>disk>/dev/sde>health:{"events": [{"event_timestamp": "1647865519", "created_timestamp": "1647865519", "status": "unknown", "specific_info": {"cvg_id": "cvg-01"}}], "action": {"modified_timestamp": "1647865519", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>113037cdcd2443bda5b6b47acd312421>cvg>cvg-01>health:{"events": [{"event_timestamp": "1647865519", "created_timestamp": "1647865519", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865519", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>113037cdcd2443bda5b6b47acd312421>cvg>cvg-02>disk>/dev/sdf>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": {"cvg_id": "cvg-02"}}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>113037cdcd2443bda5b6b47acd312421>cvg>cvg-02>disk>/dev/sdg>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": {"cvg_id": "cvg-02"}}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>113037cdcd2443bda5b6b47acd312421>cvg>cvg-02>disk>/dev/sdh>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": {"cvg_id": "cvg-02"}}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>113037cdcd2443bda5b6b47acd312421>cvg>cvg-02>health:{"events": [{"event_timestamp": "1647865519", "created_timestamp": "1647865519", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865519", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>113037cdcd2443bda5b6b47acd312421>health:{"events": [{"event_timestamp": "1647865520.8414278", "created_timestamp": "1647865521", "status": "starting", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0762-68dfc746cd-j2w6g", "pod_restart": 0}}, {"event_timestamp": "1647865517", "created_timestamp": "1647865517", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865521", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>2aab06e9b2104623b7e74da0f3fbdfd4>cvg>cvg-01>disk>/dev/sdc>health:{"events": [{"event_timestamp": "1647865521", "created_timestamp": "1647865521", "status": "unknown", "specific_info": {"cvg_id": "cvg-01"}}], "action": {"modified_timestamp": "1647865521", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>2aab06e9b2104623b7e74da0f3fbdfd4>cvg>cvg-01>disk>/dev/sdd>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": {"cvg_id": "cvg-01"}}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>2aab06e9b2104623b7e74da0f3fbdfd4>cvg>cvg-01>disk>/dev/sde>health:{"events": [{"event_timestamp": "1647865521", "created_timestamp": "1647865521", "status": "unknown", "specific_info": {"cvg_id": "cvg-01"}}], "action": {"modified_timestamp": "1647865521", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>2aab06e9b2104623b7e74da0f3fbdfd4>cvg>cvg-01>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>2aab06e9b2104623b7e74da0f3fbdfd4>cvg>cvg-02>disk>/dev/sdf>health:{"events": [{"event_timestamp": "1647865521", "created_timestamp": "1647865521", "status": "unknown", "specific_info": {"cvg_id": "cvg-02"}}], "action": {"modified_timestamp": "1647865521", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>2aab06e9b2104623b7e74da0f3fbdfd4>cvg>cvg-02>disk>/dev/sdg>health:{"events": [{"event_timestamp": "1647865521", "created_timestamp": "1647865521", "status": "unknown", "specific_info": {"cvg_id": "cvg-02"}}], "action": {"modified_timestamp": "1647865521", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>2aab06e9b2104623b7e74da0f3fbdfd4>cvg>cvg-02>disk>/dev/sdh>health:{"events": [{"event_timestamp": "1647865521", "created_timestamp": "1647865521", "status": "unknown", "specific_info": {"cvg_id": "cvg-02"}}], "action": {"modified_timestamp": "1647865521", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>2aab06e9b2104623b7e74da0f3fbdfd4>cvg>cvg-02>health:{"events": [{"event_timestamp": "1647865521", "created_timestamp": "1647865521", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865521", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>2aab06e9b2104623b7e74da0f3fbdfd4>health:{"events": [{"event_timestamp": "1647865520.7422156", "created_timestamp": "1647865520", "status": "starting", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0763-db76f5f4f-zfsts", "pod_restart": 0}}, {"event_timestamp": "1647865519", "created_timestamp": "1647865519", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>32c0076d1ce14e229634a5c23cdafc12>cvg>cvg-01>disk>/dev/sdc>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": {"cvg_id": "cvg-01"}}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>32c0076d1ce14e229634a5c23cdafc12>cvg>cvg-01>disk>/dev/sdd>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": {"cvg_id": "cvg-01"}}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>32c0076d1ce14e229634a5c23cdafc12>cvg>cvg-01>disk>/dev/sde>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": {"cvg_id": "cvg-01"}}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>32c0076d1ce14e229634a5c23cdafc12>cvg>cvg-01>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>32c0076d1ce14e229634a5c23cdafc12>cvg>cvg-02>disk>/dev/sdf>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": {"cvg_id": "cvg-02"}}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>32c0076d1ce14e229634a5c23cdafc12>cvg>cvg-02>disk>/dev/sdg>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": {"cvg_id": "cvg-02"}}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>32c0076d1ce14e229634a5c23cdafc12>cvg>cvg-02>disk>/dev/sdh>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": {"cvg_id": "cvg-02"}}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>32c0076d1ce14e229634a5c23cdafc12>cvg>cvg-02>health:{"events": [{"event_timestamp": "1647865520", "created_timestamp": "1647865520", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865520", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>32c0076d1ce14e229634a5c23cdafc12>health:{"events": [{"event_timestamp": "1647865521.0039294", "created_timestamp": "1647865522", "status": "starting", "specific_info": {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0314-7f7d8bcc6c-kd7cx", "pod_restart": 0}}, {"event_timestamp": "1647865518", "created_timestamp": "1647865518", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865522", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>39561d21ffc04a3398b9cbd2e26b32ff>health:{"events": [{"event_timestamp": "1647865520.784368", "created_timestamp": "1647865521", "status": "starting", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0762-5786949454-9jzl2", "pod_restart": 0}}, {"event_timestamp": "1647865518", "created_timestamp": "1647865518", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865521", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>623a56a21ab840fea48fafae92771e1d>health:{"events": [{"event_timestamp": "1647865521.0802948", "created_timestamp": "1647865522", "status": "starting", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0314-5c54f98d4-fv82d", "pod_restart": 0}}, {"event_timestamp": "1647865518", "created_timestamp": "1647865518", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865522", "status": "pending"}, "attributes": {}}
cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367>site>NOT_DEFINED>rack>NOT_DEFINED>node>9b235fdd12774ea48c71dd7835461277>health:{"events": [{"event_timestamp": "1647865520.8792374", "created_timestamp": "1647865521", "status": "starting", "specific_info": {"generation_id": "cortx-server-ssc-vm-g4-rhev4-0763-68c947c544-w6cth", "pod_restart": 0}}, {"event_timestamp": "1647865518", "created_timestamp": "1647865518", "status": "unknown", "specific_info": null}], "action": {"modified_timestamp": "1647865521", "status": "pending"}, "attributes": {}}
cortx>ha>v1>events>disk>failed:["hare"]
cortx>ha>v1>events>disk>online:["hare"]
cortx>ha>v1>events>node>failed:["hare"]
cortx>ha>v1>events>node>online:["hare"]
cortx>ha>v1>events>subscribe>hare:["node>online", "node>failed", "disk>online", "disk>failed"]
cortx>ha>v1>message_type>hare:ha_event_hare
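
For anyone checking this output by hand, the following is a minimal sketch of how one `key:value` line from the listing above can be split into its resource hierarchy and current status. The function name `parse_health_line` and the constants are hypothetical and not taken from cortx-ha. The sketch also assumes the first entry in `events` is the most recent one, as the node entries in the sample suggest.

```python
import json

DELIM = ">"                               # key delimiter seen in the output above
PREFIX = "cortx>ha>v1>cortx>ha>system>"   # prefix of the system health keys above


def parse_health_line(line):
    """Split one 'key:json' line into (hierarchy dict, first event status)."""
    # The key contains no ':' in the sample, so the first ':' ends the key.
    key, _, raw_value = line.partition(":")
    if not key.startswith(PREFIX) or not key.endswith(DELIM + "health"):
        raise ValueError(f"not a system health key: {key}")
    # Drop the trailing "health"; what remains alternates type, id, type, id...
    parts = key[len(PREFIX):].split(DELIM)[:-1]
    hierarchy = dict(zip(parts[0::2], parts[1::2]))
    value = json.loads(raw_value)
    # Assumption: events[0] is the most recent event.
    return hierarchy, value["events"][0]["status"]


sample = (
    'cortx>ha>v1>cortx>ha>system>cluster>118196a1d1094388be697ba3db13e367'
    '>site>NOT_DEFINED>rack>NOT_DEFINED>node>113037cdcd2443bda5b6b47acd312421'
    '>cvg>cvg-01>disk>/dev/sdc>health:'
    '{"events": [{"event_timestamp": "1647865519", '
    '"created_timestamp": "1647865519", "status": "unknown", '
    '"specific_info": {"cvg_id": "cvg-01"}}], '
    '"action": {"modified_timestamp": "1647865519", "status": "pending"}, '
    '"attributes": {}}'
)

hierarchy, status = parse_health_line(sample)
print(hierarchy["cvg"], hierarchy["disk"], status)
```

Because the key is split on `>` rather than `/`, a disk identifier such as `/dev/sdc` stays intact as a single path component.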
akash2144 commented 2 years ago

Please check whether your disk event will ever carry specific_info, i.e. event['payload']['specific_info']. If it ever holds an object without a generation_id, the parser will break. generation_id applies only to node events, not disk events, so the check will fail as soon as a disk event carries any object in 'specific_info'. Let's change this as well.

file: /root/cortx-ha/ha/core/event_analyzer/parser/parser.py
line: 191
if specific_info and specific_info["generation_id"] and source == HEALTH_EVENT_SOURCES.MONITOR.value:

So you can apply this check only to node events.

This check was added in PR #664. So far we have not received any disk event from the monitor. https://github.com/Seagate/cortx-ha/pull/664/files#:~:text=if%20resource_type%20%3D%3D%20CLUSTER_ELEMENTS.NODE.value%3A
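
To illustrate the concern, here is a minimal sketch of a guarded lookup. It is not the actual parser.py code. The two enums are stand-ins whose string values are assumptions, and `extract_generation_id` is a hypothetical helper. Only the names `HEALTH_EVENT_SOURCES.MONITOR`, `CLUSTER_ELEMENTS.NODE`, `specific_info` and `generation_id` come from the discussion above.

```python
from enum import Enum


class HEALTH_EVENT_SOURCES(Enum):
    MONITOR = "monitor"   # assumed value, for illustration only


class CLUSTER_ELEMENTS(Enum):
    NODE = "node"         # assumed value, for illustration only
    DISK = "disk"         # assumed value, for illustration only


def extract_generation_id(resource_type, source, specific_info):
    """Return generation_id for node events from the monitor, else None."""
    if resource_type != CLUSTER_ELEMENTS.NODE.value:
        return None       # disk events never carry a generation_id
    if source != HEALTH_EVENT_SOURCES.MONITOR.value:
        return None
    if not specific_info:
        return None
    # .get() avoids a KeyError when the key is absent.
    return specific_info.get("generation_id")


disk_info = {"cvg_id": "cvg-01"}   # shape of specific_info on disk keys above

# The unguarded form specific_info["generation_id"] fails on such a payload.
try:
    disk_info["generation_id"]
except KeyError:
    print("unguarded lookup raises KeyError for a disk event")

print(extract_generation_id("disk", "monitor", disk_info))
print(extract_generation_id(
    "node", "monitor",
    {"generation_id": "cortx-data-ssc-vm-g4-rhev4-0762-68dfc746cd-j2w6g",
     "pod_restart": 0}))
```

Checking the resource type first keeps the generation_id logic away from disk payloads entirely, and `.get()` covers node payloads that lack the key.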

shaileshsvc commented 2 years ago

Please refresh the test output with the latest code; it should now show > as the delimiter.

akash2144 commented 2 years ago

Please refresh the test output with the latest code; it should now show > as the delimiter.

Done. Updated test results.

Resolved.

akash2144 commented 2 years ago

Can you please rebase the branch onto the latest fault_k8s?

It is already rebased onto the latest fault_k8s. Please let me know if you see any warnings on the PR.

ArchanaLimaye commented 2 years ago

@akash2144 from the test results,

  • It looks like your setup had only 4 pods (2 server and 2 data, I guess). Is this expected? There should be 3+3, right?
    I don't know why only 2 pods get created. I used the following Jenkins job for deployment on a 3-node cluster: https://eos-jenkins.colo.seagate.com/job/Cortx-Automation/job/RGW/job/setup-cortx-rgw-cluster/ With the same job on another 3-node cluster, 6 pods get created (3 server + 3 data). Test details are added in the PR description.

  • /dev/sda and /dev/sdb are missing. Is this expected as per cluster.conf?
    Yes. As per cluster.conf, we only have 2 data disks and 1 metadata disk per CVG:

    - devices:
        data:
        - /dev/sdd
        - /dev/sde
        metadata:
        - /dev/sdc
      num_data: 2
      num_metadata: 1
      name: cvg-01
      type: ios
    - devices:
        data:
        - /dev/sdg
        - /dev/sdh
        metadata:
        - /dev/sdf
      num_data: 2
      num_metadata: 1
      name: cvg-02
      type: ios
    cvg_count: 2
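
As a cross-check of the answer above, the sketch below lists the disks per CVG from the quoted layout and confirms that /dev/sda and /dev/sdb are not among them. The layout is hard-coded as a Python literal rather than read from cluster.conf. The nesting of `devices`, `data` and `metadata` is an assumption reconstructed from the quoted snippet, and `disks_by_cvg` is a hypothetical helper.

```python
# CVG layout as quoted in the reply above (nesting is an assumption).
cvgs = [
    {
        "name": "cvg-01",
        "type": "ios",
        "num_data": 2,
        "num_metadata": 1,
        "devices": {"data": ["/dev/sdd", "/dev/sde"], "metadata": ["/dev/sdc"]},
    },
    {
        "name": "cvg-02",
        "type": "ios",
        "num_data": 2,
        "num_metadata": 1,
        "devices": {"data": ["/dev/sdg", "/dev/sdh"], "metadata": ["/dev/sdf"]},
    },
]


def disks_by_cvg(cvg_list):
    """Map each CVG name to the sorted list of its metadata and data disks."""
    return {
        cvg["name"]: sorted(cvg["devices"]["metadata"] + cvg["devices"]["data"])
        for cvg in cvg_list
    }


expected = disks_by_cvg(cvgs)
for name, disks in expected.items():
    print(name, disks)

all_disks = {disk for disks in expected.values() for disk in disks}
print("/dev/sda" in all_disks, "/dev/sdb" in all_disks)
```

Under these assumptions each data node gets three disk health keys per CVG, which lines up with the /dev/sdc through /dev/sdh entries in the test output.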