Seagate / cortx-hare

CORTX Hare configures Motr object store, starts/stops Motr services, and notifies Motr of service and device faults.
https://github.com/Seagate/cortx
Apache License 2.0
13 stars, 80 forks

CORTX-29827: Add unit test simulating single node failure #2065

Closed: tshaffe1 closed this 2 years ago

tshaffe1 commented 2 years ago

This is a unit test that checks behavior on single-node failure. We invoke the handler for a state update indicating that a process is gone, then check that the appropriate components (drive, controller, node, etc.) are marked as failed in Consul. We also check that the failures are broadcast through hax.

This test covers only one code path (total node failure). Other scenarios (e.g. drive failure or service crash) follow different code paths and will require additional unit tests.
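For readers who want the shape of such a test without opening the diff, here is a minimal sketch of the pattern described above. It is not the code from this PR: `handle_process_gone`, the `TOPOLOGY` table, the `health/<fid>` key layout, and the `kv` / `ha_link` mocks are all hypothetical stand-ins for the hax handler, the Consul KV client, and the HA link.

```python
from unittest.mock import MagicMock

FAILED = "FAILED"
PROCESS_FID = "0x7200000000000001:0x15"

# Hypothetical topology: the hardware objects that sit behind one process.
TOPOLOGY = {
    PROCESS_FID: {
        "node": "0x6e00000000000001:0x3",
        "enclosure": "0x6500000000000001:0x4",
        "controllers": ["0x6300000000000001:0x5"],
        "drives": ["0x6b00000000000001:0x11"],
    }
}


def handle_process_gone(process_fid, kv, ha_link):
    """Hypothetical handler: mark dependent objects failed, then broadcast."""
    hw = TOPOLOGY[process_fid]
    fids = [hw["node"], hw["enclosure"], *hw["controllers"], *hw["drives"]]
    for fid in fids:
        # Assumed key layout; the real Consul keys are not shown in this thread.
        kv.put(f"health/{fid}", FAILED)
    ha_link.broadcast([(fid, FAILED) for fid in [process_fid, *fids]])


def test_process_failure():
    kv, ha_link = MagicMock(), MagicMock()
    handle_process_gone(PROCESS_FID, kv, ha_link)

    # Every dependent component must be marked failed in the KV store.
    marked = {c[0][0] for c in kv.put.call_args_list}
    assert marked == {
        "health/0x6e00000000000001:0x3",
        "health/0x6500000000000001:0x4",
        "health/0x6300000000000001:0x5",
        "health/0x6b00000000000001:0x11",
    }
    assert all(c[0][1] == FAILED for c in kv.put.call_args_list)

    # The failure must also be broadcast exactly once, including the process.
    ha_link.broadcast.assert_called_once()
    states = ha_link.broadcast.call_args[0][0]
    assert (PROCESS_FID, FAILED) in states
```

Mocking both collaborators keeps the test independent of a running Consul agent or Motr instance, which is presumably why a unit test is practical for this code path.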

vaibhavparatwar commented 2 years ago

@SwapnilGaonkar7 @mssawant @Shreya-18 please review

mssawant commented 2 years ago

@trshaffer, looks great, can you please post results of the test run?

tshaffe1 commented 2 years ago

The passing test is visible in the part of the 1 node deployment that ran successfully (https://eos-jenkins.colo.seagate.com/job/Cortx-PR-Build/job/Cortx-Deployment/job/Generic/job/hare/243/console). If I run the test myself, it gives the following output:

# pytest -rP test/test_failure.py
=================================================================== test session starts ====================================================================
platform linux -- Python 3.6.8, pytest-6.2.4, py-1.11.0, pluggy-0.13.1
rootdir: /root/cortx-hare/hax
plugins: timeout-1.4.2, aiohttp-0.3.0, mock-3.6.1, cov-3.0.0
collected 1 item

test/test_failure.py .                                                                                                                               [100%]

========================================================================== PASSES ==========================================================================
_____________________________________________________________ TestFailure.test_process_failure _____________________________________________________________
-------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------
DEBUG    hax:__init__.py:297 Broadcasting HA states [HAState(fid=0x7200000000000001:0x15, status=FAILED)] over ha_link
DEBUG    hax:__init__.py:513 Process fid=0x7200000000000001:0x15 encloses 1 services as follows: [FidWithType(fid=0x7300000000000001:0xe, service_type='ios')]
TRACE    hax:cache.py:79 CACHE: created. fn_name=Motr._generate_sub_disks
DEBUG    hax:__init__.py:550 proc fid=0x7200000000000001:0x15 encloses 1 disks as follows: [0x6b00000000000001:0x11]
DEBUG    hax:__init__.py:634 Notifying node status for process_fid=0x7200000000000001:0x15 state=ObjHealth.FAILED
DEBUG    hax:__init__.py:597 node_fid: 0x6e00000000000001:0x3 encl_fid: 0x6500000000000001:0x4 ctrl_fids: [0x6300000000000001:0x5] with state: ObjHealth.FAILED
==================================================================== 1 passed in 0.12s =====================================================================
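As a reading aid for the log above: the leading byte of each fid appears to be an ASCII type letter, and the object kinds can be read off the log messages themselves (process, service, disk, node, enclosure, controller). The helper below is a small sketch under that assumption. `fid_type_char` and the `KINDS` table are illustrative names, not part of hax.

```python
def fid_type_char(fid: str) -> str:
    """Return the top byte of the fid's first 64-bit half as an ASCII letter."""
    container, _key = fid.split(":")
    return chr(int(container, 16) >> 56)


# Object kinds as labelled by the log lines above (an inferred mapping).
KINDS = {
    "r": "process",
    "s": "service",
    "k": "disk",
    "n": "node",
    "e": "enclosure",
    "c": "controller",
}

for fid in (
    "0x7200000000000001:0x15",
    "0x7300000000000001:0xe",
    "0x6b00000000000001:0x11",
    "0x6e00000000000001:0x3",
    "0x6500000000000001:0x4",
    "0x6300000000000001:0x5",
):
    print(fid, "->", KINDS[fid_type_char(fid)])
```

Read this way, the log shows the failure fanning out from one process to its service, disk, node, enclosure, and controller.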

@mssawant did you also want full output of the other tests?

mssawant commented 2 years ago

@trshaffer, no, this looks good. Thanks.

vaibhavparatwar commented 2 years ago

retest this please

shailesh-vaidya commented 2 years ago

retest this please