canonical / data-platform-helpers

Code for data platform pip packages; for libs, please check data-platform-libs
Apache License 2.0

Add wait + tenacity when we restore network #20

Open phvalguima opened 5 days ago

phvalguima commented 5 days ago

Hi team, just came across this issue in one of our opensearch runs:

1) test_network cut is finishing up
2) Juju tries to reach the instance that was cut off, right before the test restores the network:

2024-09-12T22:52:07.3489453Z machine-1: 22:51:43 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [35e1b8] "machine-1" cannot open api: unable to connect to API: dial tcp 10.114.131.230:17070: connect: no route to host

3) Network restored
4) Unit set to failure right after, because of the failed attempt to run the update-status hook:

2024-09-12T22:52:07.3542875Z unit-opensearch-dashboards-1: 22:52:04 INFO juju.worker.uniter awaiting error resolution for "update-status" hook

5) wait_for_idle fails as we have a unit in error

Full CI run: https://github.com/canonical/opensearch-dashboards-operator/actions/runs/10831485374/job/30079586227

Proposal

Let's add a wait, retried with tenacity, after each restore_network_cut call:

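import tenacity

def restore_network_cut(...):
    ....
    try_wait_model_settles(wait_list)

# tenacity's decorator is retry(); the wait and stop strategies are passed as arguments
@tenacity.retry(wait=..., stop=...)
def try_wait_model_settles(wait_list):
    # Wait for the listed units; if none are specified, wait for the entire model to settle
    wait_for_idle(...)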
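To make the retry idea concrete, here is a minimal, dependency-free sketch of the pattern. It is an illustration only, not the helper's actual code: `retry_with_fixed_wait` is a hypothetical plain-Python stand-in for what a tenacity retry decorator would do, and `FakeModel` is a hypothetical object that simulates a model reporting a unit in error for the first few checks.

```python
import functools
import time


def retry_with_fixed_wait(attempts, delay_seconds):
    """Hypothetical stand-in for a tenacity-style retry decorator."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay_seconds)
        return wrapper
    return decorator


class FakeModel:
    """Hypothetical model: reports a unit in error for the first few checks."""

    def __init__(self, failing_checks):
        self.failing_checks = failing_checks
        self.calls = 0

    def wait_for_idle(self, wait_list=None):
        self.calls += 1
        if self.calls <= self.failing_checks:
            raise RuntimeError("unit in error")
        return "idle"


# delay_seconds=0 keeps the sketch fast; a real test would wait between attempts.
@retry_with_fixed_wait(attempts=5, delay_seconds=0)
def try_wait_model_settles(model, wait_list=None):
    # Wait for the listed units; with no list, wait for the whole model.
    return model.wait_for_idle(wait_list)
```

With this shape, a transient error right after the network comes back is retried instead of failing the test immediately, while a unit that stays in error still fails once the attempts run out.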

MiaAltieri commented 5 days ago

@phvalguima I could be wrong, but I think this change was made in this PR