canonical / zookeeper-operator

Source for Zookeeper VM Charm
Apache License 2.0

Uncaught exception on `healthy` check #120

Closed: zmraul closed this issue 7 months ago

zmraul commented 7 months ago

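Handling an `update_status` event while the service is down raises an uncaught exception from the `healthy` check in src/workload.py. The `ruok` probe hits its 10s timeout and exits with code 124, which surfaces as an `ops.pebble.ExecError`, so the hook fails instead of finishing normally.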

This happened during a full cluster crash HA test on zookeeper-k8s.

Log output

INFO     pytest_operator.plugin:plugin.py:784 Model status:

Model         Controller                Cloud/Region        Version  SLA          Timestamp
test-ha-784h  github-pr-538bf-microk8s  microk8s/localhost  3.1.6    unsupported  11:09:27Z

App            Version  Status   Scale  Charm          Channel  Rev  Address         Exposed  Message
zookeeper-k8s           waiting      3  zookeeper-k8s             0  10.152.183.206  no       waiting for units to settle down

Unit              Workload  Agent  Address      Ports  Message
zookeeper-k8s/0*  active    idle   10.1.209.80         
zookeeper-k8s/1   active    idle   10.1.209.78         
zookeeper-k8s/2   error     idle   10.1.209.79         hook failed: "update-status"

INFO     pytest_operator.plugin:plugin.py:790 Juju error logs:

unit-zookeeper-k8s-0: 10:58:35 ERROR unit.zookeeper-k8s/0.juju-log Cluster upgrade failed, ensure pre-upgrade checks are ran first.
unit-zookeeper-k8s-0: 10:58:53 ERROR unit.zookeeper-k8s/0.juju-log zookeeper service is unreachable or not serving requests
unit-zookeeper-k8s-0: 10:59:02 ERROR unit.zookeeper-k8s/0.juju-log zookeeper service is unreachable or not serving requests
unit-zookeeper-k8s-2: 11:08:02 ERROR unit.zookeeper-k8s/2.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/./src/charm.py", line 457, in <module>
    main(ZooKeeperCharm)
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/./src/charm.py", line 229, in _on_cluster_relation_changed
    if self.state.unit_server.started and not self.workload.healthy:
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/venv/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/venv/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/venv/tenacity/__init__.py", line 314, in iter
    return fut.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/venv/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/src/workload.py", line 92, in healthy
    ruok_response = self.exec(command=timeout + ruok)
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/src/workload.py", line 60, in exec
    return str(self.container.exec(command, working_dir=working_dir).wait_output())
  File "/var/lib/juju/agents/unit-zookeeper-k8s-2/charm/venv/ops/pebble.py", line 1441, in wait_output
    raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
ops.pebble.ExecError: non-zero exit code 124 executing ['timeout', '10s', 'bash', '-c', "echo 'ruok' | (exec 3<>/dev/tcp/localhost/2181; cat >&3; cat <&3; exec 3<&-)"], stdout='', stderr=''
unit-zookeeper-k8s-2: 11:08:02 ERROR juju.worker.uniter.operation hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1
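One possible way to keep the hook from failing, sketched under assumptions rather than taken from the actual fix: treat a failed `ruok` probe as "unhealthy" instead of letting the `ExecError` propagate. The `is_healthy` function, the `run_ruok` callable, and the local `ExecError` class are hypothetical names for illustration. The class is only a stand-in so the sketch runs without `ops` installed; in the charm it would presumably be `ops.pebble.ExecError`.

```python
from typing import Callable, Sequence


class ExecError(Exception):
    """Stand-in for ops.pebble.ExecError (assumption: used only so this sketch is self-contained)."""

    def __init__(self, command: Sequence[str], exit_code: int, stdout: str = "", stderr: str = ""):
        super().__init__(f"non-zero exit code {exit_code} executing {list(command)!r}")
        self.command = command
        self.exit_code = exit_code
        self.stdout = stdout
        self.stderr = stderr


def is_healthy(run_ruok: Callable[[], str]) -> bool:
    """Return True only if the probe answers 'imok'; a failed probe yields False, not an exception."""
    try:
        response = run_ruok()
    except ExecError:
        # Exit code 124 (the `timeout` wrapper expired) or any other non-zero
        # exit means the service is not answering, so report it as unhealthy.
        return False
    return "imok" in response


def timed_out() -> str:
    # Hypothetical probe that mimics the failure in the log above (exit code 124).
    raise ExecError(["timeout", "10s", "ruok-probe"], 124)


print(is_healthy(timed_out))       # False
print(is_healthy(lambda: "imok"))  # True
```

With a check shaped like this, the caller can set a waiting or blocked status when the service is down and let `update-status` complete normally.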

github-actions[bot] commented 7 months ago

https://warthogs.atlassian.net/browse/DPE-3610