canonical / mysql-k8s-operator

A Charmed Operator for running MySQL on Kubernetes
https://charmhub.io/mysql-k8s
Apache License 2.0
8 stars 15 forks source link

DPE-3027 Retry policy for is_mysqld_running #356

Closed RafalSiwek closed 7 months ago

RafalSiwek commented 8 months ago

Issue

Charm deployment lands in an error state, mostly on GitHub Actions runners. The error is caused by config-changed hook failure when calling the is_mysqld_running method.

Error snippet out of `kubectl logs mysql-k8s` ``` unit-mysql-k8s-0: 16:48:16 ERROR unit.mysql-k8s/0.juju-log Uncaught exception while in charm code: Traceback (most recent call last): File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/./src/charm.py", line 786, in main(MySQLOperatorCharm) File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/main.py", line 436, in main _emit_charm_event(charm, dispatcher.event_name) File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/main.py", line 144, in _emit_charm_event event_to_emit.emit(*args, **kwargs) File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 351, in emit framework._emit(event) File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 853, in _emit self._reemit(event_path) File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 942, in _reemit custom_handler(event) File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/./src/charm.py", line 412, in _on_config_changed if not self._mysql.is_mysqld_running(): File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/src/mysql_k8s_helpers.py", line 564, in is_mysqld_running return self.container.exists(MYSQLD_SOCK_FILE) File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/model.py", line 2547, in exists self._pebble.list_files(str(path), itself=True) File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 2219, in list_files resp = self._request('GET', '/v1/files', query) File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1655, in _request response = self._request_raw(method, path, query, headers, data) File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1704, in _request_raw raise ConnectionError( ops.pebble.ConnectionError: Could not connect to Pebble: socket not found at '/charm/containers/mysql/pebble.socket' (container restarted?) unit-mysql-k8s-0: 16:48:16 ERROR juju.worker.uniter.operation hook "config-changed" (via hook dispatching script: dispatch) failed: exit status 1 ```

Solution

Wrap is_mysqld_running method in tenacity.retry to implement retries on ops.pebble.ConnectionError

paulomach commented 7 months ago

Hi @RafalSiwek , although the PR makes sense, I believe that it will mask another issue. Please take a look a this comment on the original issue, as I thinks there's a better solution

RafalSiwek commented 7 months ago

Hi @paulomach, thank you for reviewing it. That's a valid point. However, I believe that since this method serves as a kind of health check, it might benefit from incorporating error handling, timeouts, or a retry policy, especially for the testing profile. The code calling this method assumes an errorless response and transitions the hook into an "action-required state" when an exception occurs:

if not self._mysql.is_mysqld_running():
  # defer config-changed event until MySQL is running
  logger.debug("Deferring config-changed event until MySQL is running")
  event.defer()
  return

While the Pebble ops socket handler implements a timeout policy, it is not configurable. This issue addresses the lack of configurability.

Do you think implementing a testing profile-bound retry or handling ops.pebble.ConnectionError errors would be a compatible approach in this case?