canonical / mysql-k8s-operator

A Charmed Operator for running MySQL on Kubernetes
https://charmhub.io/mysql-k8s
Apache License 2.0

Charm temporarily goes to `Error` during deployment while handling `config-changed`, but eventually proceeds to `Active` #337

Closed: ca-scribner closed this issue 7 months ago

ca-scribner commented 9 months ago

Steps to reproduce

This might be reproducible in a smaller test suite, but I've only encountered it during the kfp-api charm's integration tests (`tox -e integration -- --model kubeflow`). I tried reproducing it by deploying only mysql-k8s, but did not see the same issue. Maybe there's a CPU-load-driven race condition?

Expected behavior

The charm progresses to `Active` without encountering `Error`.

Actual behavior

The charm hits `Error` during a `config-changed` hook while deploying. Over time, the error goes away and the charm proceeds to `Active`.

While the charm does eventually reach `Active`, going into `Error` on the way suggests there's a foreseeable failure that's uncaught. Plus, in my use case, it resulted in flaky CI, because we typically treat any `Error` state as a test failure and end the test right away.

Versions

Operating system:

Juju CLI: 3.1.6

Juju agent: 3.1.6

Charm revision: 8.0/edge

microk8s: 1.25-strict

Log output

kubectl pod logs from a local run:

```
$ kubectl logs mysql-k8s-0 -c charm
2023-11-17T20:40:57.676Z [pebble] HTTP API server listening on ":38812".
2023-11-17T20:40:57.676Z [pebble] Started daemon.
2023-11-17T20:40:57.682Z [pebble] POST /v1/services 4.454317ms 202
2023-11-17T20:40:57.682Z [pebble] Started default services with change 1.
2023-11-17T20:40:57.685Z [pebble] Service "container-agent" starting: /charm/bin/containeragent unit --data-dir /var/lib/juju --append-env "PATH=$PATH:/charm/bin" --show-log --charm-modified-version 0
2023-11-17T20:40:57.746Z [container-agent] 2023-11-17 20:40:57 INFO juju.cmd supercommand.go:56 running containerAgent [3.1.6 f6a66aa91eec620f5ac04a19d8c06bef03ae6228 gc go1.20.8]
2023-11-17T20:40:57.747Z [container-agent] starting containeragent unit command
2023-11-17T20:40:57.747Z [container-agent] containeragent unit "unit-mysql-k8s-0" start (3.1.6 [gc])
2023-11-17T20:40:57.747Z [container-agent] 2023-11-17 20:40:57 INFO juju.cmd.containeragent.unit runner.go:578 start "unit"
2023-11-17T20:40:57.747Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.upgradesteps worker.go:60 upgrade steps for 3.1.6 have already been run.
2023-11-17T20:40:57.749Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.probehttpserver server.go:157 starting http server on 127.0.0.1:65301
2023-11-17T20:40:57.775Z [container-agent] 2023-11-17 20:40:57 INFO juju.api apiclient.go:1073 cannot resolve "controller-service.controller-local-microk8s.svc.cluster.local": lookup controller-service.controller-local-microk8s.svc.cluster.local: operation was canceled
2023-11-17T20:40:57.775Z [container-agent] 2023-11-17 20:40:57 INFO juju.api apiclient.go:706 connection established to "wss://10.152.183.27:17070/model/ef974c79-cbe5-4b53-81ca-d316c305a28d/api"
2023-11-17T20:40:57.779Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.apicaller connect.go:163 [ef974c] "unit-mysql-k8s-0" successfully connected to "10.152.183.27:17070"
2023-11-17T20:40:57.801Z [container-agent] 2023-11-17 20:40:57 INFO juju.api apiclient.go:1073 cannot resolve "controller-service.controller-local-microk8s.svc.cluster.local": lookup controller-service.controller-local-microk8s.svc.cluster.local: operation was canceled
2023-11-17T20:40:57.802Z [container-agent] 2023-11-17 20:40:57 INFO juju.api apiclient.go:706 connection established to "wss://10.152.183.27:17070/model/ef974c79-cbe5-4b53-81ca-d316c305a28d/api"
2023-11-17T20:40:57.805Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.apicaller connect.go:163 [ef974c] "unit-mysql-k8s-0" successfully connected to "10.152.183.27:17070"
2023-11-17T20:40:57.829Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.migrationminion worker.go:142 migration phase is now: NONE
2023-11-17T20:40:57.829Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.logger logger.go:120 logger worker started
2023-11-17T20:40:57.834Z [container-agent] 2023-11-17 20:40:57 WARNING juju.worker.proxyupdater proxyupdater.go:241 unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
2023-11-17T20:40:57.843Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.leadership tracker.go:194 mysql-k8s/0 promoted to leadership of mysql-k8s
2023-11-17T20:40:57.849Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.caasupgrader upgrader.go:113 abort check blocked until version event received
2023-11-17T20:40:57.849Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.caasupgrader upgrader.go:119 unblocking abort check
2023-11-17T20:40:57.856Z [container-agent] 2023-11-17 20:40:57 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-mysql-k8s-0
2023-11-17T20:40:57.876Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.uniter uniter.go:363 unit "mysql-k8s/0" started
2023-11-17T20:40:57.879Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.uniter uniter.go:689 resuming charm install
2023-11-17T20:40:57.881Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.uniter.charm bundles.go:81 downloading ch:amd64/jammy/mysql-k8s-110 from API server
2023-11-17T20:40:57.881Z [container-agent] 2023-11-17 20:40:57 INFO juju.downloader download.go:109 downloading from ch:amd64/jammy/mysql-k8s-110
2023-11-17T20:40:58.063Z [container-agent] 2023-11-17 20:40:58 INFO juju.downloader download.go:92 download complete ("ch:amd64/jammy/mysql-k8s-110")
2023-11-17T20:40:58.187Z [container-agent] 2023-11-17 20:40:58 INFO juju.downloader download.go:172 download verified ("ch:amd64/jammy/mysql-k8s-110")
2023-11-17T20:41:07.684Z [pebble] Check "readiness" failure 1 (threshold 3): received non-20x status code 418
2023-11-17T20:41:08.652Z [container-agent] 2023-11-17 20:41:08 INFO juju.worker.uniter uniter.go:389 hooks are retried true
2023-11-17T20:41:08.740Z [container-agent] 2023-11-17 20:41:08 INFO juju.worker.uniter resolver.go:165 found queued "install" hook
2023-11-17T20:41:10.564Z [container-agent] 2023-11-17 20:41:10 INFO juju-log Running legacy hooks/install.
2023-11-17T20:41:12.196Z [container-agent] 2023-11-17 20:41:12 INFO juju.worker.uniter.operation runhook.go:186 ran "install" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:13.549Z [container-agent] 2023-11-17 20:41:13 INFO juju.worker.uniter.operation runhook.go:186 ran "restart-relation-created" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:14.699Z [container-agent] 2023-11-17 20:41:14 INFO juju.worker.uniter.operation runhook.go:186 ran "upgrade-relation-created" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:15.817Z [container-agent] 2023-11-17 20:41:15 INFO juju.worker.uniter.operation runhook.go:186 ran "database-peers-relation-created" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:15.840Z [container-agent] 2023-11-17 20:41:15 INFO juju.worker.uniter resolver.go:165 found queued "leader-elected" hook
2023-11-17T20:41:17.509Z [container-agent] 2023-11-17 20:41:17 INFO juju.worker.uniter.operation runhook.go:186 ran "leader-elected" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:17.679Z [pebble] Check "readiness" failure 2 (threshold 3): received non-20x status code 418
2023-11-17T20:41:18.961Z [container-agent] 2023-11-17 20:41:18 INFO juju.worker.uniter.operation runhook.go:186 ran "database-storage-attached" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:19.746Z [container-agent] 2023-11-17 20:41:19 ERROR juju-log Uncaught exception while in charm code:
2023-11-17T20:41:19.746Z [container-agent] Traceback (most recent call last):
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 1348, in do_open
2023-11-17T20:41:19.746Z [container-agent]     h.request(req.get_method(), req.selector, req.data, headers,
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1283, in request
2023-11-17T20:41:19.746Z [container-agent]     self._send_request(method, url, body, headers, encode_chunked)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
2023-11-17T20:41:19.746Z [container-agent]     self.endheaders(body, encode_chunked=encode_chunked)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
2023-11-17T20:41:19.746Z [container-agent]     self._send_output(message_body, encode_chunked=encode_chunked)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
2023-11-17T20:41:19.746Z [container-agent]     self.send(msg)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 976, in send
2023-11-17T20:41:19.746Z [container-agent]     self.connect()
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 272, in connect
2023-11-17T20:41:19.746Z [container-agent]     self.sock.connect(self.socket_path)
2023-11-17T20:41:19.746Z [container-agent] FileNotFoundError: [Errno 2] No such file or directory
2023-11-17T20:41:19.746Z [container-agent]
2023-11-17T20:41:19.746Z [container-agent] During handling of the above exception, another exception occurred:
2023-11-17T20:41:19.746Z [container-agent]
2023-11-17T20:41:19.746Z [container-agent] Traceback (most recent call last):
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1599, in _request_raw
2023-11-17T20:41:19.746Z [container-agent]     response = self.opener.open(request, timeout=self.timeout)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 519, in open
2023-11-17T20:41:19.746Z [container-agent]     response = self._open(req, data)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
2023-11-17T20:41:19.746Z [container-agent]     result = self._call_chain(self.handle_open, protocol, protocol +
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
2023-11-17T20:41:19.746Z [container-agent]     result = func(*args)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 286, in http_open
2023-11-17T20:41:19.746Z [container-agent]     return self.do_open(_UnixSocketConnection, req,  # type:ignore
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
2023-11-17T20:41:19.746Z [container-agent]     raise URLError(err)
2023-11-17T20:41:19.746Z [container-agent] urllib.error.URLError:
2023-11-17T20:41:19.746Z [container-agent]
2023-11-17T20:41:19.746Z [container-agent] During handling of the above exception, another exception occurred:
2023-11-17T20:41:19.746Z [container-agent]
2023-11-17T20:41:19.746Z [container-agent] Traceback (most recent call last):
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/./src/charm.py", line 786, in
2023-11-17T20:41:19.746Z [container-agent]     main(MySQLOperatorCharm)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/main.py", line 441, in main
2023-11-17T20:41:19.746Z [container-agent]     _emit_charm_event(charm, dispatcher.event_name)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
2023-11-17T20:41:19.746Z [container-agent]     event_to_emit.emit(*args, **kwargs)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 342, in emit
2023-11-17T20:41:19.746Z [container-agent]     framework._emit(event)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 839, in _emit
2023-11-17T20:41:19.746Z [container-agent]     self._reemit(event_path)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 928, in _reemit
2023-11-17T20:41:19.746Z [container-agent]     custom_handler(event)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/./src/charm.py", line 412, in _on_config_changed
2023-11-17T20:41:19.746Z [container-agent]     if not self._mysql.is_mysqld_running():
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/src/mysql_k8s_helpers.py", line 564, in is_mysqld_running
2023-11-17T20:41:19.746Z [container-agent]     return self.container.exists(MYSQLD_SOCK_FILE)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/model.py", line 2444, in exists
2023-11-17T20:41:19.746Z [container-agent]     self._pebble.list_files(str(path), itself=True)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 2104, in list_files
2023-11-17T20:41:19.746Z [container-agent]     resp = self._request('GET', '/v1/files', query)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1564, in _request
2023-11-17T20:41:19.746Z [container-agent]     response = self._request_raw(method, path, query, headers, data)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1612, in _request_raw
2023-11-17T20:41:19.746Z [container-agent]     raise ConnectionError(e.reason)
2023-11-17T20:41:19.746Z [container-agent] ops.pebble.ConnectionError: [Errno 2] No such file or directory
2023-11-17T20:41:20.037Z [container-agent] 2023-11-17 20:41:20 ERROR juju.worker.uniter.operation runhook.go:180 hook "config-changed" (via hook dispatching script: dispatch) failed: exit status 1
2023-11-17T20:41:20.039Z [container-agent] 2023-11-17 20:41:20 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2023-11-17T20:41:20.379Z [container-agent] 2023-11-17 20:41:20 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2023-11-17T20:41:21.659Z [container-agent] 2023-11-17 20:41:21 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2023-11-17T20:41:25.050Z [container-agent] 2023-11-17 20:41:25 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2023-11-17T20:41:26.123Z [container-agent] 2023-11-17 20:41:26 INFO juju.worker.uniter.operation runhook.go:186 ran "config-changed" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:26.134Z [container-agent] 2023-11-17 20:41:26 INFO juju.worker.uniter resolver.go:165 found queued "start" hook
2023-11-17T20:41:26.859Z [container-agent] 2023-11-17 20:41:26 INFO juju-log Running legacy hooks/start.
2023-11-17T20:41:27.679Z [pebble] Check "readiness" failure 3 (threshold 3): received non-20x status code 418
2023-11-17T20:41:27.679Z [pebble] Check "readiness" failure threshold 3 hit, triggering action
2023-11-17T20:41:27.996Z [container-agent] 2023-11-17 20:41:27 INFO juju.worker.uniter.operation runhook.go:186 ran "start" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:28.768Z [container-agent] 2023-11-17 20:41:28 INFO juju-log Setting up the logrotate configurations
2023-11-17T20:41:36.436Z [container-agent] 2023-11-17 20:41:36 INFO juju-log Configuring instance
```

GitHub runner test failure for the same issue.

github-actions[bot] commented 9 months ago

https://warthogs.atlassian.net/browse/DPE-3027

taurus-forever commented 8 months ago

The copy of the log: https://pastebin.canonical.com/p/WxDhkGWwkp/ (runner will be cleaned during the holiday).

RafalSiwek commented 8 months ago

I stumbled upon this error when running integration tests in my project.

After ssh'ing into the runners and running `kubectl logs mysql-k8s -c charm`, I found the cause of the error.

Logs snippet:

```
2024-01-11T16:48:16.462Z [container-agent] 2024-01-11 16:48:16 DEBUG jujuc server.go:222 running hook tool "juju-log" for mysql-k8s/0-config-changed-7837649463538750024
2024-01-11T16:48:16.462Z [container-agent] 2024-01-11 16:48:16 ERROR juju-log Uncaught exception while in charm code:
2024-01-11T16:48:16.462Z [container-agent] Traceback (most recent call last):
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/./src/charm.py", line 786, in
2024-01-11T16:48:16.462Z [container-agent]     main(MySQLOperatorCharm)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/main.py", line 436, in main
2024-01-11T16:48:16.462Z [container-agent]     _emit_charm_event(charm, dispatcher.event_name)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
2024-01-11T16:48:16.462Z [container-agent]     event_to_emit.emit(*args, **kwargs)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 351, in emit
2024-01-11T16:48:16.462Z [container-agent]     framework._emit(event)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 853, in _emit
2024-01-11T16:48:16.462Z [container-agent]     self._reemit(event_path)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 942, in _reemit
2024-01-11T16:48:16.462Z [container-agent]     custom_handler(event)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/./src/charm.py", line 412, in _on_config_changed
2024-01-11T16:48:16.462Z [container-agent]     if not self._mysql.is_mysqld_running():
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/src/mysql_k8s_helpers.py", line 564, in is_mysqld_running
2024-01-11T16:48:16.462Z [container-agent]     return self.container.exists(MYSQLD_SOCK_FILE)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/model.py", line 2547, in exists
2024-01-11T16:48:16.462Z [container-agent]     self._pebble.list_files(str(path), itself=True)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 2219, in list_files
2024-01-11T16:48:16.462Z [container-agent]     resp = self._request('GET', '/v1/files', query)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1655, in _request
2024-01-11T16:48:16.462Z [container-agent]     response = self._request_raw(method, path, query, headers, data)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1704, in _request_raw
2024-01-11T16:48:16.462Z [container-agent]     raise ConnectionError(
2024-01-11T16:48:16.462Z [container-agent] ops.pebble.ConnectionError: Could not connect to Pebble: socket not found at '/charm/containers/mysql/pebble.socket' (container restarted?)
2024-01-11T16:48:16.712Z [container-agent] 2024-01-11 16:48:16 ERROR juju.worker.uniter.operation runhook.go:180 hook "config-changed" (via hook dispatching script: dispatch) failed: exit status 1
2024-01-11T16:48:16.712Z [container-agent] 2024-01-11 16:48:16 DEBUG juju.machinelock machinelock.go:206 created rotating log file "/var/log/machine-lock.log" with max size 10 MB and max backups 5
2024-01-11T16:48:16.713Z [container-agent] 2024-01-11 16:48:16 DEBUG juju.machinelock machinelock.go:190 machine lock "machine-lock" released for mysql-k8s/0 uniter (run config-changed hook)
2024-01-11T16:48:16.713Z [container-agent] 2024-01-11 16:48:16 DEBUG juju.worker.uniter.operation executor.go:118 lock released for mysql-k8s/0
2024-01-11T16:48:16.714Z [container-agent] 2024-01-11 16:48:16 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2024-01-11T16:48:16.714Z [container-agent] 2024-01-11 16:48:16 DEBUG juju.worker.uniter agent.go:22 [AGENT-STATUS] error: hook failed: "config-changed"
2024-01-11T16:48:19.684Z [container-agent] 2024-01-11 16:48:19 DEBUG juju.worker.uniter pebblepoller.go:99 pebble still starting up on container "mysql": socket "/charm/containers/mysql/pebble.socket" not found
2024-01-11T16:48:21.935Z [container-agent] 2024-01-11 16:48:21 DEBUG juju.worker.uniter.remotestate watcher.go:824 workloadEvent enqueued for mysql-k8s/0: 0
2024-01-11T16:48:21.935Z [container-agent] 2024-01-11 16:48:21 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2024-01-11T16:48:23.528Z [container-agent] 2024-01-11 16:48:23 DEBUG juju.worker.uniter.remotestate watcher.go:799 storage attachment change for mysql-k8s/0: {storage-database-0 {2 alive true /var/lib/juju/storage/database/0}}
2024-01-11T16:48:23.528Z [container-agent] 2024-01-11 16:48:23 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2024-01-11T16:48:24.502Z [pebble] Check "readiness" failure 2 (threshold 3): received non-20x status code 418
2024-01-11T16:48:34.506Z [pebble] Check "readiness" failure 3 (threshold 3): received non-20x status code 418
2024-01-11T16:48:34.506Z [pebble] Check "readiness" failure threshold 3 hit, triggering action
2024-01-11T16:48:39.993Z [container-agent] 2024-01-11 16:48:39 DEBUG juju.worker.uniter.remotestate watcher.go:799 storage attachment change for mysql-k8s/0: {storage-database-0 {2 alive true /var/lib/juju/storage/database/0}}
2024-01-11T16:48:39.993Z [container-agent] 2024-01-11 16:48:39 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2024-01-11T16:48:44.503Z [pebble] Check "readiness" failure 4 (threshold 3): received non-20x status code 418
2024-01-11T16:48:54.502Z [pebble] Check "readiness" failure 5 (threshold 3): received non-20x status code 418
2024-01-11T16:48:55.003Z [container-agent] 2024-01-11 16:48:55 DEBUG juju.worker.uniter.remotestate watcher.go:799 storage attachment change for mysql-k8s/0: {storage-database-0 {2 alive true /var/lib/juju/storage/database/0}}
2024-01-11T16:48:55.003Z [container-agent] 2024-01-11 16:48:55 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
```

It looked like the `config-changed` hook failed because the charm could not find the Pebble socket for the mysql container, and according to the logs, after the hook failed it didn't auto-retry and instead awaited error resolution. The same error was posted in this issue.

When I ssh'd into the mysql-k8s-x pod containers, I was able to find the `/charm/containers/mysql/pebble.socket` file in the charm container and `/charm/container/pebble.socket` in the mysql container. After running `juju resolved mysql-k8s/0`, the charm went into `Active` state. After running the experiments and observing the runner resource metrics, I began to suspect that the cause of this behaviour may be the limited resources of the agents and, consequently, delayed availability of the sockets.

To keep my integration tests working, I implemented a temporary, not-so-elegant workaround that re-deploys the charm when it hits this error, which solved my issue. Still, I think it would be better to retry the container connection in the charm source, so I am testing this.

paulomach commented 7 months ago

Hi @RafalSiwek , what you are seeing is a symptom of the mysql container being OOMKilled, hence the pebble connection being lost.

I was checking your tests and saw that your bundle does not set any options for mysql-k8s charm.

The default behavior for memory allocation is somewhat grabby, which for testing on GH runners will most probably lead to an unstable outcome.

So, my advice is either to have a dedicated test_bundle.yaml or to set up the bundle configuration in the test, setting the `profile` option of the mysql-k8s charm, e.g.:

```diff
@@ -8,6 +8,8 @@
     scale: 1
     trust: true
     _github_repo_name: mysql-k8s-operator
+    options:
+      profile: testing
   zenml-server:
     charm: zenml-server
     channel: latest/edge
```

The `testing` profile is very lightweight, tuned for this use case.
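For reference, here is a minimal sketch of what such a test bundle entry might look like. The surrounding structure (channel, scale) is an assumption based on this thread, not copied from an actual bundle; the relevant part is that `options` is a mapping of option name to value:

```yaml
applications:
  mysql-k8s:
    charm: mysql-k8s
    channel: 8.0/edge
    scale: 1
    trust: true
    options:
      profile: testing  # memory-light profile for constrained CI runners
```

The same option can also be changed on an already-deployed application with `juju config mysql-k8s profile=testing`.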

paulomach commented 7 months ago

And by the way, it's encouraged to set limits using either k8s constraints or profiles on production deployments.

RafalSiwek commented 7 months ago

Hi @paulomach, thanks for the suggestions!

As the issue persisted, I decided to conduct more tests, including setting the profile to "testing". Unfortunately, the problem remained (you can review my testing path): the mysql container ran without any recorded kills or restarts, while the charm stayed stuck in the `config-changed` hook error state, awaiting manual intervention.

Now, with evidence suggesting that the Pebble socket connection error is causing the charm to enter this hook-error limbo state when calling the `is_mysqld_running` method, I'm considering whether incorporating error handling, timeouts, or a retry policy (especially for the testing profile) might be beneficial.
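As a rough illustration of such a retry policy (a sketch only; `wait_for_pebble` and its polling parameters are hypothetical, not part of the charm), a hook could poll the Pebble socket for a bounded time before giving up:

```python
import time


def wait_for_pebble(can_connect, timeout=30.0, interval=1.0, sleep=time.sleep):
    """Poll `can_connect` (e.g. ops' Container.can_connect, passed as a
    callable) until it returns True or `timeout` seconds elapse.

    Returns True if Pebble became reachable, False on timeout. The `sleep`
    parameter is injectable so the helper can be unit-tested without waiting.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if can_connect():
            return True
        sleep(interval)
    return False
```

If the deadline expires, the charm could then defer the event or set a waiting status instead of raising `ops.pebble.ConnectionError` out of the hook.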

paulomach commented 7 months ago

I believe this is due to a `config-changed` hook being handled in between the leader-elected and pebble-ready events. I've proposed a change on @RafalSiwek's PR#356 to address that.
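For context, the kind of guard this describes, deferring `config-changed` until the workload container's Pebble socket is up, might look roughly like the following. This is a sketch, not the actual change in PR#356; the stub classes stand in for ops' `Container` and event objects so the pattern is self-contained:

```python
class StubContainer:
    """Stand-in for ops.model.Container; only can_connect() matters here."""

    def __init__(self, pebble_ready):
        self._pebble_ready = pebble_ready

    def can_connect(self):
        return self._pebble_ready


class StubEvent:
    """Stand-in for an ops event; defer() requeues it for a later dispatch."""

    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


def on_config_changed(container, event):
    # If Pebble is not up yet (e.g. config-changed arrives between
    # leader-elected and pebble-ready), defer the event instead of
    # letting container.exists() raise ops.pebble.ConnectionError.
    if not container.can_connect():
        event.defer()
        return "deferred"
    # ...safe to talk to Pebble here (check the mysqld socket, etc.)...
    return "handled"
```

Juju re-emits deferred events on the next dispatch, so the hook retries once the socket exists rather than putting the unit into `Error`.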

orfeas-k commented 7 months ago

Note that it keeps failing consistently for us (kubeflow team) as well. You can see all 3 integration tests that deploy mysql-k8s have failed with the same error while we have constraints set.

paulomach commented 7 months ago

> Note that it keeps failing consistently for us (kubeflow team) as well. You can see all 3 integration tests that deploy mysql-k8s have failed with the same error while we have constraints set.

Hi @orfeas-k , I believe we got to the root cause now and tests are running for the PR.

orfeas-k commented 7 months ago

Great, thanks for the update. Let us know when the fix is published!

paulomach commented 7 months ago

@orfeas-k @ca-scribner @RafalSiwek fix published on 8.0/edge (rev119)

Please give it a try to confirm the fix and close this issue.

RafalSiwek commented 7 months ago

I tested the update on 4 concurrent runs (#1, #2, #3, #4), and all of them succeeded on the first try. It seems that this change has resolved the issue on my end.

orfeas-k commented 7 months ago

Tested and works! Thank you for implementing this. It would be great if you could let us know when this fix is published to stable so we can revert the changes we implemented.

P.S. @paulomach I don't seem to have permission to close this issue.

paulomach commented 7 months ago

> Tested and works! Thank you for implementing this. It would be great if you could let us know when this fix is published to stable so we can revert the changes we implemented.
>
> P.S. @paulomach I don't seem to have permission to close this issue.

Thank you. We will let you know when promoted to stable.