Open peterctl opened 2 months ago
Thanks for the detailed report, @peterctl; it will be useful for creating a test. We will queue it for a fix.
Facing this bug too
For documentation: if the pod is killed while mysql-pebble-ready
is running for the first time, we see the following error trace. I think there is more we can do to gracefully recover, but I am still gathering information about the error encountered below.
unit-mysql-k8s4-0: 21:44:26 INFO juju.downloader download verified ("ch:amd64/jammy/mysql-k8s-201")
unit-mysql-k8s4-0: 21:44:35 INFO juju.worker.uniter hooks are retried true
unit-mysql-k8s4-0: 21:44:35 INFO juju.worker.uniter found queued "install" hook
unit-mysql-k8s4-0: 21:44:37 INFO unit.mysql-k8s4/0.juju-log Running legacy hooks/install.
unit-mysql-k8s4-0: 21:44:38 INFO juju.worker.uniter.operation ran "install" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:44:39 INFO juju.worker.uniter.operation ran "upgrade-relation-created" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:44:40 INFO juju.worker.uniter.operation ran "restart-relation-created" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:44:41 INFO juju.worker.uniter.operation ran "database-peers-relation-created" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:44:41 INFO juju.worker.uniter found queued "leader-elected" hook
unit-mysql-k8s4-0: 21:44:43 INFO juju.worker.uniter.operation ran "leader-elected" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:44:43 INFO unit.mysql-k8s4/0.juju-log Setting up the logrotate configurations
unit-mysql-k8s4-0: 21:44:54 INFO unit.mysql-k8s4/0.juju-log Configuring instance
unit-mysql-k8s4-0: 21:44:55 INFO unit.mysql-k8s4/0.juju-log Installing plugin='audit_log'
unit-mysql-k8s4-0: 21:44:55 INFO unit.mysql-k8s4/0.juju-log Installing plugin='audit_log_filter'
unit-mysql-k8s4-0: 21:45:06 WARNING unit.mysql-k8s4/0.juju-log Failed to check if cluster metadata exists from_instance='mysql-k8s4-0.mysql-k8s4-endpoints.database.svc.cluster.local'
unit-mysql-k8s4-0: 21:45:06 INFO unit.mysql-k8s4/0.juju-log Creating cluster cluster-7d7613ead7f7225f76cd0d6003b5c3d5
unit-mysql-k8s4-0: 21:45:09 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-mysql-k8s4-0: 21:45:09 WARNING juju.worker.uniter.operation hook "mysql-pebble-ready" was terminated
unit-mysql-k8s4-0: 21:45:09 ERROR juju.worker.uniter pebble poll failed for container "mysql": failed to send pebble-ready event: terminated
unit-mysql-k8s4-0: 21:45:10 INFO unit.mysql-k8s4/0.juju-log Setting up the logrotate configurations
unit-mysql-k8s4-0: 21:45:10 INFO unit.mysql-k8s4/0.juju-log Reconciling the pebble layer
unit-mysql-k8s4-0: 21:45:10 ERROR unit.mysql-k8s4/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/./src/charm.py", line 928, in <module>
main(MySQLOperatorCharm)
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/main.py", line 551, in main
manager.run()
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/main.py", line 530, in run
self._emit()
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/main.py", line 519, in _emit
_emit_charm_event(self.charm, self.dispatcher.event_name)
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/main.py", line 147, in _emit_charm_event
event_to_emit.emit(*args, **kwargs)
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/framework.py", line 348, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/framework.py", line 860, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/framework.py", line 950, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 735, in wrapped_function
return callable(*args, **kwargs) # type: ignore
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/./src/charm.py", line 700, in _on_mysql_pebble_ready
self._reconcile_pebble_layer(container)
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 735, in wrapped_function
return callable(*args, **kwargs) # type: ignore
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/./src/charm.py", line 412, in _reconcile_pebble_layer
container.replan()
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/model.py", line 2259, in replan
self._pebble.replan_services()
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/pebble.py", line 2129, in replan_services
return self._services_action('replan', [], timeout, delay)
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/pebble.py", line 2226, in _services_action
raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "mysqld_safe" (cannot start service while terminating)
----- Logs from task 0 -----
2024-09-26T21:45:10Z ERROR cannot start service while terminating
-----
unit-mysql-k8s4-0: 21:45:10 ERROR juju.worker.uniter.operation hook "mysql-pebble-ready" (via hook dispatching script: dispatch) failed: exit status 1
unit-mysql-k8s4-0: 21:45:10 ERROR juju.worker.uniter pebble poll failed for container "mysql": failed to send pebble-ready event: hook failed
unit-mysql-k8s4-0: 21:45:12 INFO juju.worker.uniter.operation ran "database-storage-attached" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:45:12 INFO juju.worker.uniter unit "mysql-k8s4/0" shutting down: agent should be terminated
unit-mysql-k8s4-0: 21:45:28 INFO juju.cmd running containerAgent [3.5.3 63d460f9ee6c7c710131961390687e7a0ab90470 gc go1.21.12]
unit-mysql-k8s4-0: 21:45:28 INFO juju.cmd.containeragent.unit start "unit"
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.upgradesteps upgrade steps for 3.5.3 have already been run.
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.probehttpserver starting http server on 127.0.0.1:65301
unit-mysql-k8s4-0: 21:45:28 INFO juju.api cannot resolve "controller-service.controller-uk8s-3-5-3.svc.cluster.local": lookup controller-service.controller-uk8s-3-5-3.svc.cluster.local: operation was canceled
unit-mysql-k8s4-0: 21:45:28 INFO juju.api connection established to "wss://10.152.183.30:17070/model/31defe4d-ef81-431c-8d1c-376ecd60e1e7/api"
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.apicaller [31defe] "unit-mysql-k8s4-0" successfully connected to "10.152.183.30:17070"
unit-mysql-k8s4-0: 21:45:28 INFO juju.api connection established to "wss://controller-service.controller-uk8s-3-5-3.svc.cluster.local:17070/model/31defe4d-ef81-431c-8d1c-376ecd60e1e7/api"
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.apicaller [31defe] "unit-mysql-k8s4-0" successfully connected to "controller-service.controller-uk8s-3-5-3.svc.cluster.local:17070"
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.migrationminion migration migration phase is now: NONE
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.logger logger worker started
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.leadership mysql-k8s4/0 promoted to leadership of mysql-k8s4
unit-mysql-k8s4-0: 21:45:28 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
unit-mysql-k8s4-0: 21:45:28 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-mysql-k8s4-0
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.caasupgrader abort check blocked until version event received
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.caasupgrader unblocking abort check
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.uniter unit "mysql-k8s4/0" started
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.uniter hooks are retried true
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.uniter.charm downloading ch:amd64/jammy/mysql-k8s-201 from API server
unit-mysql-k8s4-0: 21:45:28 INFO juju.downloader downloading from ch:amd64/jammy/mysql-k8s-201
unit-mysql-k8s4-0: 21:45:29 INFO juju.downloader download complete ("ch:amd64/jammy/mysql-k8s-201")
unit-mysql-k8s4-0: 21:45:29 INFO juju.downloader download verified ("ch:amd64/jammy/mysql-k8s-201")
unit-mysql-k8s4-0: 21:45:38 INFO juju.worker.uniter found queued "upgrade-charm" hook
unit-mysql-k8s4-0: 21:45:40 INFO unit.mysql-k8s4/0.juju-log Running legacy hooks/upgrade-charm.
unit-mysql-k8s4-0: 21:45:40 ERROR unit.mysql-k8s4/0.juju-log Cluster upgrade failed, ensure pre-upgrade checks are ran first.
unit-mysql-k8s4-0: 21:45:41 INFO juju.worker.uniter.operation ran "upgrade-charm" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:45:41 INFO juju.worker.uniter found queued "config-changed" hook
unit-mysql-k8s4-0: 21:45:42 INFO juju.worker.uniter.operation ran "config-changed" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:45:42 INFO juju.worker.uniter found queued "start" hook
unit-mysql-k8s4-0: 21:45:43 INFO unit.mysql-k8s4/0.juju-log Running legacy hooks/start.
unit-mysql-k8s4-0: 21:45:44 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:45:45 INFO unit.mysql-k8s4/0.juju-log Setting up the logrotate configurations
unit-mysql-k8s4-0: 21:45:45 INFO unit.mysql-k8s4/0.juju-log Reconciling the pebble layer
unit-mysql-k8s4-0: 21:45:49 INFO juju.worker.uniter.operation ran "mysql-pebble-ready" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:45:51 INFO juju.worker.uniter.operation ran "upgrade-relation-changed" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:45:52 INFO juju.worker.uniter.operation ran "database-peers-relation-changed" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:49:41 ERROR unit.mysql-k8s4/0.juju-log Failed to get cluster status for cluster-7d7613ead7f7225f76cd0d6003b5c3d5
unit-mysql-k8s4-0: 21:49:41 ERROR unit.mysql-k8s4/0.juju-log Failed to get cluster endpoints
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/src/mysql_k8s_helpers.py", line 792, in update_endpoints
rw_endpoints, ro_endpoints, offline = self.get_cluster_endpoints(get_ips=False)
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 735, in wrapped_function
return callable(*args, **kwargs) # type: ignore
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/mysql/v0/mysql.py", line 1932, in get_cluster_endpoints
raise MySQLGetClusterEndpointsError("Failed to get endpoints from cluster status")
charms.mysql.v0.mysql.MySQLGetClusterEndpointsError: Failed to get endpoints from cluster status
unit-mysql-k8s4-0: 21:49:42 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:54:05 ERROR unit.mysql-k8s4/0.juju-log Failed to get cluster status for cluster-7d7613ead7f7225f76cd0d6003b5c3d5
unit-mysql-k8s4-0: 21:54:05 ERROR unit.mysql-k8s4/0.juju-log Failed to get cluster endpoints
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/src/mysql_k8s_helpers.py", line 792, in update_endpoints
rw_endpoints, ro_endpoints, offline = self.get_cluster_endpoints(get_ips=False)
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 735, in wrapped_function
return callable(*args, **kwargs) # type: ignore
File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/mysql/v0/mysql.py", line 1932, in get_cluster_endpoints
raise MySQLGetClusterEndpointsError("Failed to get endpoints from cluster status")
charms.mysql.v0.mysql.MySQLGetClusterEndpointsError: Failed to get endpoints from cluster status
unit-mysql-k8s4-0: 21:54:05 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
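The traceback earlier in the log shows `container.replan()` raising `ops.pebble.ChangeError` because Pebble refused to start `mysqld_safe` while the container was terminating. One way a charm could tolerate that kind of transient failure is to retry the replan a few times and otherwise give up cleanly (for example, by deferring the event until the next pebble-ready). The sketch below is illustrative only: `ChangeError` here is a stand-in class, not the real ops API, and `replan_with_retry` / `flaky_replan` are hypothetical names, not the mysql-k8s charm's actual code.

```python
import time


class ChangeError(Exception):
    """Stand-in for ops.pebble.ChangeError (hypothetical, for illustration)."""


def replan_with_retry(replan, attempts=3, delay=0.0):
    """Call a replan callable, retrying on a transient ChangeError.

    Returns True on success, False if every attempt failed; a charm
    could then defer the event instead of crashing with an uncaught
    exception, as seen in the log above.
    """
    for attempt in range(1, attempts + 1):
        try:
            replan()
            return True
        except ChangeError:
            if attempt == attempts:
                return False
            time.sleep(delay)


# Example: a replan that fails twice (as if the container were still
# terminating) and then succeeds on the third attempt.
calls = {"n": 0}


def flaky_replan():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ChangeError("cannot start service while terminating")


print(replan_with_retry(flaky_replan, attempts=3))  # -> True
```

Whether retrying is appropriate here depends on the charm's event model; deferring the pebble-ready event may be the cleaner option in a real charm.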
@peterctl I have tried, unsuccessfully, to replicate the issue locally (with rev 153 of mysql-k8s). I deployed a cluster of 3 nodes and tried at least 15 times to delete all pods in the cluster without hitting the issue. Furthermore, juju machines is not a valid command for a k8s model. Would you be able to share more about the environment in which you faced this issue?
Also, if you run into the above issue again, please collect the logs if you are able (in particular the debug logs, the pebble logs, and the mysqld logs in /var/log/mysql/error.log); the logs will help us better determine what is happening.
@shayancanonical as I mentioned, we are also facing this bug; it is in an ongoing deployment and the workaround does not help. I would like to raise a field-critical. Attachments: kubeflow-bundle.yaml.txt, var-log-mysql-error-log.txt, pebble-logs.txt, mlflow-mysql-pod-log.txt
In my case the charms restart/fail after the initial redeployment within the next day or two.
@natalytvinova can you please share your Juju version? Thank you!
Just got an internal reply. It should be Juju 3.5.1. This Juju version is affected by random unit restarts (until the units can no longer re-join the cluster); see: https://bugs.launchpad.net/juju/+bug/2075139. It is recommended to stay on Juju 3.4 for K8s production for now.
Killing multiple MySQL pods at the same time, without waiting for them to fully come back online, will put the cluster in an unhealthy state, but it will not trigger the reboot-cluster-from-complete-outage flow to recover the cluster. The same thing happens during a networking outage, which in this case was simulated by taking down all NICs on all the microk8s machines in a single AZ.
Steps to reproduce
b. Wait 15 minutes for the network outage to affect mysql, then reboot all the machines in the AZ to bring the network back online.
Expected behavior
The cluster will go offline and the reboot cluster from complete outage flow will be triggered to recover the cluster.
Actual behavior
The leader unit will not go offline and the reboot cluster flow will not be triggered, leaving the cluster in an inconsistent state.
Versions
Operating system: Ubuntu 22.04 Jammy Jellyfish Juju CLI: 3.4.3 Juju agent: 3.4.3 Charm revision: 153 (channel 8.0/stable) microk8s: 1.28
Additional context
To recover the cluster, the mysqld_safe Pebble service needs to be restarted inside the leader unit: