Open carlcsaposs-canonical opened 3 weeks ago
Not sure if the issue is in the mysql-router-k8s charm, the mysql-k8s charm, or elsewhere.
The issue seems to have happened after MySQL was offline, based on these logs:
unit-nova-mysql-router-0: 22:27:31 ERROR unit.nova-mysql-router/0.juju-log upgrade-version-a:44: Failed to run MySQL Shell script:
import json
import mysqlsh
import traceback
try:
    shell.connect("relation-75:***@nova-mysql-primary.openstack.svc.cluster.local:3306")
    import json
    cluster_set = dba.get_cluster_set()
    routers = cluster_set.list_routers()
    # mysqlsh objects are weird—they quack (i.e. duck typing) like standard Python objects (e.g. list,
    # dict), but do not serialize to JSON correctly.
    # Cast to str & load from JSON str before serializing
    routers = json.loads(str(routers))
    with open("/tmp/mysqlsh_output.json", "w") as file:
        json.dump(routers, file)
except mysqlsh.DBError as exception:
    error = {
        "message": str(exception),
        "code": exception.code,
        "traceback_message": "".join(traceback.format_exception(exception)),
    }
else:
    error = None
with open("/tmp/mysqlsh_error.json", "w") as file:
    json.dump(error, file)
MySQL client error 2013
MySQL Shell traceback:
Traceback (most recent call last):
File "<string>", line 6, in <module>
mysqlsh.DBError: MySQL Error (2013): Shell.connect: Lost connection to MySQL server at 'waiting for initial communication packet', system error: 110
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-nova-mysql-router-0/charm/src/mysql_shell/__init__.py", line 108, in _run_code
raise ShellDBError(**exception)
mysql_shell.ShellDBError: MySQL Error (2013): Shell.connect: Lost connection to MySQL server at 'waiting for initial communication packet', system error: 110
unit-nova-mysql-router-0: 22:27:31 ERROR unit.nova-mysql-router/0.juju-log upgrade-version-a:44: Uncaught exception while in charm code:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-nova-mysql-router-0/charm/./src/kubernetes_charm.py", line 185, in <module>
ops.main.main(KubernetesRouterCharm)
File "/var/lib/juju/agents/unit-nova-mysql-router-0/charm/venv/ops/main.py", line 441, in main
_emit_charm_event(charm, dispatcher.event_name)
File "/var/lib/juju/agents/unit-nova-mysql-router-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
event_to_emit.emit(*args, **kwargs)
File "/var/lib/juju/agents/unit-nova-mysql-router-0/charm/venv/ops/framework.py", line 344, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-nova-mysql-router-0/charm/venv/ops/framework.py", line 841, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-nova-mysql-router-0/charm/venv/ops/framework.py", line 930, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-nova-mysql-router-0/charm/src/abstract_charm.py", line 261, in reconcile
if not workload_.status or workload_.status == ops.WaitingStatus():
File "/var/lib/juju/agents/unit-nova-mysql-router-0/charm/src/workload.py", line 297, in status
if not self.shell.is_router_in_cluster_set(self._router_id):
File "/var/lib/juju/agents/unit-nova-mysql-router-0/charm/src/mysql_shell/__init__.py", line 229, in is_router_in_cluster_set
self._run_code(
File "/var/lib/juju/agents/unit-nova-mysql-router-0/charm/src/mysql_shell/__init__.py", line 108, in _run_code
raise ShellDBError(**exception)
mysql_shell.ShellDBError: MySQL Error (2013): Shell.connect: Lost connection to MySQL server at 'waiting for initial communication packet', system error: 110
unit-nova-mysql-router-0: 22:27:31 ERROR juju.worker.uniter.operation hook "upgrade-version-a-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
unit-nova-mysql-router-0: 22:27:31 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
unit-nova-mysql-router-0: 22:27:36 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
unit-nova-mysql-2: 22:27:48 ERROR unit.nova-mysql/2.juju-log database-peers:24: Failed to get cluster status for cluster-df07980f3ce9324ddf8ccee565b8eb7c
unit-nova-mysql-2: 22:27:48 ERROR unit.nova-mysql/2.juju-log database-peers:24: Failed to get cluster endpoints
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-nova-mysql-2/charm/src/mysql_k8s_helpers.py", line 836, in update_endpoints
rw_endpoints, ro_endpoints, offline = self.get_cluster_endpoints(get_ips=False)
File "/var/lib/juju/agents/unit-nova-mysql-2/charm/lib/charms/mysql/v0/mysql.py", line 1469, in get_cluster_endpoints
raise MySQLGetClusterEndpointsError("Failed to get endpoints from cluster status")
charms.mysql.v0.mysql.MySQLGetClusterEndpointsError: Failed to get endpoints from cluster status
unit-nova-mysql-router-2: 22:27:49 ERROR unit.nova-mysql-router/2.juju-log backend-database:75: Uncaught exception while in charm code:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/./src/kubernetes_charm.py", line 185, in <module>
ops.main.main(KubernetesRouterCharm)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/venv/ops/main.py", line 441, in main
_emit_charm_event(charm, dispatcher.event_name)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/venv/ops/main.py", line 149, in _emit_charm_event
event_to_emit.emit(*args, **kwargs)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/venv/ops/framework.py", line 344, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/venv/ops/framework.py", line 841, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/venv/ops/framework.py", line 930, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/lib/charms/data_platform_libs/v0/data_interfaces.py", line 1958, in _on_relation_changed_event
getattr(self.on, "database_created").emit(
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/venv/ops/framework.py", line 344, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/venv/ops/framework.py", line 841, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/venv/ops/framework.py", line 930, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/src/abstract_charm.py", line 256, in reconcile
workload_.enable(tls=self._tls_certificate_saved, unit_name=self.unit.name)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/src/workload.py", line 260, in enable
self._bootstrap_router(tls=tls)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/src/workload.py", line 203, in _bootstrap_router
self._container.run_mysql_router(command, timeout=30)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/src/container.py", line 134, in run_mysql_router
return self._run_command(args, timeout=timeout)
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/src/rock.py", line 162, in _run_command
output, _ = process.wait_output()
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/venv/ops/pebble.py", line 1354, in wait_output
exit_code: int = self._wait()
File "/var/lib/juju/agents/unit-nova-mysql-router-2/charm/venv/ops/pebble.py", line 1316, in _wait
raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- exec command "mysqlrouter" (timed out after 30s: context deadline exceeded)
----- Logs from task 0 -----
2024-06-07T22:27:49Z ERROR timed out after 30s: context deadline exceeded
-----
unit-nova-mysql-2: 22:27:49 INFO juju.worker.uniter.operation ran "database-peers-relation-changed" hook (via hook dispatching script: dispatch)
Wondering if the cluster's router metadata got mangled while the cluster was recovering from this outage, or during scale up.
MySQL Router shows no errors during bootstrap, so it should've been added to the cluster metadata at that point. Since we only have info-level logs, it's unclear how much time elapsed between bootstrap and the failure of the health check (which verifies the router is registered in the cluster metadata).
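For reference, that health check boils down to listing the routers registered in the ClusterSet metadata, the same call the failing script above makes. A minimal sketch of running the check manually in MySQL Shell's Python mode; the connection URI is a placeholder, not credentials from this deployment:

# Run inside `mysqlsh --py`; `shell` and `dba` are globals provided by MySQL Shell.
import json

# Placeholder URI -- substitute an admin user and the current primary host.
shell.connect("clusteradmin:***@nova-mysql-primary.openstack.svc.cluster.local:3306")

cluster_set = dba.get_cluster_set()
# list_routers() returns a shell object; round-trip through str/JSON to get plain Python types.
routers = json.loads(str(cluster_set.list_routers()))

# If the unit's router entry is absent from this output, the charm concludes the router
# was removed from the ClusterSet and sets the blocked status seen in this issue.
print(json.dumps(routers, indent=2))

If the entry really does disappear between bootstrap and this check, the interesting question is what removed it.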
@gboutry are you able to get steps to reproduce or provide more insight on possible ways to reproduce this?
From the original bug report, it appears that a scale up happened—do you know what was scaled up & how we can perform the same scale up while trying to reproduce?
Potentially related to https://github.com/canonical/mysql-router-operator/issues/118, but seems unlikely
@carlcsaposs-canonical We hit it again in SQA while testing Sunbeam. Here are logs from all openstack units https://oil-jenkins.canonical.com/artifacts/858c5749-bd1a-425a-b3bb-6c35b32a3d30/generated/generated/sunbeam/pods_openstack_logs.tgz
barbican-mysql-router/0 active idle 10.1.109.163
barbican-mysql-router/1* active idle 10.1.210.98
barbican-mysql-router/2 blocked idle 10.1.172.226 Router was manually removed from MySQL ClusterSet. Remove & re-deploy unit
We hit it twice in tens of tests so I cannot offer steps to reproduce it.
logs from barbican-mysql-router/2 show
2024-06-26T01:15:48.751Z [container-agent] 2024-06-26 01:15:48 ERROR juju-log backend-database:151: Failed to bootstrap router
2024-06-26T01:15:48.751Z [container-agent] logged_command=['--bootstrap', 'relation-151:***@barbican-mysql-primary:3306', '--strict', '--force', '--conf-set-option', 'http_server.bind_address=127.0.0.1', '--conf-use-gr-notifications']
2024-06-26T01:15:48.751Z [container-agent] stderr:
2024-06-26T01:15:48.751Z [container-agent] Error: The provided server is currently not in a InnoDB cluster group with quorum and thus may contain inaccurate or outdated data.
2024-06-26T01:15:48.751Z [container-agent]
2024-06-26T01:15:48.751Z [container-agent]
2024-06-26T01:15:48.754Z [container-agent] 2024-06-26 01:15:48 ERROR juju-log backend-database:151: MySQL Server does not have quorum. Will retry next Juju event
cc @paulomach @shayancanonical
from the logs, it looks like it might be a different issue
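For context, the error above comes from the mysqlrouter --bootstrap command itself (the logged_command in those lines), reporting that the target server's group had no quorum. A rough sketch of checking the cluster's quorum state the same way in MySQL Shell's Python mode; connection details are placeholders, and the status key names may differ slightly across shell versions:

# Run inside `mysqlsh --py`; `shell` and `dba` are globals provided by MySQL Shell.
import json

# Placeholder URI -- substitute an admin user and the unit the router tried to bootstrap against.
shell.connect("clusteradmin:***@barbican-mysql-primary:3306")

cluster = dba.get_cluster()
status = json.loads(str(cluster.status()))

# A non-"OK" value (e.g. "NO_QUORUM") here would match the bootstrap refusal above.
print(status["defaultReplicaSet"]["status"])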
> We hit it twice in tens of tests so I cannot offer steps to reproduce it.
@marosg42 do you have the logs from the other instance where it occurred? (Or are you referring to the original issue?) If so, I'd be curious to check whether we see the same logs in both instances.
@carlcsaposs-canonical Yes, I meant the original issue reported on Launchpad. Here are the logs: https://oil-jenkins.canonical.com/artifacts/b0040961-eb15-40f3-a89b-e49774c6f5d4/generated/generated/sunbeam/pods_openstack_logs.tgz
In the first case it was
nova-mysql-router/0* active idle 10.1.239.98
nova-mysql-router/1 active idle 10.1.109.143
nova-mysql-router/2 blocked idle 10.1.210.80 Router was manually removed from MySQL ClusterSet. Remove & re-deploy unit
got it, thanks!
@paulomach @shayancanonical FYI, at first glance these seem like two separate issues (likely in the mysql-k8s charm, especially in the 2nd case, but not sure).
@carlcsaposs-canonical I'm investigating next pulse. My feeling is that there are some cases where the router could re-bootstrap instead of entering the blocked state.
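To make that concrete, a toy sketch of the kind of decision being discussed; the function and its inputs are hypothetical, not the charm's actual code:

# Hypothetical decision sketch -- illustrative only, not the charm's actual API.
def decide_router_action(registered_in_cluster_set: bool, previously_bootstrapped: bool) -> str:
    """What the router charm could do when the ClusterSet-metadata health check runs."""
    if registered_in_cluster_set:
        return "no-op"  # entry still present, nothing to do
    if previously_bootstrapped:
        # Entry missing although this unit bootstrapped successfully earlier:
        # re-bootstrap instead of blocking permanently.
        return "re-bootstrap"
    # Never registered -> treat as manual removal from the ClusterSet.
    return "block: Router was manually removed from MySQL ClusterSet. Remove & re-deploy unit"


# The state reported in this issue: bootstrap succeeded, but the metadata entry is gone.
print(decide_router_action(registered_in_cluster_set=False, previously_bootstrapped=True))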
Steps to reproduce
Unavailable (see https://bugs.launchpad.net/snap-openstack/+bug/2068924)
Appears to be scale up of mysql-k8s & mysql-router-k8s
Expected behavior
All units active/idle
Actual behavior
nova-mysql-router/2 in BlockedStatus
Router was manually removed from MySQL ClusterSet. Remove & re-deploy unit
Versions
Juju agent: 3.4.3
mysql-k8s charm revision: 127
mysql-router-k8s charm revision: 96
Log output
https://oil-jenkins.canonical.com/artifacts/b0040961-eb15-40f3-a89b-e49774c6f5d4/index.html
Juju debug log: debug_log_openstack.txt
Juju status: juju_status_openstack.txt
Additional context
Initially reported to Data Platform by @gboutry