Closed: javacruft closed this issue 9 months ago
Waiting for https://github.com/canonical/data-platform-libs/issues/108 before investigating
It looks like the connection to MySQL server was quite unreliable
My interpretation of the logs for the failed unit:
2023-11-06T19:52:22.166Z [container-agent] 2023-11-06 19:52:22 ERROR juju-log backend-database:159: Failed to run logged_commands=["shell.connect('relation-159:***@heat-mysql-primary.openstack.svc.cluster.local:3306')", 'result = session.run_sql("SELECT USER, ATTRIBUTE->>\'$.router_id\' FROM INFORMATION_SCHEMA.USER_ATTRIBUTES WHERE ATTRIBUTE->\'$.created_by_user\'=\'relation-159\' AND ATTRIBUTE->\'$.created_by_juju_unit\'=\'heat-cfn-mysql-router/0\'")', 'print(result.fetch_all())']
2023-11-06T19:52:22.166Z [container-agent] stderr:
2023-11-06T19:52:22.166Z [container-agent] Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
2023-11-06T19:52:22.166Z [container-agent] Traceback (most recent call last):
2023-11-06T19:52:22.166Z [container-agent] File "<string>", line 1, in <module>
2023-11-06T19:52:22.166Z [container-agent] mysqlsh.DBError: MySQL Error (2003): Shell.connect: Can't connect to MySQL server on 'heat-mysql-primary.openstack.svc.cluster.local:3306' (111)
Router fails here when checking if an old router user+metadata needs to be cleaned up: https://github.com/canonical/mysql-router-k8s-operator/blob/1704b4e190e394cfaba6b68b06debd7ae2b9a606/src/workload.py#L230
2023-11-06T19:53:02.306Z [container-agent] 2023-11-06 19:53:02 ERROR juju-log backend-database:159: Failed to bootstrap router
2023-11-06T19:53:02.306Z [container-agent] logged_command=['--bootstrap', 'relation-159:***@heat-mysql-primary.openstack.svc.cluster.local:3306', '--strict', '--conf-set-option', 'http_server.bind_address=127.0.0.1', '--conf-use-gr-notifications']
2023-11-06T19:53:02.306Z [container-agent] stderr:
2023-11-06T19:53:02.306Z [container-agent] Error: The provided server is currently not in a InnoDB cluster group with quorum and thus may contain inaccurate or outdated data.
Router has succeeded in cleaning up the old router user & metadata (since it's failing a line later): https://github.com/canonical/mysql-router-k8s-operator/blob/1704b4e190e394cfaba6b68b06debd7ae2b9a606/src/workload.py#L231
2023-11-06T19:54:20.935Z [container-agent] 2023-11-06 19:54:20 ERROR juju-log backend-database:159: Failed to bootstrap router
2023-11-06T19:54:20.935Z [container-agent] logged_command=['--bootstrap', 'relation-159:***@heat-mysql-primary.openstack.svc.cluster.local:3306', '--strict', '--conf-set-option', 'http_server.bind_address=127.0.0.1', '--conf-use-gr-notifications']
2023-11-06T19:54:20.935Z [container-agent] stderr:
2023-11-06T19:54:20.935Z [container-agent] Error: It appears that a router instance named 'system' has been previously configured in this host. If that instance no longer exists, use the --force option to overwrite it.
While MySQL server was recovering, I'm guessing it overrode/reverted the changes mysql-router made to the router metadata (but not the router user)
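For triage, the three stderr messages quoted above can be told apart mechanically. The following is only a minimal sketch: the function name, the labels, and the pattern table are hypothetical and are not part of the charm; only the matched substrings are taken from the logs in this comment.

```python
import re

# Patterns copied from the stderr lines quoted in the logs above.
# The labels on the left are hypothetical names chosen for illustration.
FAILURE_PATTERNS = [
    ("server_unreachable", r"MySQL Error \(2003\)"),
    ("no_quorum", r"not in a InnoDB cluster group with quorum"),
    (
        "stale_router_metadata",
        r"router instance named '.*' has been previously configured",
    ),
]


def classify_failure(stderr: str) -> str:
    """Return a label for the first known pattern found in stderr."""
    for label, pattern in FAILURE_PATTERNS:
        if re.search(pattern, stderr):
            return label
    return "unknown"
```

Run against the three log excerpts in order, this yields the sequence unreachable server, no quorum, stale router metadata, which matches the interpretation given above.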
I believe the cause of this issue is the same as here: https://github.com/canonical/mysql-k8s-operator/issues/260#issuecomment-1674717593
MySQL server is providing connection information to MySQL Router when it is not ready to serve traffic (i.e. not in a quorum)
Router, when it sees the connection information, assumes that the cluster is available and that any operations router performs will be persisted. Router deletes the router user & router cluster metadata, assuming that if one of those changes goes through, both changes will go through (it deletes the user after the metadata as a safeguard). However, during server's recovery process, the user deletion goes through but the metadata deletion is reverted, causing router to fail to bootstrap.
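One way to picture a guard against this is to defer the destructive cleanup until the cluster is known to have quorum. The sketch below is not the charm's code: `FakeBackend`, `has_quorum`, and `cleanup_old_router` are hypothetical names, and the in-memory sets stand in for the real MySQL users and router metadata. It only illustrates the ordering described above (metadata first, then user) behind a quorum check.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBackend:
    """Hypothetical in-memory stand-in for the MySQL cluster (not a real API)."""

    has_quorum: bool
    users: set = field(default_factory=set)
    router_metadata: set = field(default_factory=set)


def cleanup_old_router(backend: FakeBackend, user: str, router_name: str) -> bool:
    """Delete stale router metadata, then the router user, only with quorum.

    Returns True if the cleanup ran and False if it was deferred.
    """
    if not backend.has_quorum:
        # Assumption: writes made without quorum may not persist, so skip
        # the cleanup and let the caller retry on a later event.
        return False
    # Same order as described above: metadata first, user second.
    backend.router_metadata.discard(router_name)
    backend.users.discard(user)
    return True
```

In a real deployment the boolean would have to come from an actual cluster status query, and the fix may equally belong on the server side (not publishing connection information until the cluster is in a quorum).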
Steps to reproduce
Failed multi-node test run from Canonical Solutions QA team.
Multi-node microstack deployment on baremetal with deployment in many-mysql mode - mysql per service.
The majority of mysql apps deploy and scale correctly; however, one mysql-router-k8s instance failed to bootstrap.
Expected behavior
All mysql-router-k8s units bootstrap correctly.
Actual behavior
Failure of single mysql-router-k8s unit.
Versions
Operating system: 22.04
Juju CLI: 3.2.3
Juju agent: 3.2.3
mysql-k8s charm revision: 99
mysql-router-k8s charm revision: 69
microk8s: 1.26-strict/stable
Log output
Logs from the failed deployment are linked from:
https://bugs.launchpad.net/snap-openstack/+bug/2042906
direct link:
https://oil-jenkins.canonical.com/artifacts/628e5903-4772-4a3e-9b0a-80cc04d3c6d3/index.html
Additional context
https://bugs.launchpad.net/snap-openstack/+bug/2042906