canonical / mysql-k8s-operator

A Charmed Operator for running MySQL on Kubernetes
https://charmhub.io/mysql-k8s
Apache License 2.0

mysql-router-k8s: bootstrap failure #345

Closed by javacruft 9 months ago

javacruft commented 11 months ago

Steps to reproduce

Failed multi-node test run from Canonical Solutions QA team.

Multi-node microstack deployment on bare metal, deployed in many-mysql mode (one mysql per service).

The majority of the mysql apps deploy and scale correctly; however, one mysql-router-k8s instance failed to bootstrap.

Expected behavior

All mysql-router-k8s units bootstrap correctly.

Actual behavior

A single mysql-router-k8s unit fails to bootstrap.

Versions

Operating system: 22.04

Juju CLI: 3.2.3

Juju agent: 3.2.3

mysql-k8s charm revision: 99

mysql-router-k8s charm revision: 69

microk8s: 1.26-strict/stable

Log output

Logs from the failed deployment are linked from:

https://bugs.launchpad.net/snap-openstack/+bug/2042906

direct link:

https://oil-jenkins.canonical.com/artifacts/628e5903-4772-4a3e-9b0a-80cc04d3c6d3/index.html

Additional context

https://bugs.launchpad.net/snap-openstack/+bug/2042906

github-actions[bot] commented 11 months ago

https://warthogs.atlassian.net/browse/DPE-2895

carlcsaposs-canonical commented 11 months ago

Waiting for https://github.com/canonical/data-platform-libs/issues/108 before investigating

carlcsaposs-canonical commented 10 months ago

It looks like the connection to MySQL server was quite unreliable

My interpretation of the logs for the failed unit:

2023-11-06T19:52:22.166Z [container-agent] 2023-11-06 19:52:22 ERROR juju-log backend-database:159: Failed to run logged_commands=["shell.connect('relation-159:***@heat-mysql-primary.openstack.svc.cluster.local:3306')", 'result = session.run_sql("SELECT USER, ATTRIBUTE->>\'$.router_id\' FROM INFORMATION_SCHEMA.USER_ATTRIBUTES WHERE ATTRIBUTE->\'$.created_by_user\'=\'relation-159\' AND ATTRIBUTE->\'$.created_by_juju_unit\'=\'heat-cfn-mysql-router/0\'")', 'print(result.fetch_all())']
2023-11-06T19:52:22.166Z [container-agent] stderr:
2023-11-06T19:52:22.166Z [container-agent] Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
2023-11-06T19:52:22.166Z [container-agent] Traceback (most recent call last):
2023-11-06T19:52:22.166Z [container-agent]   File "<string>", line 1, in <module>
2023-11-06T19:52:22.166Z [container-agent] mysqlsh.DBError: MySQL Error (2003): Shell.connect: Can't connect to MySQL server on 'heat-mysql-primary.openstack.svc.cluster.local:3306' (111)

Router fails here when checking if an old router user+metadata needs to be cleaned up: https://github.com/canonical/mysql-router-k8s-operator/blob/1704b4e190e394cfaba6b68b06debd7ae2b9a606/src/workload.py#L230
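The failure above is MySQL Error 2003 (errno 111, connection refused), a transient condition while the server recovers. One mitigation the charm could apply at this point is retrying the connection with backoff before treating it as fatal. A minimal sketch, with `connect` and `ConnectionRefused` as stand-ins for the charm's actual mysqlsh invocation and its `mysqlsh.DBError`:

```python
import time


class ConnectionRefused(Exception):
    """Stand-in for mysqlsh.DBError with MySQL Error 2003 (errno 111)."""


def connect_with_retry(connect, attempts=5, delay=2.0):
    """Retry a connection callable with linear backoff.

    `connect` is any zero-argument callable that raises
    ConnectionRefused while the server is unreachable and returns
    a session once it is back.
    """
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionRefused:
            if attempt == attempts:
                raise  # exhausted retries; surface the error
            time.sleep(delay * attempt)
```

This does not address the root cause (the server advertising itself before it has quorum), but it would ride out short recovery windows like the one in the logs.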

2023-11-06T19:53:02.306Z [container-agent] 2023-11-06 19:53:02 ERROR juju-log backend-database:159: Failed to bootstrap router
2023-11-06T19:53:02.306Z [container-agent] logged_command=['--bootstrap', 'relation-159:***@heat-mysql-primary.openstack.svc.cluster.local:3306', '--strict', '--conf-set-option', 'http_server.bind_address=127.0.0.1', '--conf-use-gr-notifications']
2023-11-06T19:53:02.306Z [container-agent] stderr:
2023-11-06T19:53:02.306Z [container-agent] Error: The provided server is currently not in a InnoDB cluster group with quorum and thus may contain inaccurate or outdated data.

Router has succeeded in cleaning up the old router user & metadata (since it's failing a line later): https://github.com/canonical/mysql-router-k8s-operator/blob/1704b4e190e394cfaba6b68b06debd7ae2b9a606/src/workload.py#L231

2023-11-06T19:54:20.935Z [container-agent] 2023-11-06 19:54:20 ERROR juju-log backend-database:159: Failed to bootstrap router
2023-11-06T19:54:20.935Z [container-agent] logged_command=['--bootstrap', 'relation-159:***@heat-mysql-primary.openstack.svc.cluster.local:3306', '--strict', '--conf-set-option', 'http_server.bind_address=127.0.0.1', '--conf-use-gr-notifications']
2023-11-06T19:54:20.935Z [container-agent] stderr:
2023-11-06T19:54:20.935Z [container-agent] Error: It appears that a router instance named 'system' has been previously configured in this host. If that instance no longer exists, use the --force option to overwrite it.

While MySQL server was recovering, I'm guessing it overrode/reverted the changes mysql-router made to the router metadata (but not the router user)
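If the stale registration is expected after such a revert, the charm could tolerate it by retrying the bootstrap with `--force`, as the router's own error message suggests. A hedged sketch, where `run` is a stand-in for the charm's command runner (not its actual API) and is assumed to raise with the router's stderr on failure:

```python
def bootstrap_router(run, args):
    """Run `mysqlrouter --bootstrap ...`, retrying once with --force
    if a stale configuration from a previous attempt is detected.

    `run` is a hypothetical command runner that raises RuntimeError
    carrying the router's stderr when the command fails.
    """
    try:
        run(["--bootstrap", *args])
    except RuntimeError as e:
        if "previously configured in this host" in str(e):
            # Stale registration left over from an earlier attempt;
            # --force overwrites it, per the router's error message.
            run(["--bootstrap", *args, "--force"])
        else:
            raise
```

Whether `--force` is safe here depends on the metadata state, so this is only a sketch of the retry shape, not a recommendation to always force.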


I believe the cause of this issue is the same as here: https://github.com/canonical/mysql-k8s-operator/issues/260#issuecomment-1674717593

MySQL server is providing connection information to MySQL Router when it is not ready to serve traffic (i.e. not in a quorum)

Router, when it sees the connection information, assumes that the cluster is available and that any operations router performs will be persisted. Router deletes the router user & router cluster metadata, assuming that if one of those changes goes through, both changes will go through (it deletes the user after the metadata as a safeguard). However, during the server's recovery process, the user deletion goes through but the metadata deletion is reverted, causing router to fail to bootstrap.
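The invariant described above (metadata first, user second, and no cleanup at all unless the cluster can actually persist both changes) can be sketched as follows. `has_quorum`, `delete_metadata`, and `delete_user` are hypothetical stand-ins for the charm's real checks and mysqlsh calls:

```python
class ClusterNotAvailable(Exception):
    """Raised when the cluster lacks quorum; cleanup must not proceed."""


def cleanup_old_router(has_quorum, delete_metadata, delete_user):
    """Remove a stale router registration.

    Order matters: metadata first, user second, so a partial failure
    leaves the user behind as a marker that cleanup is incomplete.
    Refusing to run without quorum avoids the half-applied state seen
    here, where the user deletion persisted but the metadata deletion
    was rolled back during server recovery.
    """
    if not has_quorum():
        raise ClusterNotAvailable("cluster has no quorum; retry later")
    delete_metadata()
    delete_user()
```

In other words, the safeguard ordering only holds if both deletions land in a cluster that can durably commit them; gating on quorum restores that assumption.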

github-actions[bot] commented 10 months ago

https://warthogs.atlassian.net/browse/DPE-3114