canonical / mysql-k8s-operator

A Charmed Operator for running MySQL on Kubernetes
https://charmhub.io/mysql-k8s
Apache License 2.0

Blocked status `failed to recover cluster.` #329

Closed: carlcsaposs-canonical closed this issue 4 weeks ago

carlcsaposs-canonical commented 12 months ago

Steps to reproduce

  1. Steps 3-6 from https://microstack.run/#get-started
  2. juju refresh mysql --channel 8.0/edge
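
For reference, step 2 and a way to observe the result look roughly like this (a sketch; only the standard juju CLI used elsewhere in this report is assumed):

```shell
# Refresh the charm to the edge channel (step 2 above)
juju refresh mysql --channel 8.0/edge

# Then check the app/unit status until it settles; in this report the app
# ends up blocked with "failed to recover cluster."
juju status mysql
```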

Expected behavior

mysql app upgrades successfully & goes into active state

Actual behavior

mysql app enters blocked status `failed to recover cluster.`

Versions

Operating system: Ubuntu 22.04.3 LTS

Juju CLI: 3.2.3-genericlinux-amd64

Juju agent: 3.2.0

Charm revision: 99 before refresh (current 8.0/stable), 109 after refresh (current 8.0/edge)

microk8s: MicroK8s v1.26.9 revision 6059

Log output

Juju debug log: sunbeam-debug-log.txt sunbeam-debug-log-filtered.txt

unit-mysql-0: 09:22:26 INFO juju.cmd running containerAgent [3.2.0 c7107ada8c471aa3ba105e5433e61861227e2ed4 gc go1.20.4]
unit-mysql-0: 09:22:26 INFO juju.worker.upgradesteps upgrade steps for 3.2.0 have already been run.
unit-mysql-0: 09:22:26 INFO juju.api connection established to "wss://10.150.15.206:17070/model/9b07ebf5-8cf1-4858-8a94-3086f8416535/api"
unit-mysql-0: 09:22:26 INFO juju.worker.migrationminion migration phase is now: NONE
unit-mysql-0: 09:22:26 INFO juju.worker.caasupgrader abort check blocked until version event received
unit-mysql-0: 09:22:26 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
unit-mysql-0: 09:22:26 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-mysql-0
unit-mysql-0: 09:22:27 INFO juju.worker.uniter hooks are retried true
unit-mysql-0: 09:22:27 INFO juju.downloader downloading from ch:amd64/jammy/mysql-k8s-109
unit-mysql-0: 09:22:27 INFO juju.downloader download verified ("ch:amd64/jammy/mysql-k8s-109")
unit-mysql-0: 09:22:37 INFO juju.worker.uniter found queued "upgrade-charm" hook
unit-mysql-0: 09:22:39 ERROR unit.mysql/0.juju-log Cluster upgrade failed, ensure pre-upgrade checks are ran first.
unit-mysql-0: 09:22:39 INFO juju.worker.uniter found queued "config-changed" hook
unit-mysql-0: 09:22:40 INFO juju.worker.uniter.operation ran "config-changed" hook (via hook dispatching script: dispatch)
unit-mysql-0: 09:22:40 INFO juju.worker.uniter reboot detected; triggering implicit start hook to notify charm
unit-mysql-0: 09:22:41 INFO unit.mysql/0.juju-log Running legacy hooks/start.
unit-mysql-0: 09:22:44 INFO unit.mysql/0.juju-log Setting up the logrotate configurations
unit-mysql-0: 09:22:51 INFO unit.mysql/0.juju-log Unit workload member-state is offline with member-role unknown
unit-mysql-0: 09:22:52 ERROR unit.mysql/0.juju-log Failed to reboot cluster
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-0/charm/src/mysql_k8s_helpers.py", line 684, in _run_mysqlsh_script
    stdout, _ = process.wait_output()
  File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/pebble.py", line 1359, in wait_output
    raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
ops.pebble.ExecError: non-zero exit code 1 executing ['/usr/bin/mysqlsh', '--no-wizard', '--python', '--verbose=1', '-f', '/tmp/script.py', ';', 'rm', '/tmp/script.py'], stdout='', stderr="Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\nverbose: 2023-10-23T09:22:52Z: Loading startup files...\nverbose: 2023-10-23T09:22:52Z: Loading plugins...\nverbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: clusteradmin@mysql-0.mysql-endpoints\nverbose: 2023-10-23T09:22:52Z: Shell.connect: tid=33: CONNECTED: mysql-0.mysql-endpoints\nverbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints:3306?connect-timeout=5000\nverbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=34: CONNECTED: mysql-0.mysql-endpoints:3306\nverbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints:3306?connect-timeout=5000\nverbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=35: CONNECTED: mysql-0.mysql-endpoints:3306\nverbose: 2023-10-23T09:22:52Z: Group Replication 'group_name' value: 072799b1-7180-11ee-bc9f-76d5c7fb0362\nverbose: 2023-10-23T09:22:52Z: Metadata 'group_name' value: 072799b1-718" [truncated]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-0/charm/lib/charms/mysql/v0/mysql.py", line 1989, in reboot_from_complete_outage
    self._run_mysqlsh_script("\n".join(reboot_from_outage_command))
  File "/var/lib/juju/agents/unit-mysql-0/charm/src/mysql_k8s_helpers.py", line 687, in _run_mysqlsh_script
    raise MySQLClientError(e.stderr)
charms.mysql.v0.mysql.MySQLClientError: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
verbose: 2023-10-23T09:22:52Z: Loading startup files...
verbose: 2023-10-23T09:22:52Z: Loading plugins...
verbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: clusteradmin@mysql-0.mysql-endpoints
verbose: 2023-10-23T09:22:52Z: Shell.connect: tid=33: CONNECTED: mysql-0.mysql-endpoints
verbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints:3306?connect-timeout=5000
verbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=34: CONNECTED: mysql-0.mysql-endpoints:3306
verbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints:3306?connect-timeout=5000
verbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=35: CONNECTED: mysql-0.mysql-endpoints:3306
verbose: 2023-10-23T09:22:52Z: Group Replication 'group_name' value: 072799b1-7180-11ee-bc9f-76d5c7fb0362
verbose: 2023-10-23T09:22:52Z: Metadata 'group_name' value: 072799b1-7180-11ee-bc9f-76d5c7fb0362
verbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints.openstack.svc.cluster.local:3306?connect-timeout=5000
verbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=36: CONNECTED: mysql-0.mysql-endpoints.openstack.svc.cluster.local:3306
verbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints.openstack.svc.cluster.local:3306?connect-timeout=5000
verbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=37: CONNECTED: mysql-0.mysql-endpoints.openstack.svc.cluster.local:3306
No PRIMARY member found for cluster 'cluster-b56bbe7bd4a6cc012b44ba93360df3b5'
verbose: 2023-10-23T09:22:52Z: ClusterSet info: member, primary, not primary_invalidated, not removed from set, primary status: UNKNOWN
Restoring the Cluster 'cluster-b56bbe7bd4a6cc012b44ba93360df3b5' from complete outage...

ERROR: RuntimeError: The current session instance does not belong to the Cluster: 'cluster-b56bbe7bd4a6cc012b44ba93360df3b5'.
Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.reboot_cluster_from_complete_outage: The current session instance does not belong to the Cluster: 'cluster-b56bbe7bd4a6cc012b44ba93360df3b5'.

unit-mysql-0: 09:22:53 INFO juju.worker.uniter.operation ran "mysql-pebble-ready" hook (via hook dispatching script: dispatch)

Additional context

Attempted to reproduce issue encountered by @javacruft
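
When a unit is blocked like this, the cluster state can also be inspected directly (a sketch; assumes the charm's get-cluster-status action is available on this revision and uses Juju 3.x action syntax):

```shell
# Ask the leader unit for the InnoDB Cluster status as reported by mysqlsh
juju run mysql/leader get-cluster-status
```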

github-actions[bot] commented 12 months ago

https://warthogs.atlassian.net/browse/DPE-2832

carlcsaposs-canonical commented 12 months ago

Potential cause: `ERROR unit.mysql/0.juju-log Cluster upgrade failed, ensure pre-upgrade checks are ran first.`

carlcsaposs-canonical commented 12 months ago

Tried running the pre-upgrade-check action before juju refresh.

Result: blocked status `upgrade failed. Check logs for rollback instruction`

pre-upgrade-debug-log.txt pre-upgrade-debug-log-filtered.txt
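
For reference, the sequence tried here was along these lines (a sketch; pre-upgrade-check is the charm action name and Juju 3.x action syntax is assumed):

```shell
# Run the charm's pre-upgrade checks on the leader before refreshing
juju run mysql/leader pre-upgrade-check

# Then refresh to the edge channel
juju refresh mysql --channel 8.0/edge
```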

gboutry commented 11 months ago

Encountered the same issue in a deployment with 7 mysql servers. 5 out of the 7 failed to recover after a machine reboot with the same error.

Complete debug log: debug-log.log. Logs from each failing mysql server: cinder-mysql.log, heat-mysql.log, keystone-mysql.log, nova-mysql.log, placement-mysql.log

paulomach commented 11 months ago

> Encountered the same issue in a deployment with 7 mysql servers. 5 out of the 7 failed to recover after a machine reboot with the same error.

@gboutry there's a fix on PR #324, released in the edge channel. We are working to promote it to stable.
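
For anyone hitting this, picking up the fix from the edge channel is a refresh away (a sketch; channel and app names as used earlier in this issue):

```shell
# Move an existing deployment to the channel carrying the fix from PR #324
juju refresh mysql --channel 8.0/edge
```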

paulomach commented 9 months ago

@gboutry have you had the chance to validate the fix?

gboutry commented 1 month ago

I have not seen this issue in a long time; I will re-open it if I see it again.

paulomach commented 4 weeks ago

Considering this fixed.