canonical / mysql-k8s-operator

A Charmed Operator for running MySQL on Kubernetes
https://charmhub.io/mysql-k8s
Apache License 2.0

Cluster fails to recover from loss of quorum #358

Open carlcsaposs-canonical opened 9 months ago

carlcsaposs-canonical commented 9 months ago

Steps to reproduce

  1. deploy 3 units of mysql-k8s from the stable channel
  2. (optional) relate to mysql-router-k8s from https://github.com/canonical/mysql-router-k8s-operator/pull/190
  3. run in a Python REPL (if unit 0 is primary)
    >>> import subprocess, time
    >>> while True:
    ...     for pod in (1, 2):
    ...         subprocess.run(f"kubectl -n foo2 delete pod mysql-k8s-{pod} --force".split())
    ...     time.sleep(5)
  4. press ctrl-c to break out of the while loop
  5. wait, run jhack ffwd, then wait again; the server doesn't recover
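
For anyone who wants to rerun step 3 outside a REPL, the loop can be sketched as a small script. This is only a sketch under the same assumptions as the steps above (namespace foo2, app name mysql-k8s, unit 0 is the primary). The function names and the rounds, run and sleep parameters are hypothetical and were added so that the loop can be bounded or dry-run; they are not part of the charm.

```python
import subprocess
import time
from typing import Callable, List, Optional, Sequence


def delete_pod_command(namespace: str, app: str, pod: int) -> List[str]:
    """Build the kubectl command that force-deletes one pod of the app."""
    return ["kubectl", "-n", namespace, "delete", "pod", f"{app}-{pod}", "--force"]


def kill_replicas(
    pods: Sequence[int] = (1, 2),      # non-primary units, assuming unit 0 is primary
    namespace: str = "foo2",           # namespace used in the report
    app: str = "mysql-k8s",
    rounds: Optional[int] = None,      # None means loop until ctrl-c, as in step 3
    interval: float = 5.0,
    run: Callable[[List[str]], object] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Repeatedly force-delete the given pods; return the number of completed rounds."""
    completed = 0
    try:
        while rounds is None or completed < rounds:
            for pod in pods:
                run(delete_pod_command(namespace, app, pod))
            sleep(interval)
            completed += 1
    except KeyboardInterrupt:
        # Step 4: ctrl-c ends the loop without a traceback.
        pass
    return completed
```

Calling kill_replicas() with the defaults matches step 3 and runs until interrupted; passing rounds=N bounds the loop, and passing run=print shows the commands without touching the cluster.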

Expected behavior

Server recovers from loss of quorum

Actual behavior

Server stays stuck without quorum

Versions

Operating system: Ubuntu 22.04

Juju CLI: 3.1.7-genericlinux-amd64

Juju agent: 3.1.7

Charm revision: 113

microk8s: MicroK8s v1.28.3 revision 6091

Log output

Juju debug log: no-quorum-stuck-debug-log.txt

unit-mysql-k8s-0: 10:06:33 WARNING unit.mysql-k8s/0.juju-log Failed to get cluster primary addresses
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/src/mysql_k8s_helpers.py", line 666, in _run_mysqlsh_script
    stdout, _ = process.wait_output()
  File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1441, in wait_output
    raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
ops.pebble.ExecError: non-zero exit code 1 executing ['/usr/bin/mysqlsh', '--no-wizard', '--python', '--verbose=1', '-f', '/tmp/script.py', ';', 'rm', '/tmp/script.py'], stdout='', stderr='Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\nverbose: 2024-01-19T10:06:33Z: Loading startup files...\nverbose: 2024-01-19T10:06:33Z: Loading plugins...\nverbose: 2024-01-19T10:06:33Z: Connecting to MySQL at: clusteradmin@mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local\nverbose: 2024-01-19T10:06:33Z: Shell.connect_to_primary: tid=5730: CONNECTED: mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local\nverbose: 2024-01-19T10:06:33Z: Redirecting session from \'mysqlx://clusteradmin@mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local:33060\' to a PRIMARY of an InnoDB cluster or ReplicaSet...\nTraceback (most recent call last):\n  File "<string>", line 1, in <module>\nmysqlsh.Error: Shell Error (51011): Shell.connect_to_primary: The InnoDB cluster appears to be under a partial or total outage and an ONLINE PRIMARY cannot be selected. (Group has no quorum)\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/lib/charms/mysql/v0/mysql.py", line 1767, in get_cluster_primary_address
    output = self._run_mysqlsh_script("\n".join(get_cluster_primary_commands))
  File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/src/mysql_k8s_helpers.py", line 669, in _run_mysqlsh_script
    raise MySQLClientError(e.stderr)
charms.mysql.v0.mysql.MySQLClientError: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
verbose: 2024-01-19T10:06:33Z: Loading startup files...
verbose: 2024-01-19T10:06:33Z: Loading plugins...
verbose: 2024-01-19T10:06:33Z: Connecting to MySQL at: clusteradmin@mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local
verbose: 2024-01-19T10:06:33Z: Shell.connect_to_primary: tid=5730: CONNECTED: mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local
verbose: 2024-01-19T10:06:33Z: Redirecting session from 'mysqlx://clusteradmin@mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local:33060' to a PRIMARY of an InnoDB cluster or ReplicaSet...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
mysqlsh.Error: Shell Error (51011): Shell.connect_to_primary: The InnoDB cluster appears to be under a partial or total outage and an ONLINE PRIMARY cannot be selected. (Group has no quorum)
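
The relevant part of the stderr above is the final mysqlsh.Error line; the rest is locale and verbose noise. As a rough sketch of how that line could be picked out when scanning a debug log, the helper below extracts the shell error code and message and checks for the "Group has no quorum" text. The helper names are hypothetical and this is not how the charm handles the error; it also assumes the message format shown in this log.

```python
import re
from typing import Optional, Tuple

# Matches lines such as:
#   mysqlsh.Error: Shell Error (51011): Shell.connect_to_primary: ... (Group has no quorum)
SHELL_ERROR = re.compile(r"mysqlsh\.Error: Shell Error \((\d+)\): (.*)")


def parse_shell_error(stderr: str) -> Optional[Tuple[int, str]]:
    """Return (error code, message) from the first mysqlsh shell error line, if any."""
    match = SHELL_ERROR.search(stderr)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def is_no_quorum(stderr: str) -> bool:
    """True if the shell error message reports that the group has no quorum."""
    parsed = parse_shell_error(stderr)
    return parsed is not None and "Group has no quorum" in parsed[1]
```

The check matches on the message text instead of the numeric code, since this log alone does not show whether the same code is used for other outage conditions.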

Additional context

If all pods are deleted (including the primary), the server usually recovers

github-actions[bot] commented 9 months ago

https://warthogs.atlassian.net/browse/DPE-3334