canonical / mysql-k8s-operator

A Charmed Operator for running MySQL on Kubernetes
https://charmhub.io/mysql-k8s
Apache License 2.0

restarted secondary fails to join the cluster back #415

Open · gboutry opened this issue 1 month ago

gboutry commented 1 month ago

Steps to reproduce

  1. Roll out a restart of the pods in a 3-unit cluster (it's not 100% reproducible, but it happens often enough); see the sketch after this list for one way to trigger it.
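
For context, a minimal sketch of one way to trigger such a rollout restart, using the Kubernetes Python client. The StatefulSet name heat-mysql and the openstack namespace are assumptions inferred from the unit FQDNs in the log below; this is just the programmatic equivalent of kubectl -n openstack rollout restart statefulset heat-mysql:

    # Sketch only: restart the charm's StatefulSet the same way
    # `kubectl rollout restart` does, by bumping the pod-template annotation.
    # Name/namespace are assumptions taken from the unit FQDNs in the logs.
    from datetime import datetime, timezone
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
                    }
                }
            }
        }
    }
    apps.patch_namespaced_stateful_set(name="heat-mysql", namespace="openstack", body=patch)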

Expected behavior

The secondary should rejoin the cluster.

Actual behavior

The secondary does not rejoin the cluster and is considered offline.

Versions

Operating system:

Juju CLI:

Juju agent:

Charm revision: 127

microk8s: MicroK8s v1.28.7 revision 6532

Log output

2024-05-16T13:45:18.111Z [container-agent] 2024-05-16 13:45:18 INFO juju-log Unit workload member-state is offline with member-role unknown
2024-05-16T13:45:21.896Z [container-agent] 2024-05-16 13:45:21 ERROR juju-log Failed to get cluster status for cluster-ab0e762c137dc447d08ce68b19fb20b3
2024-05-16T13:45:21.903Z [container-agent] 2024-05-16 13:45:21 ERROR juju-log Failed to get cluster endpoints
2024-05-16T13:45:21.903Z [container-agent] Traceback (most recent call last):
2024-05-16T13:45:21.903Z [container-agent]   File "/var/lib/juju/agents/unit-heat-mysql-0/charm/src/mysql_k8s_helpers.py", line 836, in update_endpoints
2024-05-16T13:45:21.903Z [container-agent]     rw_endpoints, ro_endpoints, offline = self.get_cluster_endpoints(get_ips=False)
2024-05-16T13:45:21.903Z [container-agent]   File "/var/lib/juju/agents/unit-heat-mysql-0/charm/lib/charms/mysql/v0/mysql.py", line 1469, in get_cluster_endpoints
2024-05-16T13:45:21.903Z [container-agent]     raise MySQLGetClusterEndpointsError("Failed to get endpoints from cluster status")
2024-05-16T13:45:21.903Z [container-agent] charms.mysql.v0.mysql.MySQLGetClusterEndpointsError: Failed to get endpoints from cluster status
2024-05-16T13:45:22.191Z [container-agent] 2024-05-16 13:45:22 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-05-16T13:47:53.387910Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 3306'
2024-05-16T13:48:00.275796Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 3306'
2024-05-16T13:48:00.385285Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 3306'
2024-05-16T13:48:07.654156Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 3306'
2024-05-16T13:48:07.767533Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 3306'
2024-05-16T13:48:08.469058Z 28247 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2024-05-16T13:48:08.469343Z 28247 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'

Additional context

After a debugging session with @paulomach, we got the instance to rejoin successfully using: c.rejoin_instance("heat-mysql-0.heat-mysql-endpoints.openstack.svc.cluster.local:3306")

The command was run from the failed unit against the primary unit, ruling out a connectivity issue between them.
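
For reference, here is roughly what that manual recovery looked like as a MySQL Shell (mysqlsh --py) session. The admin user and the primary's address are placeholders; the cluster name and the rejoining unit's FQDN are taken from the logs above:

    # Sketch only, run in MySQL Shell Python mode from the failed unit.
    # <admin-user> and <primary-address> are placeholders for this deployment.
    shell.connect("<admin-user>@<primary-address>:3306")
    c = dba.get_cluster("cluster-ab0e762c137dc447d08ce68b19fb20b3")
    c.rejoin_instance("heat-mysql-0.heat-mysql-endpoints.openstack.svc.cluster.local:3306")
    c.status()  # the unit should report ONLINE again after the rejoin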

github-actions[bot] commented 1 month ago

https://warthogs.atlassian.net/browse/DPE-4375