canonical / mysql-k8s-operator

A Charmed Operator for running MySQL on Kubernetes
https://charmhub.io/mysql-k8s
Apache License 2.0
8 stars 15 forks source link

The MySQL cluster is not recovered automatically under certain scenarios. #485

Open peterctl opened 2 months ago

peterctl commented 2 months ago

Killing multiple MySQL pods at the same time without waiting for them to fully come online will put the cluster in an unhealthy state, but it will not trigger the reboot cluster from complete outage flow to recover the cluster. This will also happen during a networking outage, which in this case was simulated by taking down all NICs on all the microk8s machines in a single AZ.

Steps to reproduce

  1. By killing pods. a. Kill multiple mysql pods at the same time.
    kubectl delete mysql-0
    kubectl delete mysql-1
  2. By simulating a network outage. a. Take down all NICs of the microk8s machines.
    juju machines -m microk8s | awk '/AZ3/ {print $1}' | while read machine; do
      juju ssh $machine '
        for nic in bond0 bondM; do  # adjust as necessary
          sudo ip link set dev $nic down
        done
      '
    done

    b. Wait 15 minutes for the network outage to affect mysql, then reboot all the machines in the AZ to bring the network back online.

Expected behavior

The cluster will go offline and the reboot cluster from complete outage flow will be triggered to recover the cluster.

Actual behavior

The leader unit will not go offline and the reboot cluster flow will not be triggered, leaving the cluster in an inconsistent state.

mysql/0          maintenance  idle       10.1.1.10         offline
mysql/1          maintenance  idle       10.1.1.11         offline
mysql/2*         maintenance  idle       10.1.1.12         Unable to get member state

Versions

Operating system: Ubuntu 22.04 Jammy Jellyfish Juju CLI: 3.4.3 Juju agent: 3.4.3 Charm revision: 153 (channel 8.0/stable) microk8s: 1.28

Additional context

To recover the cluster, the mysqld_safe Pebble service needs to be restarted inside the leader unit:

juju ssh --container mysql mysql/leader /charm/bin/pebble restart mysqld_safe
github-actions[bot] commented 2 months ago

https://warthogs.atlassian.net/browse/DPE-5183

paulomach commented 2 months ago

Thanks for the detailed report @peterctl , it will be useful to create a test. We will queue it for fix.

natalytvinova commented 1 month ago

Facing this bug too

shayancanonical commented 1 month ago

for documentation: if the pod is killed while mysql-pebble-ready is running for the first time, we see the following error trace. i think that there is more we can do to gracefully recover, but still gathering more info for the below encountered error

unit-mysql-k8s4-0: 21:44:26 INFO juju.downloader download verified ("ch:amd64/jammy/mysql-k8s-201")                                                                                               [330/1099]
unit-mysql-k8s4-0: 21:44:35 INFO juju.worker.uniter hooks are retried true                                                                                                                                  
unit-mysql-k8s4-0: 21:44:35 INFO juju.worker.uniter found queued "install" hook                                                                                                                             
unit-mysql-k8s4-0: 21:44:37 INFO unit.mysql-k8s4/0.juju-log Running legacy hooks/install.                                                                                                                   
unit-mysql-k8s4-0: 21:44:38 INFO juju.worker.uniter.operation ran "install" hook (via hook dispatching script: dispatch)                                                                                    
unit-mysql-k8s4-0: 21:44:39 INFO juju.worker.uniter.operation ran "upgrade-relation-created" hook (via hook dispatching script: dispatch)                                                                   
unit-mysql-k8s4-0: 21:44:40 INFO juju.worker.uniter.operation ran "restart-relation-created" hook (via hook dispatching script: dispatch)                                                                   
unit-mysql-k8s4-0: 21:44:41 INFO juju.worker.uniter.operation ran "database-peers-relation-created" hook (via hook dispatching script: dispatch)                                                            
unit-mysql-k8s4-0: 21:44:41 INFO juju.worker.uniter found queued "leader-elected" hook                                                                                                                      
unit-mysql-k8s4-0: 21:44:43 INFO juju.worker.uniter.operation ran "leader-elected" hook (via hook dispatching script: dispatch)                                                                             
unit-mysql-k8s4-0: 21:44:43 INFO unit.mysql-k8s4/0.juju-log Setting up the logrotate configurations                                                                                                         
unit-mysql-k8s4-0: 21:44:54 INFO unit.mysql-k8s4/0.juju-log Configuring instance                                                                                                                            
unit-mysql-k8s4-0: 21:44:55 INFO unit.mysql-k8s4/0.juju-log Installing plugin='audit_log'                                                                                                                   
unit-mysql-k8s4-0: 21:44:55 INFO unit.mysql-k8s4/0.juju-log Installing plugin='audit_log_filter'                                                                                                            
unit-mysql-k8s4-0: 21:45:06 WARNING unit.mysql-k8s4/0.juju-log Failed to check if cluster metadata exists from_instance='mysql-k8s4-0.mysql-k8s4-endpoints.database.svc.cluster.local'                      
unit-mysql-k8s4-0: 21:45:06 INFO unit.mysql-k8s4/0.juju-log Creating cluster cluster-7d7613ead7f7225f76cd0d6003b5c3d5                                                                                       
unit-mysql-k8s4-0: 21:45:09 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM                                                                                                           
unit-mysql-k8s4-0: 21:45:09 WARNING juju.worker.uniter.operation hook "mysql-pebble-ready" was terminated                                                                                                   
unit-mysql-k8s4-0: 21:45:09 ERROR juju.worker.uniter pebble poll failed for container "mysql": failed to send pebble-ready event: terminated                                                                
unit-mysql-k8s4-0: 21:45:10 INFO unit.mysql-k8s4/0.juju-log Setting up the logrotate configurations                                                                                                         
unit-mysql-k8s4-0: 21:45:10 INFO unit.mysql-k8s4/0.juju-log Reconciling the pebble layer                                                                                                                    
unit-mysql-k8s4-0: 21:45:10 ERROR unit.mysql-k8s4/0.juju-log Uncaught exception while in charm code:                                                                                                        
Traceback (most recent call last):                                                                                                                                                                          
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/./src/charm.py", line 928, in <module>                                                                                                                 
    main(MySQLOperatorCharm)                                                                                                                                                                                
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/main.py", line 551, in main                                                                                                                   
    manager.run()                                                                                                                                                                                           
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/main.py", line 530, in run                                                                                                                    
    self._emit()                                                                                                                                                                                            
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/main.py", line 519, in _emit                                                                                                                  
    _emit_charm_event(self.charm, self.dispatcher.event_name)                                                                                                                                               
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/main.py", line 147, in _emit_charm_event                                                                                                      
    event_to_emit.emit(*args, **kwargs)                                                                                                                                                                     
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/framework.py", line 348, in emit                                                                                                              
    framework._emit(event)                                                                                                                                                                                  
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/framework.py", line 860, in _emit                                                                                                             
    self._reemit(event_path)                                                                                                                                                                                
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/framework.py", line 950, in _reemit                                                                                                           
    custom_handler(event)                                                                                                                                                                                   
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 735, in wrapped_function                                                                               
    return callable(*args, **kwargs)  # type: ignore                                                                                                                                                        
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/./src/charm.py", line 700, in _on_mysql_pebble_ready                                                                                                   
    self._reconcile_pebble_layer(container)                                                                                                                                                                 
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 735, in wrapped_function                                                                               
    return callable(*args, **kwargs)  # type: ignore                                                                                                                                                        
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/./src/charm.py", line 412, in _reconcile_pebble_layer                                                                                                  
    container.replan()                                                                                                                                                                                      
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/model.py", line 2259, in replan                                                                                                               
    self._pebble.replan_services()                                                                                                                                                                          
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/pebble.py", line 2129, in replan_services                                                                                                     
    return self._services_action('replan', [], timeout, delay)                                                                                                                                              
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/venv/ops/pebble.py", line 2226, in _services_action                                                                                                    
    raise ChangeError(change.err, change)                                                                                                                                                                   
ops.pebble.ChangeError: cannot perform the following tasks:                                                                                                                                                 
- Start service "mysqld_safe" (cannot start service while terminating)                                                                                                                                      
----- Logs from task 0 -----                                                                                                                                                                                
2024-09-26T21:45:10Z ERROR cannot start service while terminating                                                                                                                                           
-----                                
unit-mysql-k8s4-0: 21:45:10 ERROR juju.worker.uniter.operation hook "mysql-pebble-ready" (via hook dispatching script: dispatch) failed: exit status 1                                                      
unit-mysql-k8s4-0: 21:45:10 ERROR juju.worker.uniter pebble poll failed for container "mysql": failed to send pebble-ready event: hook failed                                                               
unit-mysql-k8s4-0: 21:45:12 INFO juju.worker.uniter.operation ran "database-storage-attached" hook (via hook dispatching script: dispatch)                                                                  
unit-mysql-k8s4-0: 21:45:12 INFO juju.worker.uniter unit "mysql-k8s4/0" shutting down: agent should be terminated                                                                                           
unit-mysql-k8s4-0: 21:45:28 INFO juju.cmd running containerAgent [3.5.3 63d460f9ee6c7c710131961390687e7a0ab90470 gc go1.21.12]                                                                              
unit-mysql-k8s4-0: 21:45:28 INFO juju.cmd.containeragent.unit start "unit"                                                                                                                                  
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.upgradesteps upgrade steps for 3.5.3 have already been run.                                                                                                    
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.probehttpserver starting http server on 127.0.0.1:65301                                                                                                        
unit-mysql-k8s4-0: 21:45:28 INFO juju.api cannot resolve "controller-service.controller-uk8s-3-5-3.svc.cluster.local": lookup controller-service.controller-uk8s-3-5-3.svc.cluster.local: operation was canc
eled                                                                                                                                                                                                        
unit-mysql-k8s4-0: 21:45:28 INFO juju.api connection established to "wss://10.152.183.30:17070/model/31defe4d-ef81-431c-8d1c-376ecd60e1e7/api"                                                              
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.apicaller [31defe] "unit-mysql-k8s4-0" successfully connected to "10.152.183.30:17070"                                                                         
unit-mysql-k8s4-0: 21:45:28 INFO juju.api connection established to "wss://controller-service.controller-uk8s-3-5-3.svc.cluster.local:17070/model/31defe4d-ef81-431c-8d1c-376ecd60e1e7/api"                 
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.apicaller [31defe] "unit-mysql-k8s4-0" successfully connected to "controller-service.controller-uk8s-3-5-3.svc.cluster.local:17070"                            
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.migrationminion migration migration phase is now: NONE                                                                                                         
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.logger logger worker started                                                                                                                                   
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.leadership mysql-k8s4/0 promoted to leadership of mysql-k8s4                                                                                                   
unit-mysql-k8s4-0: 21:45:28 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""         
unit-mysql-k8s4-0: 21:45:28 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-mysql-k8s4-0                                                                                            
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.caasupgrader abort check blocked until version event received                                                                                                  
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.caasupgrader unblocking abort check                                                                                                                            
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.uniter unit "mysql-k8s4/0" started                                                                                                                             
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.uniter hooks are retried true                                                                                                                                  
unit-mysql-k8s4-0: 21:45:28 INFO juju.worker.uniter.charm downloading ch:amd64/jammy/mysql-k8s-201 from API server
unit-mysql-k8s4-0: 21:45:28 INFO juju.downloader downloading from ch:amd64/jammy/mysql-k8s-201
unit-mysql-k8s4-0: 21:45:29 INFO juju.downloader download complete ("ch:amd64/jammy/mysql-k8s-201")
unit-mysql-k8s4-0: 21:45:29 INFO juju.downloader download verified ("ch:amd64/jammy/mysql-k8s-201")
unit-mysql-k8s4-0: 21:45:38 INFO juju.worker.uniter found queued "upgrade-charm" hook
unit-mysql-k8s4-0: 21:45:40 INFO unit.mysql-k8s4/0.juju-log Running legacy hooks/upgrade-charm.
unit-mysql-k8s4-0: 21:45:40 ERROR unit.mysql-k8s4/0.juju-log Cluster upgrade failed, ensure pre-upgrade checks are ran first.
unit-mysql-k8s4-0: 21:45:41 INFO juju.worker.uniter.operation ran "upgrade-charm" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:45:41 INFO juju.worker.uniter found queued "config-changed" hook
unit-mysql-k8s4-0: 21:45:42 INFO juju.worker.uniter.operation ran "config-changed" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:45:42 INFO juju.worker.uniter found queued "start" hook
unit-mysql-k8s4-0: 21:45:43 INFO unit.mysql-k8s4/0.juju-log Running legacy hooks/start.
unit-mysql-k8s4-0: 21:45:44 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:45:45 INFO unit.mysql-k8s4/0.juju-log Setting up the logrotate configurations
unit-mysql-k8s4-0: 21:45:45 INFO unit.mysql-k8s4/0.juju-log Reconciling the pebble layer
unit-mysql-k8s4-0: 21:45:49 INFO juju.worker.uniter.operation ran "mysql-pebble-ready" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:45:51 INFO juju.worker.uniter.operation ran "upgrade-relation-changed" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:45:52 INFO juju.worker.uniter.operation ran "database-peers-relation-changed" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:49:41 ERROR unit.mysql-k8s4/0.juju-log Failed to get cluster status for cluster-7d7613ead7f7225f76cd0d6003b5c3d5
unit-mysql-k8s4-0: 21:49:41 ERROR unit.mysql-k8s4/0.juju-log Failed to get cluster endpoints
Traceback (most recent call last):
unit-mysql-k8s4-0: 21:49:41 ERROR unit.mysql-k8s4/0.juju-log Failed to get cluster endpoints
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/src/mysql_k8s_helpers.py", line 792, in update_endpoints
    rw_endpoints, ro_endpoints, offline = self.get_cluster_endpoints(get_ips=False)
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 735, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/mysql/v0/mysql.py", line 1932, in get_cluster_endpoints
    raise MySQLGetClusterEndpointsError("Failed to get endpoints from cluster status")
charms.mysql.v0.mysql.MySQLGetClusterEndpointsError: Failed to get endpoints from cluster status
unit-mysql-k8s4-0: 21:49:42 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-mysql-k8s4-0: 21:54:05 ERROR unit.mysql-k8s4/0.juju-log Failed to get cluster status for cluster-7d7613ead7f7225f76cd0d6003b5c3d5
unit-mysql-k8s4-0: 21:54:05 ERROR unit.mysql-k8s4/0.juju-log Failed to get cluster endpoints
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/src/mysql_k8s_helpers.py", line 792, in update_endpoints
    rw_endpoints, ro_endpoints, offline = self.get_cluster_endpoints(get_ips=False)
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 735, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-mysql-k8s4-0/charm/lib/charms/mysql/v0/mysql.py", line 1932, in get_cluster_endpoints
    raise MySQLGetClusterEndpointsError("Failed to get endpoints from cluster status")
charms.mysql.v0.mysql.MySQLGetClusterEndpointsError: Failed to get endpoints from cluster status
unit-mysql-k8s4-0: 21:54:05 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
shayancanonical commented 3 weeks ago

@peterctl i have tried, unsuccessfully, to replicate the issue locally (with rev 153 of mysql-k8s). i deployed a cluster of 3 nodes and tried at least 15 times to delete all pods in the cluster without facing the issue. Furthermore, juju machines is not valid for a k8s model. Would you be able to share more about the environment you faced this issue on?

Also, if you run into the above issue again, please collect the logs if you are able (in particular the debug logs, pebble logs, and mysqld logs in /var/log/mysql/error.log) - the logs will help us better determine what is happening.

natalytvinova commented 2 weeks ago

@shayancanonical as I meantioned we're also facing this bug, it's in the ongoing deployment and the workaround doesn't help. I would like to raise a field-critical kubeflow-bundle.yaml.txt var-log-mysql-error-log.txt pebble-logs.txt mlflow-mysql-pod-log.txt

In my case the charms are restarted/failed after initial redeployment in the next day or 2

taurus-forever commented 2 weeks ago

@natalytvinova can you please share your Juju version? Thank you!

taurus-forever commented 2 weeks ago

Just got internal reply. It should be Juju 3.5.1. This Juju version is affected by random units restart (until they cannot re-join the cluster any-longer), see: https://bugs.launchpad.net/juju/+bug/2075139 It is recommended to stay with Juju 3.4 in K8s production for now.