bitpoke / mysql-operator

Asynchronous MySQL Replication on Kubernetes using Percona Server and Openark's Orchestrator.
https://www.bitpoke.io/docs/mysql-operator/getting-started/
Apache License 2.0

Data Loss and Cluster Failure in Kubernetes StatefulSet Due to Missing Disk for MySQL Replica-0 #898

Open tebaly opened 11 months ago

tebaly commented 11 months ago

I hit an unexpected failure during node replacement in my Kubernetes cluster that left the MySQL StatefulSet in a critical state. The disk for the MySQL replica with index 0 was lost, so that replica could not start. The other two replicas had up-to-date data, but they could not start either, because the StatefulSet's startup hung waiting on the first replica, the one that had lost its data.

To address such cases, I propose using the .spec.updateStrategy.rollingUpdate.maxUnavailable field introduced in Kubernetes v1.24. It could be set equal to the number of replicas in the StatefulSet, for example maxUnavailable = 3 with three replicas. This way, the remaining replicas with valid data might be able to launch successfully.
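As a rough sketch of the setting I mean, the snippet below builds a merge-patch body that sets maxUnavailable to the replica count. The helper name build_max_unavailable_patch is hypothetical and only for illustration. It also assumes the cluster has the alpha MaxUnavailableStatefulSet feature gate enabled, since the field is ignored otherwise, and I have not verified that this setting actually lets the healthy replicas start in the scenario above.

```python
import json


def build_max_unavailable_patch(replicas: int) -> dict:
    """Return a merge-patch body that sets maxUnavailable to the replica count.

    Hypothetical helper: it only builds the patch document and does not
    talk to the Kubernetes API.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return {
        "spec": {
            "updateStrategy": {
                "type": "RollingUpdate",
                # Assumption: the MaxUnavailableStatefulSet feature gate
                # (alpha in v1.24) is enabled on the cluster.
                "rollingUpdate": {"maxUnavailable": replicas},
            }
        }
    }


if __name__ == "__main__":
    # Three replicas, as in the example above.
    print(json.dumps(build_max_unavailable_patch(3), indent=2))
```

The printed JSON could be passed to kubectl patch statefulset with --type merge -p, though I assume the operator may reconcile the StatefulSet and revert a manual change, so the setting would more likely need to be applied by the operator itself.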

As it stood, I found no apparent way to use the data from the other replicas to recover from the failure. I had to restore from a backup, which meant additional downtime and administrative effort.

I believe adopting this setting could significantly improve the reliability and fault tolerance of StatefulSets in similar scenarios and help avoid data loss and cluster failure.

Feature State: Kubernetes v1.24 [alpha]

Thank you for considering this proposal. Best regards