cybozu-go / moco

MySQL operator on Kubernetes using GTID-based semi-synchronous replication.
https://cybozu-go.github.io/moco/
Apache License 2.0
273 stars, 22 forks

rate limit for re-creating MySQL Pods #698

Open masa213f opened 3 months ago

masa213f commented 3 months ago

What

Updating MOCO on a Kubernetes cluster with many MySQLClusters can leave MySQL unreachable for several minutes. In our past failures, a MOCO update made MOCO re-create many MySQL Pods (hundreds at that time) almost simultaneously. Cilium could not keep up with the Pod update events, so switching the Services' backends was delayed. As a result, the MySQL instances were disconnected for several minutes. (This failure may depend on the configuration of the k8s cluster, such as the CNI.)

To prevent such failures, I want to limit the rate at which MySQL Pods are re-created.

How

Limit the speed at which MOCO reconciles the partition of each MySQL StatefulSet (implementing with #628 and #633).

Checklist

ymmt2005 commented 3 months ago

@masa213f TBH, I'd rather not add anything Cilium-specific to Moco. Since it's a Cilium problem, other middleware besides Moco can face similar problems.

masa213f commented 3 months ago

@ymmt2005 Thank you for the comment.

I think this failure is due to MOCO re-creating many pods at once. So, I want to add some updates to MOCO. It does not have to be a rate limit on the partition. Do you have any ideas?

Indeed, reading only the case described here, it looks like a Cilium problem. However, in my view, several components could cause this kind of failure, and this time it just happened to be Cilium. Even after tuning Cilium, kube-controller-manager or other CNIs may cause similar problems, depending on the k8s settings and the number of MySQLClusters.

Based on my experience, creating and deleting pods in K8s is time-consuming, and we should not create or delete many pods in a short period. So, I want to stagger the re-creation timing of MySQL Pods when MOCO is updated. There is a risk of https://github.com/cybozu-go/moco/issues/517 recurring.

ymmt2005 commented 3 months ago

@masa213f Thank you for your opinion.

Do you have any examples of this type of rate limit in other software? Having a lot of MySQLCluster resources is NOT Moco's problem; it's a moco user's problem.

The same can happen, for example, with ECK if a user has a lot of Elasticsearch clusters.

masa213f commented 3 months ago

> Do you have any examples of this type of rate limit in other software?

I'll check it out.