Open masa213f opened 3 months ago
@masa213f TBH, I'd rather not add anything Cilium-specific to Moco. Since this is a Cilium problem, middleware other than Moco can run into it too.
@ymmt2005 Thank you for the comment.
I think this failure is due to MOCO re-creating many pods at once, so I want to make some changes to MOCO. It does not have to be a rate limit on the partition. Do you have any ideas?
Indeed, going only by the case written here, it looks like a Cilium problem. However, in my view, several components can lead to this kind of failure, and this time it just happened to be Cilium. Even after tuning Cilium, the kube-controller-manager or other CNIs may cause similar problems, depending on the k8s settings and the number of MySQLClusters.
In my experience, creating and deleting pods in K8s is a time-consuming process, and we should not create or delete many pods in a short period. So, I want to stagger the re-creation of MySQL Pods when MOCO is updated. There is a risk of https://github.com/cybozu-go/moco/issues/517 recurring.
@masa213f Thank you for your opinion.
Do you have any examples of this type of rate limit in other software? Having a lot of MySQLCluster resources is NOT Moco's problem; it's the Moco user's problem.
The same can happen, for example, with ECK if a user has a lot of Elasticsearch clusters.
> Do you have any examples of this type of rate limit in other software?
I'll check it out.
What
Updating MOCO on a Kubernetes cluster with many MySQLClusters causes MySQL to be disconnected for several minutes. In our past failure, a MOCO update caused MOCO to re-create many MySQL Pods (hundreds at that time) almost simultaneously. Cilium could not keep up with the pod update events, which delayed switching the Services' backends. As a result, the MySQL instances were disconnected for several minutes. (This failure may depend on the configuration of the k8s cluster, such as the CNI.)
To prevent such failures, I want to limit the rate at which MySQL Pods are re-created.
How
Limit the speed at which the partition of each MySQL StatefulSet is reconciled (being implemented in #628 and #633).
Checklist