Open chrislovecnm opened 3 years ago
Got this again
partitioned update: crdb: error applying updateStrategyFunc to crdb crdb-test-rm4rwc: Operation cannot be fulfilled on statefulsets.apps \"crdb\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "Operation cannot be fulfilled on statefulsets.apps \"crdb\": the object has been modified; please apply your changes to the latest version and try again\nerror applying updateStrategyFunc to crdb crdb-test-rm4rwc\ngithub.com/cockroachdb/cockroach-operator/pkg/update.UpdateClusterRegionStatefulSet\n\tpkg/update/update.go:151\ngithub.com/cockroachdb/cockroach-operator/pkg/update.updateClusterStatefulSets\n\tpkg/update/update_cockroach_version.go:138\ngithub.com/cockroachdb/cockroach-operator/pkg/update.UpdateClusterCockroachVersion\n\tpkg/update/update_cockroach_version.go:123\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.(*partitionedUpdate).Act\n\tpkg/actor/partitioned_update.go:223\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\tGOROOT/src/runtime/asm_amd64.s:1371\nfailed to update sts with partitioned update: crdb\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.(*partitionedUpdate).Act\n\tpkg/actor/partitioned_update.go:236\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\tGOROOT/src/runtime/asm_amd64.s:1371"}
2021-06-15T21:57:59.5722023Z logger.go:130: 2021-06-15T21:57:59.497Z INFO reconciling CockroachDB c
and also got this:
2021-06-15T21:55:28.9741696Z logger.go:130: 2021-06-15T21:55:28.895Z INFO Error on action {"CrdbCluster": "crdb-test-rm4rwc/crdb", "Action": "RequestCert", "err": "failed saving the annotations on request certificate: Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"crdb\": the object has been modified; please apply your changes to the latest version and try again"}
2021-06-15T21:55:28.9757070Z logger.go:130: 2021-06-15T21:55:28.895Z ERROR action failed {"CrdbCluster": "crdb-test-rm4rwc/crdb", "error": "failed saving the annotations on request certificate: Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"crdb\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"crdb\": the object has been modified; please apply your changes to the latest version and try again
\nfailed saving the annotations on request certificate
\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.(*generateCert).Act
\n\tpkg/actor/generate_cert.go:157
\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile
\n\tpkg/controller/cluster_controller.go:130
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2
\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215
\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185
\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155
\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil
\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156
\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\tGOROOT/src/runtime/asm_amd64.s:1371"}
We are running into eventual consistency. We need to check for
and handle this scenario every time we are updating objects. See
https://github.com/kubernetes/kubernetes/issues/84430#issuecomment-638376994
And there is a retry for this in client.go
https://pkg.go.dev/k8s.io/client-go/util/retry#RetryOnConflict - example code from the docs
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
// Fetch the resource here; you need to refetch it on every try, since
// if you got a conflict on the last update attempt then you need to get
// the current version before making your own changes.
pod, err := c.Pods("mynamespace").Get(name, metav1.GetOptions{})
if err ! nil {
return err
}
// Make whatever updates to the resource are needed
pod.Status.Phase = v1.PodFailed
// Try to update
_, err = c.Pods("mynamespace").UpdateStatus(pod)
// You have to return err itself here (not wrapped inside another error)
// so that RetryOnConflict can identify it correctly.
return err
})
if err != nil {
// May be conflict if max retries were hit, or may be something unrelated
// like permissions or a network error
return err
}
We need to refactor these. And also go through the other actors as well.
./actor/cluster_restart.go:187: _, err = clientset.AppsV1().StatefulSets(stsNamespace).Update(ctx, sts, metav1.UpdateOptions{})
./actor/cluster_restart.go:218: _, err := clientset.AppsV1().StatefulSets(stsNamespace).Update(ctx, sts, metav1.UpdateOptions{})
./update/update.go:198: _, err := updateSts.clientset.AppsV1().StatefulSets(stsNamespace).Update(updateSts.ctx, sts, metav1.UpdateOptions{})
./actor/resize_pvc.go:251: if _, err := clientset.CoreV1().PersistentVolumeClaims(cluster.Namespace()).Update(ctx, &pvc, metav1.UpdateOptions{}); err != nil {
We want to refactor these to clientset anyways, but I do not know if we get the same type of error here using client.Update
instead of clientset
.
./kube/helpers.go:129: result, err := ctrl.CreateOrUpdate(ctx, cl, obj, func() error {
./kube/helpers.go:195: if err := c.Update(ctx, obj); err != nil {
./actor/validate_version.go:257: if err := v.client.Update(ctx, refreshedCluster.Unwrap()); err != nil {
./actor/cluster_restart.go:144: if err := r.client.Update(ctx, refreshedCluster.Unwrap()); err != nil {
./actor/generate_cert.go:154: if err := rc.client.Update(ctx, crdbobj); err != nil {
fixed ./update/update.go:198 in https://github.com/cockroachdb/cockroach-operator/pull/597
Fixed ./actor/generate_cert.go in https://github.com/cockroachdb/cockroach-operator/pull/599
I hope that the client gives us back the same error ...
The update tests failed:
We are getting the error where the sts is out of date. We occasionally see this.
Full logs are attached.
logs_1071.zip