cockroachdb / cockroach-operator

k8s operator for CRDB
Apache License 2.0
282 stars 94 forks source link

Check for `IsConflict` throughout the code base when we update k8s objects #592

Open chrislovecnm opened 3 years ago

chrislovecnm commented 3 years ago

The update tests failed:

2021-06-15T18:41:51.2994080Z     logger.go:130: 2021-06-15T18:41:51.206Z    ERROR   action failed   {"CrdbCluster": "crdb-test-9bsgt2/crdb", "error": "failed to update sts with partitioned update: crdb: error applying updateStrategyFunc to crdb crdb-test-9bsgt2: Operation cannot be fulfilled on statefulsets.apps \"crdb\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "Operation cannot be fulfilled on statefulsets.apps \"crdb\": the object has been modified; please apply your changes to the latest version and try again\nerror applying updateStrategyFunc to crdb crdb-test-9bsgt2\ngithub.com/cockroachdb/cockroach-operator/pkg/update.UpdateClusterRegionStatefulSet\n\tpkg/update/update.go:151\ngithub.com/cockroachdb/cockroach-operator/pkg/update.updateClusterStatefulSets\n\tpkg/update/update_cockroach_version.go:138\ngithub.com/cockroachdb/cockroach-operator/pkg/update.UpdateClusterCockroachVersion\n\tpkg/update/update_cockroach_version.go:123\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.(*partitionedUpdate).Act\n\tpkg/actor/partitioned_update.go:223\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\tGOROOT/src/runtime/asm_amd64.s:1371\nfailed to update sts with partitioned update: crdb\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.(*partitionedUpdate).Act\n\tpkg/actor/partitioned_update.go:236\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\tGOROOT/src/runtime/asm_amd64.s:1371"}

We are getting the error where the sts is out of date. We occasionally see this.

Full logs are attached.
logs_1071.zip

chrislovecnm commented 3 years ago

Got this again

partitioned update: crdb: error applying updateStrategyFunc to crdb crdb-test-rm4rwc: Operation cannot be fulfilled on statefulsets.apps \"crdb\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "Operation cannot be fulfilled on statefulsets.apps \"crdb\": the object has been modified; please apply your changes to the latest version and try again\nerror applying updateStrategyFunc to crdb crdb-test-rm4rwc\ngithub.com/cockroachdb/cockroach-operator/pkg/update.UpdateClusterRegionStatefulSet\n\tpkg/update/update.go:151\ngithub.com/cockroachdb/cockroach-operator/pkg/update.updateClusterStatefulSets\n\tpkg/update/update_cockroach_version.go:138\ngithub.com/cockroachdb/cockroach-operator/pkg/update.UpdateClusterCockroachVersion\n\tpkg/update/update_cockroach_version.go:123\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.(*partitionedUpdate).Act\n\tpkg/actor/partitioned_update.go:223\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\tGOROOT/src/runtime/asm_amd64.s:1371\nfailed to update sts with partitioned update: crdb\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.(*partitionedUpdate).Act\n\tpkg/actor/partitioned_update.go:236\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\tGOROOT/src/runtime/asm_amd64.s:1371"}
2021-06-15T21:57:59.5722023Z     logger.go:130: 2021-06-15T21:57:59.497Z    INFO    reconciling CockroachDB c

and also got this:

2021-06-15T21:55:28.9741696Z     logger.go:130: 2021-06-15T21:55:28.895Z    INFO    Error on action {"CrdbCluster": "crdb-test-rm4rwc/crdb", "Action": "RequestCert", "err": "failed saving the annotations on request certificate: Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"crdb\": the object has been modified; please apply your changes to the latest version and try again"}
2021-06-15T21:55:28.9757070Z     logger.go:130: 2021-06-15T21:55:28.895Z    ERROR   action failed   {"CrdbCluster": "crdb-test-rm4rwc/crdb", "error": "failed saving the annotations on request certificate: Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"crdb\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"crdb\": the object has been modified; please apply your changes to the latest version and try again
\nfailed saving the annotations on request certificate
\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.(*generateCert).Act
\n\tpkg/actor/generate_cert.go:157
\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile
\n\tpkg/controller/cluster_controller.go:130
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2
\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215
\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185
\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155
\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil
\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156
\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\tGOROOT/src/runtime/asm_amd64.s:1371"}
chrislovecnm commented 3 years ago

We are running into eventual consistency. We need to check for

https://github.com/kubernetes/apimachinery/blob/76330795f827e44ffc47f4fdc8e0d39872bead8f/pkg/api/errors/errors.go#L492

and handle this scenario every time we are updating objects. See

https://github.com/kubernetes/kubernetes/issues/84430#issuecomment-638376994

chrislovecnm commented 3 years ago

And there is a retry for this in client.go

https://pkg.go.dev/k8s.io/client-go/util/retry#RetryOnConflict - example code from the docs

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    // Fetch the resource here; you need to refetch it on every try, since
    // if you got a conflict on the last update attempt then you need to get
    // the current version before making your own changes.
    pod, err := c.Pods("mynamespace").Get(name, metav1.GetOptions{})
    if err ! nil {
        return err
    }

    // Make whatever updates to the resource are needed
    pod.Status.Phase = v1.PodFailed

    // Try to update
    _, err = c.Pods("mynamespace").UpdateStatus(pod)
    // You have to return err itself here (not wrapped inside another error)
    // so that RetryOnConflict can identify it correctly.
    return err
})
if err != nil {
    // May be conflict if max retries were hit, or may be something unrelated
    // like permissions or a network error
    return err
}
chrislovecnm commented 3 years ago

We need to refactor these. And also go through the other actors as well.

./actor/cluster_restart.go:187: _, err = clientset.AppsV1().StatefulSets(stsNamespace).Update(ctx, sts, metav1.UpdateOptions{})
./actor/cluster_restart.go:218: _, err := clientset.AppsV1().StatefulSets(stsNamespace).Update(ctx, sts, metav1.UpdateOptions{})
./update/update.go:198: _, err := updateSts.clientset.AppsV1().StatefulSets(stsNamespace).Update(updateSts.ctx, sts, metav1.UpdateOptions{})
./actor/resize_pvc.go:251: if _, err := clientset.CoreV1().PersistentVolumeClaims(cluster.Namespace()).Update(ctx, &pvc, metav1.UpdateOptions{}); err != nil {

We want to refactor these to clientset anyways, but I do not know if we get the same type of error here using client.Update instead of clientset.

./kube/helpers.go:129:  result, err := ctrl.CreateOrUpdate(ctx, cl, obj, func() error {
./kube/helpers.go:195:  if err := c.Update(ctx, obj); err != nil {
./actor/validate_version.go:257: if err := v.client.Update(ctx, refreshedCluster.Unwrap()); err != nil {
./actor/cluster_restart.go:144: if err := r.client.Update(ctx, refreshedCluster.Unwrap()); err != nil {
./actor/generate_cert.go:154: if err := rc.client.Update(ctx, crdbobj); err != nil {
chrislovecnm commented 3 years ago

fixed ./update/update.go:198 in https://github.com/cockroachdb/cockroach-operator/pull/597

chrislovecnm commented 3 years ago

Fixed ./actor/generate_cert.go in https://github.com/cockroachdb/cockroach-operator/pull/599

I hope that the client gives us back the same error ...