ShuangMen opened this issue 3 years ago
Two issues were found in the upgrade process above:

1. After step 2, the statefulsets did not upgrade successfully.
2. After step 3, the scheduler's statefulset was updated with incorrect and incomplete information for the container "cc-deployment-updater-cc-deployment-updater", which causes the scheduler pods to fail.
```yaml
...
containers:
- imagePullPolicy: IfNotPresent
  name: cc-deployment-updater-cc-deployment-updater
  resources: {}
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
...
- apiVersion: apps/v1
  fieldsType: FieldsV1
  fieldsV1:
    f:spec:
      f:template:
        f:spec:
          f:containers:
            k:{"name":"cc-deployment-updater-cc-deployment-updater"}:
              .: {}
              f:imagePullPolicy: {}
              f:name: {}
              f:resources: {}
              f:terminationMessagePath: {}
              f:terminationMessagePolicy: {}
  manager: kubectl-patch
  operation: Update
  time: "2021-01-07T05:58:09Z"
...
```
With `cc_deployment_updater: false`, no "cc-deployment-updater-cc-deployment-updater" container should be created in the scheduler pod.
https://github.com/cloudfoundry-incubator/kubecf/blob/master/chart/hooks/pre-upgrade/remove-deployment-updater-readiness.sh needs to add a check before patching the scheduler's statefulset.
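A hedged sketch of such a check (the jsonpath query and the helper name are illustrative assumptions, not the merged fix):

```python
# Sketch of the guard (logic only): in the real hook the container list
# would come from something like
#   kubectl get statefulset scheduler -n kubecf \
#     -o jsonpath='{.spec.template.spec.containers[*].name}'
# which is an assumption about how the check could be implemented.
UPDATER = "cc-deployment-updater-cc-deployment-updater"

def should_patch(container_names):
    """Patch the scheduler StatefulSet only if the updater container exists."""
    return UPDATER in container_names

# Simulated: cc_deployment_updater is disabled, so the container is absent.
print(should_patch(["scheduler", "loggregator-agent"]))  # -> False
```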
Also, the cf-operator quarks-statefulset logs show many errors for statefulset updates.
I cannot reproduce this issue; are you sure you updated cf-operator before updating kubecf? 2.7.1 requires cf-operator 7.xx, while 2.6.1 uses cf-operator 6.xx.
Anyway, your patch to check that the cc_deployment_updater job exists before patching looks correct, so I've approved it, but it would be good to understand the root cause of your initial problem.
Thanks for your comments @jandubois. Yes, I did upgrade cf-operator to the matching version before upgrading kubecf.
For the issue

> but from the cf-operator quarks-statefulset logs, there are many errors for statefulsets update

I will do more testing and investigation.
When I tried to reproduce the issue ("from the cf-operator quarks-statefulset logs, there are many errors for statefulsets update"):
I installed cf-operator v6.1.17-0.gec409fd7 and then tried to install kubecf v2.6.1.
No pods were created in kubecf; I checked the cf-operator log file and found the errors below.
```
2021-01-12T03:53:21.328Z ERROR quarks-statefulset-reconciler quarksstatefulset/quarksstatefulset_reconciler.go:147 Could not create StatefulSet for QuarksStatefulSet 'kubecf/database': could not create or update StatefulSet 'database' for QuarksStatefulSet 'kubecf/database': Internal error occurred: failed calling webhook "mutate-statefulsets.quarks.cloudfoundry.org": Post "https://qsts-webhook.cf-operator.svc:443/mutate-statefulsets?timeout=30s": service "qsts-webhook" not found
code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/quarksstatefulset.(*ReconcileQuarksStatefulSet).Reconcile
    /go/pkg/mod/code.cloudfoundry.org/quarks-statefulset@v0.0.1/pkg/kube/controllers/quarksstatefulset/quarksstatefulset_reconciler.go:147
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:244
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.18.9/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.18.9/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.18.9/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.18.9/pkg/util/wait/wait.go:90
2021-01-12T03:53:21.328Z ERROR controller controller/controller.go:246 Reconciler error {"controller": "quarks-statefulset-controller", "name": "database", "namespace": "kubecf", "error": "could not create StatefulSet for QuarksStatefulSet 'kubecf/database': could not create or update StatefulSet 'database' for QuarksStatefulSet 'kubecf/database': Internal error occurred: failed calling webhook \"mutate-statefulsets.quarks.cloudfoundry.org\": Post \"https://qsts-webhook.cf-operator.svc:443/mutate-statefulsets?timeout=30s\": service \"qsts-webhook\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:246
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.18.9/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.18.9/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.18.9/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.18.9/pkg/util/wait/wait.go:90
```
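The error message itself pinpoints the problem: the mutating webhook registration points at a Service (`qsts-webhook` in namespace `cf-operator`) that does not exist. A small sketch extracting that from the log line above (the regex is an assumption about the message format):

```python
# Pull the missing Service name and namespace out of the webhook error.
import re

log = ('Internal error occurred: failed calling webhook '
       '"mutate-statefulsets.quarks.cloudfoundry.org": Post '
       '"https://qsts-webhook.cf-operator.svc:443/mutate-statefulsets'
       '?timeout=30s": service "qsts-webhook" not found')

m = re.search(r'Post "https://([^.]+)\.([^.]+)\.svc[^"]*".*'
              r'service "([^"]+)" not found', log)
if m:
    svc, ns, missing = m.groups()
    print(f'webhook expects Service {missing!r} in namespace {ns!r}')
# -> webhook expects Service 'qsts-webhook' in namespace 'cf-operator'
```

In practice the quick check would be `kubectl get service qsts-webhook -n cf-operator` to confirm whether the webhook Service survived the operator install or upgrade.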
I think we're hitting the same issue with a fresh install of 2.7.1.
Post-deploy, the scheduler has 0/1 pods ready. Running `kubectl describe`
on the statefulset reveals the following:
```
Warning  FailedCreate  12m (x40 over 138m)   statefulset-controller  create Pod scheduler-0 in StatefulSet scheduler failed error: Pod "scheduler-0" is invalid: spec.containers[0].image: Required value
Warning  FailedCreate  4m47s (x17 over 10m)  statefulset-controller  create Pod scheduler-0 in StatefulSet scheduler failed error: Pod "scheduler-0" is invalid: spec.containers[0].image: Required value
```
And indeed, when running `get statefulset scheduler -n kubecf -o yaml`, we can see that the first element in the `containers` array is the following, which is missing an `image` field:
```yaml
- imagePullPolicy: IfNotPresent
  name: cc-deployment-updater-cc-deployment-updater
  resources: {}
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
```
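That fragment fails exactly the apiserver validation quoted in the events, since every container must set `image`. A minimal local sketch of that check (the second container entry is hypothetical, added only for contrast):

```python
# Mirror the apiserver complaint: find container entries with no image.
containers = [
    {"name": "cc-deployment-updater-cc-deployment-updater",
     "imagePullPolicy": "IfNotPresent"},                        # the broken leftover
    {"name": "scheduler", "image": "example/scheduler:2.7.1"},  # hypothetical entry
]
missing = [c["name"] for c in containers if "image" not in c]
print("containers missing image:", missing)
# -> containers missing image: ['cc-deployment-updater-cc-deployment-updater']
```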
As a follow-up to the above: we removed the bad `cc-deployment-updater-cc-deployment-updater`
container, the scheduler StatefulSet deployed its Pod properly, and things started working for us. Checking the ops-file ConfigMaps, I can see an operation for that job:
```json
{
  "type": "remove",
  "path": "/instance_groups/name=scheduler/jobs/name=cc_deployment_updater"
}
```
so I would expect it to be removed, but it appears the removal goes far enough that cf-operator doesn't add any of the interesting bits like the `image`, yet not far enough to actually drop the whole container. This is a non-HA deployment.
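The manual workaround described above boils down to a JSON-patch `remove` of the leftover container entry; a sketch of the equivalent operation (the `kubectl patch` invocation in the comment is an assumed illustration, not the exact command the reporters ran):

```python
# Drop the leftover container the way `kubectl patch --type=json` with
#   [{"op": "remove", "path": "/spec/template/spec/containers/0"}]
# would (that invocation is an assumption for illustration).
spec = {"containers": [
    {"name": "cc-deployment-updater-cc-deployment-updater"},    # leftover, no image
    {"name": "scheduler", "image": "example/scheduler:2.7.1"},  # hypothetical entry
]}

BAD = "cc-deployment-updater-cc-deployment-updater"
spec["containers"] = [c for c in spec["containers"] if c["name"] != BAD]
print([c["name"] for c in spec["containers"]])  # -> ['scheduler']
```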
For the cc_deployment_updater issue, the fix is merged: https://github.com/cloudfoundry-incubator/kubecf/pull/1676. It adds a check before patching the scheduler statefulset.
Ah right. It would be good if a new release with that fix could be cut, because as far as we can see, 2.7.1 doesn't actually fully work on a fresh install. Thanks for getting back to us @ShuangMen 🙂
@andy-paine There should be a new release within the next few days.
Describe the bug

step 1. kubecf v2.6.1 installed in Kubernetes (version 1.19.5_1529)

step 2. run `helm upgrade kubecf` to try to upgrade kubecf to v2.7.1, with `cc_deployment_updater: false` in file `config/jobs.yaml`
After the upgrade, check the statefulsets:
From the cf-operator quarks-statefulset logs, there are many errors for statefulset updates; one of them is like below.
It seems the statefulsets did not upgrade successfully.

step 3. run `helm upgrade kubecf` again and check the pod status:
no "scheduler" pods are there.
`$ k describe statefulset scheduler-z0 -n kubecf` gives the error below;
also check the log of cf-operator-quarks-statefulset.

To Reproduce

Run `helm upgrade` from kubecf v2.6.1 (on Kubernetes 1.19) to kubecf v2.7.1, then run `helm upgrade` again.

Expected behavior

`helm upgrade` succeeds with all pods running well.

Environment