EnMasseProject / enmasse

EnMasse - Self-service messaging on Kubernetes and OpenShift
https://enmasseproject.github.io
Apache License 2.0

Router pods left un-upgraded after 0.26.6->0.28.1 #2873

Closed: k-wall closed this issue 5 years ago

k-wall commented 5 years ago

Describe the bug
When upgrading from EnMasse 0.26.6->0.28.1 where the deployment has existing addressspaces/addresses, we are seeing an issue where one router pod within a StatefulSet is left running the old image, while all the others are successfully upgraded to the new image/config.

The pattern appears to be that if a router StatefulSet has n pods, n-1 pods are always upgraded successfully, one by one in rolling-update fashion, but the zeroth pod is for some reason left un-upgraded.
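This symptom, where every pod except ordinal 0 is updated, is what a StatefulSet RollingUpdate strategy with a partition of 1 would produce, since only pods with an ordinal greater than or equal to the partition are updated. A way to check whether that applies here (assuming the capability=router selector is also set on the StatefulSets, which it may not be):

oc get statefulset --selector=capability=router \
  -o jsonpath='{range .items[*]}{.metadata.name}{" partition="}{.spec.updateStrategy.rollingUpdate.partition}{" updated="}{.status.updatedReplicas}{"/"}{.status.replicas}{"\n"}{end}'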

$ oc version
oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth

Server https://192.168.64.29:8443
kubernetes v1.11.0+d4cacc0

To Reproduce
Steps to reproduce the behavior:

  1. Install 0.26.6
  2. Create an addressspace using the medium addressspaceplan, then create an address.
  3. Let the system stabilise. There will be one router deployment with two pods.
  4. Update to 0.28.1 by applying the bundle.
  5. Observe that the -1 router pod is upgraded but the -0 pod sticks at the old router image (see the image check below).
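
To confirm which image each router pod is running after step 5, a quick check (the capability=router selector is the one used in the workaround below):

oc get pods --selector=capability=router \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[*].image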

I've observed the same pattern for configurations demanding larger numbers of routers. In each case it is always a single router that is left un-upgraded, and it is always the zeroth. When the configuration requires only one router, the problem doesn't manifest.

Expected behavior
The rolling update should, in sequence, restart each router pod, gradually bringing all of them up to the 0.28.1 image/config.
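To observe that behaviour while the upgrade runs, the rollout can be watched directly (<name> is the router StatefulSet, as in the workaround below):

# Blocks until every pod in the StatefulSet is on the new revision.
oc rollout status statefulset/<name>
# Alternatively, watch the router pods restart one by one:
oc get pods --selector=capability=router -w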

Additional information
Whilst the situation persists, the agent and the un-upgraded router report errors as the new agent fails to manage the old router.

2019-05-31 03:59:59.113879 +0000 AGENT (error) Error performing CREATE: Unknown attribute 'healthz' for 'org.apache.qpid.dispatch.listener'
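
The message suggests the new agent is pushing a listener configuration that includes a 'healthz' attribute the old router's management schema does not recognise. One way to inspect the listeners the old router actually exposes, assuming qdmanage is available inside the router container (<name> is the stuck pod, e.g. the -0 pod):

oc exec <name> -- qdmanage query --type=org.apache.qpid.dispatch.listener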

Workaround

The affected pods can be identified and restarted. Look for router pods running the old image:

oc get pod -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{range .spec.containers}}{{"\t"}}{{.image}}{{"\n"}}{{end}}{{"\n"}}{{end}}' --selector=capability=router
oc delete pod <name>
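
The same workaround can be scripted so that only pods still on the old image are deleted. A sketch, where OLD_IMAGE is a placeholder for the full 0.26.6 router image reference:

# Delete every router pod whose container still runs the old image; the
# StatefulSet controller recreates each deleted pod.
for p in $(oc get pod --selector=capability=router \
    -o go-template='{{range .items}}{{$n := .metadata.name}}{{range .spec.containers}}{{if eq .image "OLD_IMAGE"}}{{$n}}{{"\n"}}{{end}}{{end}}{{end}}'); do
  oc delete pod "$p"
done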

k-wall commented 5 years ago

This report seems similar: https://stackoverflow.com/questions/39298433/kubernetes-deployment-fails-to-perform-rolling-update-when-image-tag-changes

lulf commented 5 years ago

Should be fixed in #3256