Describe the bug
When upgrading from EnMasse 0.26.6->0.28.1 where the deployment has existing addressspaces/addresses, we are seeing an issue where one router pod within a statefulset is left running the old image, while all the others are successfully upgraded to the new image/config.
The pattern appears to be that if a router stateful set has n pods, n-1 pods are always upgraded successfully, one by one, in rolling update fashion, but the zeroth pod is for some reason left un-upgraded.
$ oc version
oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth
Server https://192.168.64.29:8443
kubernetes v1.11.0+d4cacc0
To Reproduce
Steps to reproduce the behavior:
Install 0.26.6
Create an addressspace using the medium addressspaceplan, then create an address (see the sketch after these steps).
Let the system stabilise. There will be one router deployment with two pods.
Update to 0.28.1 by applying the bundle.
Observe that the -1 router pod is upgraded; the -0 pod sticks at the old router image.
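For reference, the addressspace/address can be created with something like the following. This is only a rough sketch; the apiVersion, resource names and plan names shown here are assumptions and may differ depending on the plans installed in your environment:

cat <<EOF | oc create -f -
apiVersion: enmasse.io/v1beta1
kind: AddressSpace
metadata:
  name: myspace
spec:
  type: standard
  plan: standard-medium
EOF

cat <<EOF | oc create -f -
apiVersion: enmasse.io/v1beta1
kind: Address
metadata:
  name: myspace.myqueue
spec:
  address: myqueue
  type: queue
  plan: standard-small-queue
EOF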
I've observed the same pattern for configurations demanding larger numbers of routers. In each case it is always a single router that is left un-upgraded, and it is always the zeroth. When the configuration requires only one router, the problem doesn't manifest.
Expected behavior
The rolling update should restart each router pod in sequence, gradually bringing all of them up to the 0.28.1 image/config.
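The progress of the rolling update can be watched with something like the following (the stateful set name here is a placeholder; it can be found with oc get statefulset --selector=capability=router):

oc rollout status statefulset/<router-statefulset>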
Additional information
Whilst the situation persists, the agent and the un-upgraded router report errors as the new agent fails to manage the old router:
2019-05-31 03:59:59.113879 +0000 AGENT (error) Error performing CREATE: Unknown attribute 'healthz' for 'org.apache.qpid.dispatch.listener'
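To confirm which revision each router pod is actually running, comparing the stateful set's revisions against the controller-revision-hash label on the pods should work (a sketch, reusing the capability=router selector from the workaround below):

oc get statefulset --selector=capability=router -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.currentRevision}{" -> "}{.status.updateRevision}{"\n"}{end}'
oc get pod --selector=capability=router -L controller-revision-hash

If the -0 pod is stuck, it would be expected to still carry the old revision hash while the stateful set's updateRevision has moved on.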
Workaround
The affected pods can be identified and restarted. Look for router pods running the old image:
oc get pod -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{range .spec.containers}}{{"\t"}}{{.image}}{{"\n"}}{{end}}{{"\n"}}{{end}}' --selector=capability=router
oc delete pod <name>
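Alternatively, assuming there is only a single router stateful set, the stale pods can be deleted in one go by selecting on the revision hash (a sketch):

UPDATED=$(oc get statefulset --selector=capability=router -o jsonpath='{.items[0].status.updateRevision}')
oc delete pod --selector=capability=router,controller-revision-hash!=$UPDATED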