EnMasseProject / enmasse

EnMasse - Self-service messaging on Kubernetes and OpenShift
https://enmasseproject.github.io
Apache License 2.0
190 stars 87 forks source link

"Failure executing: PATCH... Message: Unable to access invalid index: 20." seen during 0.26->0.28 upgrade #2880

Closed k-wall closed 5 years ago

k-wall commented 5 years ago

The following exception was seen once during the upgrade of an 0.26 install to 0.28.1 with an addressspace with a numnber of address types. I haven't been successful in reproducing the issue a second time. A loop formed with the same upgrade being attempted every few seconds by the controller.

The issue occurs with a standard-unlimited address space with a single standard-small-queue address.

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at: https://172.30.0.1/apis/apps/v1/namespaces/xxxxxx-infra/statefulsets/broker-mdh1f76xc7-mqv7. Message: Unable to access invalid index: 20. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=Unable to access invalid index: 20, metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:478)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:417)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:381)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:344)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handlePatch(OperationSupport.java:280)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handlePatch(BaseOperation.java:811)
        at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$patch$2(HasMetadataOperation.java:149)
        at io.fabric8.kubernetes.api.model.apps.DoneableStatefulSet.done(DoneableStatefulSet.java:27)
        at io.fabric8.kubernetes.api.model.apps.DoneableStatefulSet.done(DoneableStatefulSet.java:6)
        at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.patch(HasMetadataOperation.java:155)
        at io.fabric8.kubernetes.client.dsl.internal.RollableScalableResourceOperation.patch(RollableScalableResourceOperation.java:171)
        at io.fabric8.kubernetes.client.dsl.internal.RollableScalableResourceOperation.patch(RollableScalableResourceOperation.java:46)
        at io.enmasse.controller.standard.KubernetesHelper.apply(KubernetesHelper.java:136)
        at io.enmasse.controller.standard.AddressController.upgradeClusters(AddressController.java:266)
        at io.enmasse.controller.standard.AddressController.onUpdate(AddressController.java:160)
        at io.enmasse.k8s.api.ResourceChecker.doWork(ResourceChecker.java:49)
        at io.enmasse.k8s.api.ResourceChecker.run(ResourceChecker.java:39)
        at java.base/java.lang.Thread.run(Thread.java:834)
k-wall commented 5 years ago

I now understand this error. The upgrader needs to add new environment variables and remove others from the broker statefulset in order to perform the update. The upgrade is done by sending a PATCH. The patch sent is an RFC 6902.

~~I believe client libraries used by EnMasse are incorrectly calculate the RFC 6902 (see attachment). ~~ The problematic part of the document is the following.

   {
      "op":"remove",
      "path":"/spec/template/spec/initContainers/0/env/19",
      "value":{
         "name":"AUTHENTICATION_SERVICE_OAUTH_URL",
         "value":"${AUTHENTICATION_SERVICE_OAUTH_URL}"
      }
   },
   {
      "op":"add",
      "path":"/spec/template/spec/initContainers/0/env/20",
      "value":{
         "name":"HOME",
         "value":"/var/run/artemis/split-1/"
      }
   },

The intent is to remove AUTHENTICATION_SERVICE_OAUTH_URL and add HOME, but it is expressed in an illegal way - causing an array out of bounds condition. Reversing the order of the two operations allows the patch to work.

rfc6902_patch_statefulset.json.txt

curl -k -v -XPATCH -H "User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0" -H "Accept: application/json" -H "Content-Type: application/json-patch+json" -H "Authorization: Bearer As8Qt7H08rOtT1WDvv-VlZ6agJQwxSgdDVnBfR3S1BQ" 'https://127.0.0.1:8443/apis/apps/v1/namespaces/amq-online-infra/statefulsets/broker-wf8m730ie4-sapa' --data-binary "@/tmp/data.json" {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unable to access invalid index: 20","code":500}

vbusch commented 5 years ago

https://github.com/flipkart-incubator/zjsonpatch is an implementation of https://tools.ietf.org/html/rfc6902#section-4.1 In the section regarding Add it says:

An element to add to an existing array - whereupon the supplied
value is added to the array at the indicated location.  Any
elements at or above the specified index are shifted one position
to the right.  The specified index MUST NOT be greater than the
number of elements in the array.  If the "-" character is used to
index the end of the array (see [RFC6901]), this has the effect of
appending the value to the array.

HOME is getting added to the 21st spot. Which makes the index 20 correct in the client side patch.

Still investigating the server side, to see why it isn't applying the patch correctly.
{"op":"move","from":"/spec/initContainers/0/env/14","path":"/spec/initContainers/0/env/12"} could be the issue.

k-wall commented 5 years ago

@vbusch Your suspicion about move being a factor is correct. I believe the issue in a flawed implementation of move on the OpenShift server side. I think anytime a PATCH include a 'move' the results could be incorrect. We may or may not see an exception. I've raised https://github.com/openshift/origin/issues/23206