arquillian / arquillian-cube

Control (docker, kubernetes, openshift) containers in your tests with ease!
http://arquillian.org/arquillian-cube/
120 stars 98 forks source link

Deleting environment fails with: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on #1050

Open goldmann opened 6 years ago

goldmann commented 6 years ago
Issue Overview

I see some (random) errors like this:

09:30:34  [ERROR] org.jboss.test.arquillian.ce.sso.SsoEap64SecureDeploymentsTest  Time elapsed: 1.996 s  <<< ERROR!
09:30:34  io.fabric8.kubernetes.clnt.v3_1.KubernetesClientException: Failure executing: PATCH at: https://MASTER:8443/api/v1/namespaces/test-sso-1dddeeef/replicationcontrollers/eap-app-1. Message: Forbidden! User arq-sso/MASTER:8443 doesn't have permission. replicationcontrollers "eap-app-1" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: User "arq-sso" cannot update deploymentconfigs/finalizers.apps.openshift.io in project "test-sso-1dddeeef", <nil>.
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.OperationSupport.requestFailure(OperationSupport.java:470)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:407)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.OperationSupport.handlePatch(OperationSupport.java:279)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.BaseOperation.handlePatch(BaseOperation.java:783)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.HasMetadataOperation$3.apply(HasMetadataOperation.java:156)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.HasMetadataOperation$3.apply(HasMetadataOperation.java:152)
09:30:34    at io.fabric8.kubernetes.api.model.v3_1.DoneableReplicationController.done(DoneableReplicationController.java:26)
09:30:34    at io.fabric8.kubernetes.api.model.v3_1.DoneableReplicationController.done(DoneableReplicationController.java:5)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.HasMetadataOperation.patch(HasMetadataOperation.java:163)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.internal.RollableScalableResourceOperation.patch(RollableScalableResourceOperation.java:175)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.HasMetadataOperation$1.apply(HasMetadataOperation.java:60)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.HasMetadataOperation$1.apply(HasMetadataOperation.java:50)
09:30:34    at io.fabric8.kubernetes.api.model.v3_1.DoneableReplicationController.done(DoneableReplicationController.java:26)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.internal.ReplicationControllerOperationsImpl.withReplicas(ReplicationControllerOperationsImpl.java:85)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.internal.ReplicationControllerOperationsImpl.withReplicas(ReplicationControllerOperationsImpl.java:42)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.internal.RollableScalableResourceOperation.scale(RollableScalableResourceOperation.java:77)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.internal.RollableScalableResourceOperation.scale(RollableScalableResourceOperation.java:45)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.internal.ScalableReaper.reap(ScalableReaper.java:30)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.BaseOperation.delete(BaseOperation.java:621)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.BaseOperation.delete(BaseOperation.java:70)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.BaseOperation.delete(BaseOperation.java:662)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.BaseOperation.deleteList(BaseOperation.java:696)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.BaseOperation.delete(BaseOperation.java:636)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.BaseOperation.delete(BaseOperation.java:70)
09:30:34    at io.fabric8.openshift.clnt.v3_1.dsl.internal.DeploymentConfigOperationsImpl$DeploymentConfigReaper.reap(DeploymentConfigOperationsImpl.java:133)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.BaseOperation.delete(BaseOperation.java:621)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.BaseOperation.delete(BaseOperation.java:70)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.BaseOperation.delete(BaseOperation.java:662)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.base.BaseOperation.delete(BaseOperation.java:650)
09:30:34    at io.fabric8.openshift.clnt.v3_1.handlers.DeploymentConfigHandler.delete(DeploymentConfigHandler.java:63)
09:30:34    at io.fabric8.openshift.clnt.v3_1.handlers.DeploymentConfigHandler.delete(DeploymentConfigHandler.java:33)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.internal.KubernetesListOperationsImpl.delete(KubernetesListOperationsImpl.java:157)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.internal.KubernetesListOperationsImpl.delete(KubernetesListOperationsImpl.java:149)
09:30:34    at io.fabric8.kubernetes.clnt.v3_1.dsl.internal.KubernetesListOperationsImpl.delete(KubernetesListOperationsImpl.java:47)
09:30:34    at org.arquillian.cube.openshift.impl.fabric8.F8OpenShiftAdapter.deleteTemplate(F8OpenShiftAdapter.java:438)
09:30:34    at org.arquillian.cube.openshift.impl.resources.OpenShiftResourceFactory.deleteTemplates(OpenShiftResourceFactory.java:230)
09:30:34    at org.arquillian.cube.openshift.impl.resources.OpenShiftResourceFactory.deleteEnvironment(OpenShiftResourceFactory.java:248)
09:30:34    at org.arquillian.cube.openshift.impl.CEEnvironmentProcessor.deleteEnvironment(CEEnvironmentProcessor.java:133)
09:30:34    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
09:30:34    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
09:30:34    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
09:30:34    at java.lang.reflect.Method.invoke(Method.java:498)
09:30:34    at org.jboss.arquillian.core.impl.ObserverImpl.invoke(ObserverImpl.java:86)
09:30:34    at org.jboss.arquillian.core.impl.EventContextImpl.invokeObservers(EventContextImpl.java:103)
09:30:34    at org.jboss.arquillian.core.impl.EventContextImpl.proceed(EventContextImpl.java:90)
09:30:34    at org.jboss.arquillian.test.impl.TestContextHandler.createClassContext(TestContextHandler.java:83)
09:30:34    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
09:30:34    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
09:30:34    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
09:30:34    at java.lang.reflect.Method.invoke(Method.java:498)
09:30:34    at org.jboss.arquillian.core.impl.ObserverImpl.invoke(ObserverImpl.java:86)
09:30:34    at org.jboss.arquillian.core.impl.EventContextImpl.proceed(EventContextImpl.java:95)
09:30:34    at org.jboss.arquillian.test.impl.TestContextHandler.createSuiteContext(TestContextHandler.java:69)
09:30:34    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
09:30:34    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
09:30:34    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
09:30:34    at java.lang.reflect.Method.invoke(Method.java:498)
09:30:34    at org.jboss.arquillian.core.impl.ObserverImpl.invoke(ObserverImpl.java:86)
09:30:34    at org.jboss.arquillian.core.impl.EventContextImpl.proceed(EventContextImpl.java:95)
09:30:34    at org.jboss.arquillian.core.impl.ManagerImpl.fire(ManagerImpl.java:133)
09:30:34    at org.jboss.arquillian.core.impl.ManagerImpl.fire(ManagerImpl.java:105)
09:30:34    at org.jboss.arquillian.test.impl.EventTestRunnerAdaptor.afterClass(EventTestRunnerAdaptor.java:95)
09:30:34    at org.jboss.arquillian.junit.Arquillian$3$1.evaluate(Arquillian.java:183)
09:30:34    at org.jboss.arquillian.junit.Arquillian.multiExecute(Arquillian.java:350)
09:30:34    at org.jboss.arquillian.junit.Arquillian.access$200(Arquillian.java:54)
09:30:34    at org.jboss.arquillian.junit.Arquillian$3.evaluate(Arquillian.java:177)
09:30:34    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
09:30:34    at org.jboss.arquillian.junit.Arquillian.run(Arquillian.java:115)
09:30:34    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:369)
09:30:34    at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:275)
09:30:34    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:239)
09:30:34    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:160)
09:30:34    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:373)
09:30:34    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:334)
09:30:34    at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:119)
09:30:34    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:407)

This execution is part of CI run in Jenkins. This is not happening on my laptop. I have the exact same oc version available here and there. The OCP (3.9) environment I use is the same. The user used for the OCP env is the same and the commands executed are the same. Difference is in the host OS (Fedora locally, RHEL on CI) and in the location of these two machines.

My assumption is that tests pass, the error happens only when Cube tries to clean up the environment.

Any clue?

goldmann commented 6 years ago

The only thing that comes to my mind is that maybe the deletion of resources is somewhat too fast? When the failure occurs, both - OCP and test host is located in the same datacenter minimizing the network delays. When I execute it from my local machine, it always passes, but the test execution happens on the other end of the globe.

Additional note is that test failures are pretty much random - sometimes 2, 4 or 6 tests fail of the whol 50+ tests testsuite.

/cc @rcernich

goldmann commented 6 years ago

I played a bit with parameters and I made it work by setting the namespace.cleanup.enabled parameter to false and namespace.destroy.enabled to true. This removes the project created by test suite but does not try to remove any resources in that project upfront. These resource are removed as part of the project removal, so this configuration works for me well.

I'm not sure what to do with this ticket, should I close it or you want to investigate why it actually failed to remove resources? For me it looks like a bug and above can be seen as workaround.

rcernich commented 6 years ago

Is the issue related to the order in which the resources are deleted? It almost seems like the user is getting nuked before it removes everything else.

Regarding random failures, I think there are some things in OpenShift/k8s that occur asynchronously. If we could add some blocking logic so we wait until things are complete, that might help eliminate these random types of failures. (For example, routes are updated asynchronously and we often have issues where tests fail because the route hasn't been updated in haproxy.)

mhofstetter commented 6 years ago

i'm facing the same problem after using arquillian cube openshift with an openshift cluster in version 3.7. (didn't had the problem in 3.6)

after some debugging i think the problem is related to a race condition issue in the fabric8 kubernetes client.

After deleting (non-cascaded) the DeploymentConfig the DeploymentConfigReaper tries to delete all managed ReplicationControllers of the deleted DeploymentConfig.

The newly created ReplicationControllerOperationsImpl then performs a deletion with an implicit scaleDownToZero. This by loading the current state of the ReplicationController at the beginning of the operation (T1), performing the operation on the local stored object and than evaluates the patch. To evaluate the patch - it loads the current state of the ReplicationController again (T2).

This gets problematically when kubernetes itself modifies the related ReplicationController after the deletion of the DeploymentConfig too, by removing the 'metadata/ownerReferences' to the deleted DeploymentConfig.

If this internal modification falls between T1 and T2 the fabric8 client tries to re-add the ownerReference during the patching of the 'scaleDown' - as is detects the diff.

[
{"op":"add",
"path":"/metadata/ownerReferences",
"value":[
    {
        "apiVersion":"apps.openshift.io/v1",
        "kind":"DeploymentConfig",
        "blockOwnerDeletion":true,
        "controller":true,
        "name":"test",
        "uid":"dfa8774a-5375-11e8-8b90-0050569bf613"
    }
    ]
}
]

in conjunction with the fact, that kubernetes since 1.7 checks the ownerReference with the admission controller, this results in the error.

so maybe it's better to exclude the ownerReferences from the patch-set as it is also recommended in the kubernetes documentation.

If an object’s ownerReferences field is set by a controller (such as Deployment or ReplicaSet), blockOwnerDeletion is set automatically and you do not need to manually modify this field.

lordofthejars commented 6 years ago

So this is a f8 kubernetes client issue that should be controlled there? Maybe we could open an issue there so they are aware.

lburgazzoli commented 6 years ago

I've opened this an issue upstream as I'm also experiencing the same behaviour

gtully commented 6 years ago

looks like there is a v4.0.4 with a fix. https://github.com/fabric8io/kubernetes-client/commit/05f5c4f91ca5691691a56fddd685486e7ec644b6