camunda / camunda-bpm-platform

Flexible framework for workflow and decision automation with BPMN and DMN. Integration with Quarkus, Spring, Spring Boot, CDI.
https://camunda.com/
Apache License 2.0
4.11k stars 1.55k forks source link

7.20 ee main pipeline EE-engine-IT-weblogic-14R1-managed-domain-postgresql-142 flaky #3802

Closed tasso94 closed 11 months ago

tasso94 commented 1 year ago

Acceptance Criteria (Required on creation)

7.20 ee main pipeline EE-engine-IT-weblogic-14R1-managed-domain-postgresql-142 not flaky anymore.

Hints

testMultipleInvocations – org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest ``` java.lang.IllegalStateException: Error launching test org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest public void org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest.testMultipleInvocations() Caused by: java.lang.IllegalStateException: Error launching request at http://10.128.41.7:27001/service/ArquillianServletRunner?outputMode=serializedObject&className=org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest&methodName=testMultipleInvocations. No result returned ```
testInvokeBean – org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest ``` java.lang.IllegalStateException: Error launching test org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest public void org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest.testInvokeBean() throws java.lang.Exception Caused by: java.lang.IllegalStateException: Error launching request at http://10.128.41.7:27001/test/ArquillianServletRunner?outputMode=serializedObject&className=org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest&methodName=testInvokeBean. No result returned ```

Links

### Pull Requests
- [ ] https://github.com/camunda/infra-core/pull/6839

Dev2QA handover

tasso94 commented 1 year ago

Hi @yanavasileva,

Could you please look at this flaky stage according to our round-robin team agreement on flaky CI stages?

It seems like there is an Engine IT ArquillianServletRunner IllegalStateException thrown.

Best, Tassilo

yanavasileva commented 12 months ago

In traces (https://ci-pipeline.cambpm.camunda.cloud/job/7.21/job/cambpm-ee/job/cambpm-ee-main/job/master/22/):

<Oct 20, 2023, 7:16:15,641 PM Central European Summer Time> <[ACTIVE] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <1697822175641> <[severity-value: 64] [rid: 0] > <Initiating task for service : [Deployer:149026]deploy application service on managed_server_0..>

...

<Oct 20, 2023, 7:16:40,235 PM Central European Summer Time> <[ACTIVE] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1697822200235> <[severity-value: 64] [rid: 0] > <Successfully completed deployment task: [Deployer:149026]deploy application test on managed_server_0..>

<Oct 20, 2023, 7:16:58,548 PM Central European Summer Time> <[ACTIVE] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1697822218548> <[severity-value: 64] [rid: 0] > <The server "managed_server_0" disconnected from this server.>

<Oct 20, 2023, 7:16:58,549 PM Central European Summer Time> <[ACTIVE] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1697822218549> <[severity-value: 32] [rid: 0] > <JMX Resiliency Activity Server=managed_server_0 :

Received a DISCONNECT EVENT

<Oct 20, 2023, 7:16:59,893 PM Central European Summer Time> <[ACTIVE] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <1697822219893> <[severity-value: 64] [rid: 0] > <Initiating task for test : [Deployer:149026]remove application test on managed_server_0..>

<Oct 20, 2023, 7:16:59,927 PM Central European Summer Time> <> <> <1697822219927> <[severity-value: 8] [rid: 0:1] > <Agreement 'managed_server_0': Error Transmitting Change*615- connection closed>

<Oct 20, 2023, 7:16:59,961 PM Central European Summer Time> <[STANDBY] ExecuteThread: '12' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1697822219961> <[severity-value: 64] [rid: 0] > <Deferred deployment task: [Deployer:149026]remove application test on managed_server_0.>

<Oct 20, 2023, 7:17:02,557 PM Central European Summer Time> <[ACTIVE] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1697822222557> <[severity-value: 32] [rid: 0] > <JMX Resiliency Activity Server=managed_server_0 : Failed to reconnect after 3 attempts, disconnecting >

<Oct 20, 2023, 7:17:02,558 PM Central European Summer Time> <[ACTIVE] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1697822222558> <[severity-value: 32] [rid: 0] >

<Oct 20, 2023, 7:17:02,559 PM Central European Summer Time> <[ACTIVE] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1697822222559> <[severity-value: 32] [rid: 0] >

<Oct 20, 2023, 7:17:02,559 PM Central European Summer Time> <[ACTIVE] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1697822222559> <[severity-value: 16] [rid: 0] >

<Oct 20, 2023, 7:17:46,274 PM Central European Summer Time> <> <> <1697822266274> <[severity-value: 64] [rid: 0] > <Self-tuning thread pool contains 0 running threads, 2 idle threads, and 11 standby threads>

<Oct 20, 2023, 7:17:47,308 PM Central European Summer Time> <[ACTIVE] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1697822267308> <[severity-value: 64] [rid: 0] > <The server "managed_server_0" connected to this server.>

<Oct 20, 2023, 7:17:47,308 PM Central European Summer Time> <[STANDBY] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1697822267308> <[severity-value: 32] [rid: 0] > <JMX Resiliency Activity Server=managed_server_0 :

Received a CONNECT EVENT

<Oct 20, 2023, 7:17:47,309 PM Central European Summer Time> <[STANDBY] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1697822267309> <[severity-value: 32] [rid: 0] >

yanavasileva commented 12 months ago
  1. 17:16:37.011 - (19:16:37) Tests starts ci output: [2023-10-20T17:16:37.011Z] [INFO] Running org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest 7:16:38,469 (19:16:38) EJB deployed managed_server_0.log00053: ####<Oct 20, 2023, 7:16:38,469 PM Central European Summer Time> <Info> <EJB> <weblogic-14r1-chrome-112-postgresql-150-j6v1t> <managed_server_0> <[ACTIVE] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <0612e4a8-3e9e-40da-af80-8823c564a77d-000008ad> <1697822198469> <[severity-value: 64] [rid: 0] > <BEA-014021> <The EJB RemoteSLSBean(Application: service, EJBComponent: service.war) has been successfully deployed. The following remote interfaces have been bound into JNDI with the specified JNDI names:>

  2. 7:16:40,195 (19:16:40) pa deployed managed_server_0.log00053:####<Oct 20, 2023, 7:16:40,195 PM Central European Summer Time> <Info> <org.camunda.bpm.container> <weblogic-14r1-chrome-112-postgresql-150-j6v1t> <managed_server_0> <[ACTIVE] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'> <weblogic> <BEA1-098B745CBE7A> <0612e4a8-3e9e-40da-af80-8823c564a77d-000008b3> <1697822200195> <[severity-value: 64] [rid: 0] > <BEA-000000> <ENGINE-08023 Deployment summary for process archive 'pa': org/camunda/bpm/integrationtest/functional/ejb/remote/RemoteSLSBInvocationTest.testInvokeBean.bpmn20.xml

  3. 19:16:40 RemoteSLSBInvocationTest requests in access.log00001

  4. 19:16:43 RemoteSLSBInvocationTest requests finished

    10.128.7.6 - - [20/Oct/2023:19:16:40 +0200] "GET /service/ArquillianServletRunner?outputMode=serializedObject&amp;className=org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest&amp;methodName=testMultipleInvocations&amp;cmd=event HTTP/1.1" 204 0 
    ...
    10.128.7.6 - - [20/Oct/2023:19:16:43 +0200] "GET /service/ArquillianServletRunner?outputMode=serializedObject&amp;className=org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest&amp;methodName=testMultipleInvocations&amp;cmd=event HTTP/1.1" 204 0 
  5. 7:16:46,070 (19:16:46) Starting up the JobExecutor (no shut down) managed_server_0.log00053: ####<Oct 20, 2023, 7:16:46,070 PM Central European Summer Time> <Info> <org.camunda.bpm.engine.jobexecutor> <weblogic-14r1-chrome-112-postgresql-150-j6v1t> <managed_server_0> <[ACTIVE] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <0612e4a8-3e9e-40da-af80-8823c564a77d-000008b4> <1697822206070> <[severity-value: 64] [rid: 0] > <BEA-000000> <ENGINE-14014 Starting up the JobExecutor[org.camunda.bpm.engine.impl.jobexecutor.RuntimeContainerJobExecutor].>

  6. 7:16:46,072 (19:16:46) - 7:16:48,893 (19:16:48) 79 occurrences of Enqueued request belonging to Work Manager in managed_server_0.log00053: ####<Oct 20, 2023, 7:16:48,893 PM Central European Summer Time> <Info> <WorkManager> <weblogic-14r1-chrome-112-postgresql-150-j6v1t> <managed_server_0> <[ACTIVE] ExecuteThread: '9' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <0612e4a8-3e9e-40da-af80-8823c564a77d-0000093d> <1697822208893> <[severity-value: 64] [rid: 0] > <BEA-002917> <Enqueued request belonging to Work Manager wm/camunda-bpm-workmanager, application camunda-bpm-platform-managed-domain is cancelled as the Work Manager is shutdown.>

  7. 7:16:58 (19:16:58) 'managed_server_0' process died somehow managed_server_0.out00039: <Oct 20, 2023 7:16:58 PM Central European Summer Time> <INFO> <NodeManager> <The server 'managed_server_0' with process id null is no longer alive; waiting for the process to die.> AdminServer.log00001 ####<Oct 20, 2023, 7:16:58,548 PM Central European Summer Time> <Info> <Server> <weblogic-14r1-chrome-112-postgresql-150-j6v1t> <AdminServer> <[ACTIVE] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <f6b212ed-1000-4e9c-884c-378a6a9fed8a-00000631> <1697822218548> <[severity-value: 64] [rid: 0] > <BEA-002634> <The server "managed_server_0" disconnected from this server.>

  8. 7:16:58 PM (19:16:58) Starting managed_server_0 managed_server_0.out00040: Oct 20, 2023 7:16:58 PM Central European Summer Time> <INFO> <NodeManager> <Starting WebLogic server with command line: /tmp/jdk-11/bin/java -Dweblogic.Name=managed_server_0

  9. 7:17:02,750 PM (19:17:02) managed_server_0.log00054: ####<Oct 20, 2023, 7:17:02,750 PM Central European Summer Time> <Info> <WebLogicServer> <weblogic-14r1-chrome-112-postgresql-150-j6v1t> <> <Thread-6> <> <> <> <1697822222750> <[severity-value: 64] > <BEA-000377> <Starting WebLogic Server with Java HotSpot(TM) 64-Bit Server VM Version 11+28 from Oracle Corporation.>

  10. 17:20:21.659 - (19:20:21) Tests completes with two errors after (timeout of a test is set to 240 s ) and next test starts ci output: [2023-10-20T17:20:21.659Z] [ERROR] Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: **224.606 s** <<< FAILURE! - in org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSLSBInvocationTest [2023-10-20T17:20:21.660Z] [INFO] Running org.camunda.bpm.integrationtest.functional.ejb.remote.RemoteSFSBInvocationTest

Notes:

yanavasileva commented 12 months ago

I couldn't find the reason for the termination of the process. I couldn't run the tests for managed domain locally. reference The failure occurs every other time on CI for 7.20 and 7.21 in the same RemoteSLSBInvocationTest on managed domain setup after the source code level switch to Java 11.

yanavasileva commented 11 months ago

On latest occurrence of the issue RemoteSFSBInvocationTest fails in a similar way. CI build: https://ci-pipeline.cambpm.camunda.cloud/job/7.20/job/cambpm-ee/job/cambpm-ee-main/job/7.20/22/

yanavasileva commented 11 months ago

Issue occur:

image

Issue doesn't occur:

image

yanavasileva commented 11 months ago

Data with increased memory pods image image

yanavasileva commented 11 months ago

Decision: The issue didn't occur on dev after several runs so I feel confident that increasing the memory is good call. Screenshots from dev: https://github.com/camunda/infra-core/pull/6839#issuecomment-1798200770 In case the issue occur again, I think we need to investigate how to reduce the memory or understand who kills the process.