OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0
1.13k stars 578 forks source link

Liberty Server Hang randomly #24651

Closed mwangcaibmcom closed 1 year ago

mwangcaibmcom commented 1 year ago

Describe the bug
Liberty Server hang happened randomly within docker containers which are deployed in k8s cluster in an Integration Pipeline. The javacore dumps were collected to analyze the issue. In one javacore, one deadlock was identified and all operational threads are blocked due to the deadlock

2LKMONINUSE      sys_mon_t:0x00007FCE3C16F258 infl_mon_t: 0x00007FCE3C16F2D8:
3LKMONOBJECT       org/apache/cxf/jaxrs/interceptor/CachedTime@0x00000000FBF1D0C8: Flat locked by "Default Executor-thread-8" (J9VMThread:0x0000000001B4BF00), entry count 1
3LKWAITERQ            Waiting to enter:
3LKWAITER                "Default Executor-thread-1" (J9VMThread:0x00000000006EB200)
3LKWAITER                "Default Executor-thread-3" (J9VMThread:0x0000000000718D00)
3LKWAITER                "Default Executor-thread-17" (J9VMThread:0x0000000002644B00)
3LKWAITER                "Default Executor-thread-19" (J9VMThread:0x0000000000346F00)
3LKWAITER                "Default Executor-thread-20" (J9VMThread:0x0000000000618300)
3LKWAITER                "Default Executor-thread-29" (J9VMThread:0x0000000002645700)
3LKWAITER                "Default Executor-thread-30" (J9VMThread:0x0000000002643F00)
3LKWAITER                "Default Executor-thread-39" (J9VMThread:0x00000000022FF900)
3LKWAITER                "Default Executor-thread-40" (J9VMThread:0x00000000022DAA00)
3LKWAITER                "Default Executor-thread-49" (J9VMThread:0x000000000216DE00)
3LKWAITER                "Default Executor-thread-50" (J9VMThread:0x00000000022FED00)
3LKWAITER                "Default Executor-thread-59" (J9VMThread:0x0000000001B74900)
3LKWAITER                "Default Executor-thread-60" (J9VMThread:0x0000000002178F00)
3LKWAITER                "Default Executor-thread-62" (J9VMThread:0x0000000001B72300)

"Default Executor-thread-8" J9VMThread:0x0000000001B4BF00, omrthread_t:0x00007FCE18012DF0, java/lang/Thread:0x00000000FBF99E78, state:B, prio=5
       (java/lang/Thread getId:0x4C, isDaemon:true)
       com/ibm/ws/classloading/internal/ThreadContextClassLoader(0x0000000086272FF8)
       (native thread ID:0x1A4, native priority:0x5, native policy:UNKNOWN, vmstate:B, vm thread flags:0x00000281)
       (native stack address range from:0x00007FCEA0FF6000, to:0x00007FCEA1036000, size:0x40000)
      CPU usage total: 3.272702139 secs, current category="Application"
Blocked on: java/lang/StringBuffer@0x00000000FBF99F10 Owned by: "Default Executor-thread-1" (J9VMThread:0x00000000006EB200, java/lang/Thread:0x00000000804DA638)
      Heap bytes allocated since last GC cycle=0 (0x0)
      Java callstack:
          at java/lang/StringBuffer.setLength(Bytecode PC:0(Compiled Code))
             (entered lock: java/lang/StringBuffer@0x00000000FBF99F10, entry count: 1)
          at org/apache/cxf/jaxrs/interceptor/CachedTime.updateTime(CachedTime.java:86)
          at org/apache/cxf/jaxrs/interceptor/CachedTime.getTimeAsString(CachedTime.java:134)
             (entered lock: org/apache/cxf/jaxrs/interceptor/CachedTime@0x00000000FBF1D0C8, entry count: 1)
          at org/apache/cxf/jaxrs/interceptor/JAXRSOutInterceptor.setResponseDate(JAXRSOutInterceptor.java:451)
          at org/apache/cxf/jaxrs/interceptor/JAXRSOutInterceptor.prepareResponseHeaders(JAXRSOutInterceptor.java:304)
          at org/apache/cxf/jaxrs/interceptor/JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:171)
          at org/apache/cxf/jaxrs/interceptor/JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:124)
          at org/apache/cxf/jaxrs/interceptor/JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:85)
          at org/apache/cxf/phase/PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
             (entered lock: org/apache/cxf/phase/PhaseInterceptorChain@0x00000000FBF9A0D0, entry count: 1)
          at org/apache/cxf/interceptor/OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
          at org/apache/cxf/phase/PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
             (entered lock: org/apache/cxf/phase/PhaseInterceptorChain@0x00000000FBF9A148, entry count: 1)
          at org/apache/cxf/transport/ChainInitiationObserver.onMessage(ChainInitiationObserver.java:123)
          at org/apache/cxf/transport/http/AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:277)
          at com/ibm/ws/jaxrs20/endpoint/AbstractJaxRsWebEndpoint.invoke(AbstractJaxRsWebEndpoint.java:137)
          at com/ibm/websphere/jaxrs/server/IBMRestServlet.handleRequest(IBMRestServlet.java:146)
          at com/ibm/websphere/jaxrs/server/IBMRestServlet.doDelete(IBMRestServlet.java:131)
          at javax/servlet/http/HttpServlet.service(HttpServlet.java:712)
          at com/ibm/websphere/jaxrs/server/IBMRestServlet.service(IBMRestServlet.java:96)
          at com/ibm/ws/webcontainer/servlet/ServletWrapper.service(ServletWrapper.java:1258)
          at com/ibm/ws/webcontainer/servlet/ServletWrapper.handleRequest(ServletWrapper.java:746)
          at com/ibm/ws/webcontainer/servlet/ServletWrapper.handleRequest(ServletWrapper.java:443)
          at com/ibm/ws/webcontainer/filter/WebAppFilterChain.invokeTarget(WebAppFilterChain.java:193)
          at com/ibm/ws/webcontainer/filter/WebAppFilterChain.doFilter(WebAppFilterChain.java:98)
          at com/ibm/entity/analytics/api/rest/filter/LoggingFilterBase.doFilter(LoggingFilterBase.java:83)
          at com/ibm/ws/webcontainer/filter/FilterInstanceWrapper.doFilter(FilterInstanceWrapper.java:201)
          at com/ibm/ws/webcontainer/filter/WebAppFilterChain.doFilter(WebAppFilterChain.java:91)
          at com/ibm/ws/security/jaspi/JaspiServletFilter.doFilter(JaspiServletFilter.java:56)
          at com/ibm/ws/webcontainer/filter/FilterInstanceWrapper.doFilter(FilterInstanceWrapper.java:201)
          at com/ibm/ws/webcontainer/filter/WebAppFilterChain.doFilter(WebAppFilterChain.java:91)
          at com/ibm/ws/webcontainer/filter/WebAppFilterManager.doFilter(WebAppFilterManager.java:1004)
          at com/ibm/ws/webcontainer/filter/WebAppFilterManager.invokeFilters(WebAppFilterManager.java:1142)
          at com/ibm/ws/webcontainer/webapp/WebApp.handleRequest(WebApp.java:5078)
          at com/ibm/ws/webcontainer/osgi/DynamicVirtualHost$2.handleRequest(DynamicVirtualHost.java:316)
          at com/ibm/ws/webcontainer/WebContainer.handleRequest(WebContainer.java:1007)
          at com/ibm/ws/webcontainer/osgi/DynamicVirtualHost$2.run(DynamicVirtualHost.java:281)
          at com/ibm/ws/http/dispatcher/internal/channel/HttpDispatcherLink$TaskWrapper.run(HttpDispatcherLink.java:1246)
          at com/ibm/ws/http/dispatcher/internal/channel/HttpDispatcherLink.wrapHandlerAndExecute(HttpDispatcherLink.java:468)
          at com/ibm/ws/http/dispatcher/internal/channel/HttpDispatcherLink.ready(HttpDispatcherLink.java:427)
          at com/ibm/ws/http/channel/internal/inbound/HttpInboundLink.handleDiscrimination(HttpInboundLink.java:567)
          at com/ibm/ws/http/channel/internal/inbound/HttpInboundLink.handleNewRequest(HttpInboundLink.java:501)
          at com/ibm/ws/http/channel/internal/inbound/HttpInboundLink.processRequest(HttpInboundLink.java:361)
          at com/ibm/ws/http/channel/internal/inbound/HttpInboundLink.ready(HttpInboundLink.java:328)
          at com/ibm/ws/channel/ssl/internal/SSLConnectionLink.determineNextChannel(SSLConnectionLink.java:1129)
          at com/ibm/ws/channel/ssl/internal/SSLConnectionLink$MyReadCompletedCallback.complete(SSLConnectionLink.java:678)
          at com/ibm/ws/channel/ssl/internal/SSLReadServiceContext$SSLReadCompletedCallback.complete(SSLReadServiceContext.java:1824)
          at com/ibm/ws/tcpchannel/internal/WorkQueueManager.requestComplete(WorkQueueManager.java:514)
          at com/ibm/ws/tcpchannel/internal/WorkQueueManager.attemptIO(WorkQueueManager.java:584(Compiled Code))
          at com/ibm/ws/tcpchannel/internal/WorkQueueManager.workerRun(WorkQueueManager.java:968)
          at com/ibm/ws/tcpchannel/internal/WorkQueueManager$Worker.run(WorkQueueManager.java:1057)
          at com/ibm/ws/threading/internal/ExecutorServiceImpl$RunnableWrapper.run(ExecutorServiceImpl.java:245(Compiled Code))
          at java/util/concurrent/ThreadPoolExecutor.runWorker(Bytecode PC:94(Compiled Code))
          at java/util/concurrent/ThreadPoolExecutor$Worker.run(Bytecode PC:5)
          at java/lang/Thread.run(Bytecode PC:13)

In another javacore generated in different time when server hang, there were thousands locks of JIT-QueueSlotMonitor-<number> lock like below, which is very abnormal

2LKREGMON          JIT-QueueSlotMonitor-3735 lock (0x00007F28A82383F8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3736 lock (0x00007F28A82384A8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3737 lock (0x00007F28A8238558): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3738 lock (0x00007F28A8238608): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3739 lock (0x00007F28A82386B8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3740 lock (0x00007F28A8238768): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3741 lock (0x00007F28A8238818): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3742 lock (0x00007F28A82388C8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3743 lock (0x00007F28A8238978): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3744 lock (0x00007F28A8238A28): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3745 lock (0x00007F28A8238AD8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3746 lock (0x00007F28A8238B88): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3747 lock (0x00007F28A8238C38): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3748 lock (0x00007F28A8238CE8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3749 lock (0x00007F28A8238D98): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3750 lock (0x00007F28A8238E48): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3751 lock (0x00007F28A8238EF8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3752 lock (0x00007F28A8238FA8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3753 lock (0x00007F28A8239058): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3754 lock (0x00007F28A8239108): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3755 lock (0x00007F28A82391B8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3756 lock (0x00007F28A8239268): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3757 lock (0x00007F28A8239318): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3758 lock (0x00007F28A82393C8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3759 lock (0x00007F28A8239478): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3760 lock (0x00007F28A8239528): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3761 lock (0x00007F28A82395D8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3762 lock (0x00007F28A8239688): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3763 lock (0x00007F28A8239738): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3764 lock (0x00007F28A82397E8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3765 lock (0x00007F28A8239898): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3766 lock (0x00007F28A8239948): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3767 lock (0x00007F28A82399F8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3768 lock (0x00007F28A8239AA8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3769 lock (0x00007F28A8239B58): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3770 lock (0x00007F28A8239C08): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3771 lock (0x00007F28A8239CB8): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3772 lock (0x00007F28A8239D68): <unowned>
2LKREGMON          JIT-QueueSlotMonitor-3773 lock (0x00007F28A8239E18): <unowned>

Steps to Reproduce
The issue happened randomly in a secured environments and no steps to reproduce it

Expected behavior
A clear and concise description of what you expected to happen.

Diagnostic information:

Additional context

jhanders34 commented 1 year ago

For the first deadlock reported in the description it relates to Open J9 issue: https://github.com/eclipse-openj9/openj9/issues/14037. In Liberty I think we can work around this and I have started a PR #24660 to provide a workaround.

jhanders34 commented 1 year ago

This issue was originally reported with issue #12704.

mwangcaibmcom commented 1 year ago

@jhanders34 can you please let us know at what versions the issue has been fixed. we are currently at version 21.0.0.10. So, we know to what version we need to upgrade

jhanders34 commented 1 year ago

@jhanders34 can you please let us know at what versions the issue has been fixed. we are currently at version 21.0.0.10. So, we know to what version we need to upgrade

Presently Liberty 23.0.0.3 is the next release that will be made available. There isn't a guarantee that the pull request that I merged in will be included in 23.0.0.3. When the build is chosen for 23.0.0.3, and if the pull request is in that build, this issue will be marked with a release:23003 label if it is fixed in that release. I anticipate that to be what happens, but cannot guarantee or promise it.