eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

Liberty Server hang randomly #16892

Closed mwangcaibmcom closed 1 year ago

mwangcaibmcom commented 1 year ago

Java -version output

IBM Semeru Runtime Certified Edition 11.0.17.0 (build 11.0.17+8)

Summary of problem

The Liberty server hangs randomly in a container running on a k8s/OpenShift cluster in an integration pipeline. REST API calls reached the pod and the pod could process the requests, but there was no response at the client side. After 2 minutes, the API calls timed out with a 504 error, and the pod would be restarted once liveness-probe failures (timeouts) reached the threshold (10 times).

Diagnostic files

Javacore dump files were generated, triggered by the user, to diagnose the issue. We found thousands of JIT locks in the javacore, as shown below. Please reach out to me via the Slack channel if you need the javacore dump file.

JIT-QueueSlotMonitor-3735 lock (0x00007F28A82383F8): <unowned>
JIT-QueueSlotMonitor-3736 lock (0x00007F28A82384A8): <unowned>
JIT-QueueSlotMonitor-3737 lock (0x00007F28A8238558): <unowned>
JIT-QueueSlotMonitor-3738 lock (0x00007F28A8238608): <unowned>
JIT-QueueSlotMonitor-3739 lock (0x00007F28A82386B8): <unowned>
JIT-QueueSlotMonitor-3740 lock (0x00007F28A8238768): <unowned>
JIT-QueueSlotMonitor-3741 lock (0x00007F28A8238818): <unowned>
JIT-QueueSlotMonitor-3742 lock (0x00007F28A82388C8): <unowned>
JIT-QueueSlotMonitor-3743 lock (0x00007F28A8238978): <unowned>
JIT-QueueSlotMonitor-3744 lock (0x00007F28A8238A28): <unowned>
JIT-QueueSlotMonitor-3745 lock (0x00007F28A8238AD8): <unowned>
JIT-QueueSlotMonitor-3746 lock (0x00007F28A8238B88): <unowned>
JIT-QueueSlotMonitor-3747 lock (0x00007F28A8238C38): <unowned>
JIT-QueueSlotMonitor-3748 lock (0x00007F28A8238CE8): <unowned>
JIT-QueueSlotMonitor-3749 lock (0x00007F28A8238D98): <unowned>
JIT-QueueSlotMonitor-3750 lock (0x00007F28A8238E48): <unowned>
JIT-QueueSlotMonitor-3751 lock (0x00007F28A8238EF8): <unowned>
JIT-QueueSlotMonitor-3752 lock (0x00007F28A8238FA8): <unowned>
JIT-QueueSlotMonitor-3753 lock (0x00007F28A8239058): <unowned>
JIT-QueueSlotMonitor-3754 lock (0x00007F28A8239108): <unowned>
JIT-QueueSlotMonitor-3755 lock (0x00007F28A82391B8): <unowned>
JIT-QueueSlotMonitor-3756 lock (0x00007F28A8239268): <unowned>
JIT-QueueSlotMonitor-3757 lock (0x00007F28A8239318): <unowned>
JIT-QueueSlotMonitor-3758 lock (0x00007F28A82393C8): <unowned>
JIT-QueueSlotMonitor-3759 lock (0x00007F28A8239478): <unowned>
JIT-QueueSlotMonitor-3760 lock (0x00007F28A8239528): <unowned>
JIT-QueueSlotMonitor-3761 lock (0x00007F28A82395D8): <unowned>
JIT-QueueSlotMonitor-3762 lock (0x00007F28A8239688): <unowned>
JIT-QueueSlotMonitor-3763 lock (0x00007F28A8239738): <unowned>
JIT-QueueSlotMonitor-3764 lock (0x00007F28A82397E8): <unowned>
JIT-QueueSlotMonitor-3765 lock (0x00007F28A8239898): <unowned>
JIT-QueueSlotMonitor-3766 lock (0x00007F28A8239948): <unowned>
JIT-QueueSlotMonitor-3767 lock (0x00007F28A82399F8): <unowned>
JIT-QueueSlotMonitor-3768 lock (0x00007F28A8239AA8): <unowned>
JIT-QueueSlotMonitor-3769 lock (0x00007F28A8239B58): <unowned>
JIT-QueueSlotMonitor-3770 lock (0x00007F28A8239C08): <unowned>
JIT-QueueSlotMonitor-3771 lock (0x00007F28A8239CB8): <unowned>
JIT-QueueSlotMonitor-3772 lock (0x00007F28A8239D68): <unowned>
JIT-QueueSlotMonitor-3773 lock (0x00007F28A8239E18): <unowned>
pshipton commented 1 year ago

@0xdaryl pls take a look.

0xdaryl commented 1 year ago

I reached out via Slack to the originator to ask for the javacores.

mwangcaibmcom commented 1 year ago

The javacore has been provided.

0xdaryl commented 1 year ago

There is nothing unusual about the number of JIT-QueueSlotMonitors (1883) given that there are 7 compilation threads. These are used to manage the queuing of methods for JIT compilation, and from the javacore, JIT compilation doesn't appear to be a problem here.

Each javacore contains a backtrace of all thread stacks. @mwangcaibmcom , can you review the application thread backtraces to see if they make sense to you? Some threads are waiting on monitors (and the javacore will tell you which ones) so it might be helpful for you to review those stacks too in the context of your application to ensure those aren't problematic.

Did this problem appear because you recently moved up to a new JVM?

mwangcaibmcom commented 1 year ago

@0xdaryl thanks for your response.

Did this problem appear because you recently moved up to a new JVM?

We did upgrade Java from OpenJDK 8 to IBM Semeru v11 half a year ago. I did not notice this issue happening before the upgrade.

I reviewed all application threads using the thread pool (thread names starting with Default Executor-thread). I only saw one thread waiting on a lock, as shown below; the others were waiting on monitors in the executor pool or other features.

[2023-03-07T04:13:10.353Z] 2LKREGMON          Thread public flags mutex lock (0x00007F28F001F798): <unowned>
[2023-03-07T04:13:10.353Z] 3LKNOTIFYQ            Waiting to be notified:
[2023-03-07T04:13:10.354Z] 3LKWAITNOTIFY            "Default Executor-thread-4" (J9VMThread:0x000000000071D800)

I reviewed all the javacore dumps; only this one has a lot of JIT-QueueSlotMonitor locks. All the javacores show 8 compilation threads like the one below, with only JIT Compilation Thread-000 not marked Suspended. However, only this javacore has thousands of these locks. It could be normal, as you stated.

[2023-03-07T04:13:10.436Z] 3XMTHREADINFO      "JIT Compilation Thread-006 Suspended" J9VMThread:0x000000000002F900, omrthread_t:0x00007F2998147D20, java/lang/Thread:0x0000000080491220, state:R, prio=10
[2023-03-07T04:13:10.436Z] 3XMJAVALTHREAD            (java/lang/Thread getId:0x9, isDaemon:true)
[2023-03-07T04:13:10.436Z] 3XMJAVALTHRCCL            jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000804663D8)
[2023-03-07T04:13:10.436Z] 3XMTHREADINFO1            (native thread ID:0xA8, native priority:0xB, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000081)
[2023-03-07T04:13:10.436Z] 3XMTHREADINFO2            (native stack address range from:0x00007F297C82E000, to:0x00007F297C92E000, size:0x100000)
[2023-03-07T04:13:10.436Z] 3XMCPUTIME               CPU usage total: 0.001356501 secs, current category="JIT"
[2023-03-07T04:13:10.436Z] 3XMHEAPALLOC             Heap bytes allocated since last GC cycle=0 (0x0)
[2023-03-07T04:13:10.436Z] 3XMTHREADINFO3           No Java callstack associated with this thread
[2023-03-07T04:13:10.436Z] 3XMTHREADINFO3           Native callstack:
[2023-03-07T04:13:10.436Z] 4XENATIVESTACK                (0x00007F299C393002 [libj9prt29.so+0x27002])
[2023-03-07T04:13:10.436Z] 4XENATIVESTACK                (0x00007F299C397863 [libj9prt29.so+0x2b863])
[2023-03-07T04:13:10.436Z] 4XENATIVESTACK                (0x00007F299C3934FA [libj9prt29.so+0x274fa])
[2023-03-07T04:13:10.436Z] 4XENATIVESTACK                (0x00007F299C397863 [libj9prt29.so+0x2b863])
[2023-03-07T04:13:10.436Z] 4XENATIVESTACK                (0x00007F299C392E94 [libj9prt29.so+0x26e94])
[2023-03-07T04:13:10.436Z] 4XENATIVESTACK                (0x00007F299C394573 [libj9prt29.so+0x28573])
[2023-03-07T04:13:10.436Z] 4XENATIVESTACK                (0x00007F299E38FCF0 [libpthread.so.0+0x12cf0])
[2023-03-07T04:13:10.436Z] 4XENATIVESTACK               pthread_cond_wait+0x1fa (0x00007F299E38B45A [libpthread.so.0+0xe45a])
[2023-03-07T04:13:10.436Z] 4XENATIVESTACK               omrthread_monitor_wait+0x287 (0x00007F299C1599C7 [libj9thr29.so+0x89c7])
[2023-03-07T04:13:10.436Z] 4XENATIVESTACK                (0x00007F299682B05E [libj9jit29.so+0x13c05e])
[2023-03-07T04:13:10.436Z] 4XENATIVESTACK                (0x00007F299683C258 [libj9jit29.so+0x14d258])
[2023-03-07T04:13:10.437Z] 4XENATIVESTACK                (0x00007F299683C2F2 [libj9jit29.so+0x14d2f2])
[2023-03-07T04:13:10.437Z] 4XENATIVESTACK                (0x00007F299C397863 [libj9prt29.so+0x2b863])
[2023-03-07T04:13:10.437Z] 4XENATIVESTACK                (0x00007F299683C722 [libj9jit29.so+0x14d722])
[2023-03-07T04:13:10.437Z] 4XENATIVESTACK                (0x00007F299C15F4F6 [libj9thr29.so+0xe4f6])
[2023-03-07T04:13:10.437Z] 4XENATIVESTACK                (0x00007F299E3851CA [libpthread.so.0+0x81ca])
[2023-03-07T04:13:10.437Z] 4XENATIVESTACK               clone+0x43 (0x00007F299DBDBE73 [libc.so.6+0x39e73])

FYI, I also opened an issue with the Liberty Server team for a deadlock (https://github.com/OpenLiberty/open-liberty/issues/24651). I also found two technotes, which might not be related to this issue: https://www.ibm.com/support/pages/apar/IJ07063 and https://www.ibm.com/support/pages/apar/IV82849. I tried the workaround from the first technote 5 days ago, and we have not seen the timeout issue since; this means the server has not hung after applying the workaround.

mwangcaibmcom commented 1 year ago

@0xdaryl any update?

0xdaryl commented 1 year ago

A few questions:

The four javacore files were taken within 6 seconds of each other. Is it possible to spread them out a bit more (e.g., 10 seconds apart) to get a longer perspective on the hang?

What is the frequency of the hang? Does it hang every time? If not, how often?

Does this problem appear with the workaround for the Liberty deadlock in (24651)?

When you say you applied the workaround from IJ07063 and did not see a problem, can you confirm that the workaround you applied was -Xjit:disableLockReservation ?
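For reference, a JIT option like this is typically applied to a Liberty server via its jvm.options file, one option per line. A minimal sketch (the path and server name below are conventional placeholders; adjust for your installation):

```
# <wlp-install-dir>/usr/servers/<serverName>/jvm.options
# Workaround discussed in APAR IJ07063: disable JIT lock reservation
-Xjit:disableLockReservation
```

The server must be restarted for jvm.options changes to take effect.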

0xdaryl commented 1 year ago

Also, a core file (system dump) along with a matching javacore would be useful. Can you generate a system dump + javacore? You may need to use com.ibm.jvm.Dump.systemDumpToFile() if you were generating these dumps programmatically.
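If the dumps are triggered externally rather than programmatically, one way to get a matching pair is through OpenJ9's -Xdump dump agents. A sketch, assuming the default agents are still in place (by default the "user" event, i.e. SIGQUIT / kill -3, already produces a javacore, so adding a system agent on the same event yields both dumps from the same moment):

```
# Add to jvm.options or the java command line:
# write a system dump (core file) on the user event, alongside the default javacore
-Xdump:system:events=user
```

Then send the signal to the Java process inside the container, e.g. `kill -3 <pid>`. Note that system dumps from a container can be large, and core-file generation may need to be enabled in the container/pod configuration (ulimits, core pattern).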

mwangcaibmcom commented 1 year ago

I have not seen the issue happen since applying the -Xjit:disableLockReservation workaround. The Liberty deadlock issue has been fixed in version 23.0.0.3, and we plan to upgrade to that version. After that, I will remove the workaround and monitor whether the server hang recurs.

As we plan to upgrade, I will create a new issue if the server hang happens with the new version of Liberty. I will close this issue now.