Closed: tajila closed this 1 year ago
With Loom support now in, are these still an issue?
@fengxue-IS Is this test passing now?
@fengxue-IS any update on this?
This test passed with -Xint, but failed when the JIT is enabled due to an invalid JIT return address.
A similar failure was seen in java/lang/Thread/virtual/stress/PingPong.java on aarch64 and x86:
Failed: main --enable-preview PingPong LTQ 1000000
I am working on triaging this; I may need help from the JIT team with understanding the cause.
I was able to get a core locally by directly compiling the test class java/lang/Thread/virtual/stress/PingPong.java and running it with java --enable-preview PingPong.
The invalid JIT return address comes from the GC walking JIT stacks on the carrier thread after the continuation has yielded. Stack trace:
Thread 29 "ForkJoinPool-1-" hit Breakpoint 1, jitWalkStackFrames (walkState=0x7fffc8753dc0) at /root/openj9-openjdk-jdk19/openj9/runtime/codert_vm/jswalk.c:282
282 walkState->walkThread->privateFlags |= J9_PRIVATE_FLAGS_STACK_CORRUPT;
(gdb) bt
#0 jitWalkStackFrames (walkState=0x7fffc8753dc0) at /root/openj9-openjdk-jdk19/openj9/runtime/codert_vm/jswalk.c:282
#1 0x00007ffff67a46a3 in walkStackFrames (currentThread=<optimized out>, walkState=0x7fffc8753dc0) at /root/openj9-openjdk-jdk19/openj9/runtime/vm/swalk.c:345
#2 0x00007ffff44a516c in GC_VMThreadStackSlotIterator::scanSlots (vmThread=vmThread@entry=0x252200, walkThread=walkThread@entry=0x252200, userData=userData@entry=0x7fffc87540e0,
oSlotIterator=oSlotIterator@entry=0x7ffff449ca70 <stackSlotIterator(J9JavaVM*, J9Object**, void*, J9StackWalkState*, void const*)>, includeStackFrameClassReferences=<optimized out>,
trackVisibleFrameDepth=<optimized out>) at /root/openj9-openjdk-jdk19/openj9/runtime/gc_structs/VMThreadStackSlotIterator.cpp:129
#3 0x00007ffff449c532 in MM_RootScanner::scanOneThread (this=0x7fffc8754150, env=0x7fff70026948, walkThread=0x252200, localData=0x7fffc87540e0)
at /root/openj9-openjdk-jdk19/openj9/runtime/gc_base/RootScanner.cpp:542
#4 0x00007ffff449b23f in MM_RootScanner::scanThreads (this=0x7fffc8754150, env=0x7fff70026948) at /root/openj9-openjdk-jdk19/openj9/runtime/gc_base/RootScanner.cpp:508
#5 0x00007ffff449e172 in MM_RootScanner::scanRoots (this=0x7fffc8754150, env=0x7fff70026948) at /root/openj9-openjdk-jdk19/openj9/runtime/gc_base/RootScanner.cpp:972
#6 0x00007ffff4604eab in MM_MarkingDelegate::scanRoots (this=0x7ffff0047770, env=env@entry=0x7fff70026948, processLists=processLists@entry=true)
at /root/openj9-openjdk-jdk19/openj9/runtime/gc_glue_java/MarkingDelegate.cpp:310
#7 0x00007ffff457ce6c in MM_MarkingScheme::markLiveObjectsRoots (this=<optimized out>, env=env@entry=0x7fff70026948, processLists=processLists@entry=true)
at /root/openj9-openjdk-jdk19/omr/gc/base/MarkingScheme.cpp:363
#8 0x00007ffff462b8d9 in MM_ParallelMarkTask::run (this=0x7fffc8754230, env=0x7fff70026948) at /root/openj9-openjdk-jdk19/omr/gc/base/ParallelMarkTask.cpp:52
#9 0x00007ffff45951be in MM_ParallelDispatcher::run (this=0x7ffff00474c0, env=0x7fff70026948, task=0x7fffc8754230, newThreadCount=<optimized out>)
at /root/openj9-openjdk-jdk19/omr/gc/base/ParallelDispatcher.cpp:588
#10 0x00007ffff45b07d5 in MM_ParallelGlobalGC::markAll (this=this@entry=0x7ffff000b740, env=env@entry=0x7fff70026948, initMarkMap=initMarkMap@entry=true)
at /root/openj9-openjdk-jdk19/omr/gc/base/standard/ParallelGlobalGC.cpp:955
#11 0x00007ffff45b11a0 in MM_ParallelGlobalGC::mainThreadGarbageCollect (this=0x7ffff000b740, env=0x7fff70026948, allocDescription=0x7fffc8754668, initMarkMap=<optimized out>, rebuildMarkBits=<optimized out>)
at /root/openj9-openjdk-jdk19/omr/gc/base/standard/ParallelGlobalGC.cpp:461
#12 0x00007ffff45ad552 in MM_ParallelGlobalGC::internalGarbageCollect (this=<optimized out>, env=0x7fff70026948, subSpace=<optimized out>, allocDescription=0x7fffc8754668)
at /root/openj9-openjdk-jdk19/omr/gc/base/standard/ParallelGlobalGC.cpp:1234
#13 0x00007ffff4571ba6 in MM_Collector::garbageCollect (this=0x7ffff000b740, env=0x7fff70026948, callingSubSpace=0x7ffff0080c40, allocateDescription=0x7fffc8754668, gcCode=<optimized out>,
objectAllocationInterface=0x7fff70029460, baseSubSpace=0x7ffff00808b0, context=0x0) at /root/openj9-openjdk-jdk19/omr/gc/base/Collector.cpp:500
#14 0x00007ffff4626e00 in MM_MemorySubSpaceFlat::allocationRequestFailed (this=0x7ffff0080c40, env=0x7fff70026948, allocateDescription=0x7fffc8754668, allocationType=MM_MemorySubSpace::ALLOCATION_TYPE_TLH,
objectAllocationInterface=0x7fff70029460, baseSubSpace=0x7ffff00808b0, previousSubSpace=0x7ffff00808b0) at /root/openj9-openjdk-jdk19/omr/gc/base/MemorySubSpaceFlat.cpp:154
#15 0x00007ffff4628f18 in MM_MemorySubSpaceGeneric::allocateTLH (this=0x7ffff00808b0, env=0x7fff70026948, allocDescription=0x7fffc8754668, objectAllocationInterface=0x7fff70029460, baseSubSpace=0x0,
previousSubSpace=<optimized out>, shouldCollectOnFailure=true) at /root/openj9-openjdk-jdk19/omr/gc/base/MemorySubSpaceGeneric.cpp:377
#16 0x00007ffff459c9d4 in MM_TLHAllocationSupport::refresh (this=this@entry=0x7fff70029510, env=0x7fff70026948, allocDescription=allocDescription@entry=0x7fffc8754668, shouldCollectOnFailure=<optimized out>)
at /root/openj9-openjdk-jdk19/omr/gc/base/TLHAllocationSupport.cpp:210
#17 0x00007ffff459cbbe in MM_TLHAllocationSupport::allocateFromTLH (this=0x7fff70029510, env=<optimized out>, allocDescription=0x7fffc8754668, shouldCollectOnFailure=<optimized out>)
at /root/openj9-openjdk-jdk19/omr/gc/base/TLHAllocationSupport.cpp:279
#18 0x00007ffff459b51a in MM_TLHAllocationInterface::allocateObject (this=0x7fff70029460, env=0x7fff70026948, allocDescription=0x7fffc8754668, memorySpace=0x7ffff0080f90, shouldCollectOnFailure=true)
at /root/openj9-openjdk-jdk19/omr/gc/base/TLHAllocationInterface.cpp:183
#19 0x00007ffff45a179b in MM_AllocateInitialization::allocateAndInitializeObject (omrVMThread=<optimized out>, this=0x7fffc8754650) at /root/openj9-openjdk-jdk19/omr/gc/base/AllocateInitialization.hpp:201
#20 OMR_GC_AllocateObject (omrVMThread=<optimized out>, allocator=allocator@entry=0x7fffc8754650) at /root/openj9-openjdk-jdk19/omr/gc/startup/omrgcalloc.cpp:39
#21 0x00007ffff44aaee6 in J9AllocateObject (vmThread=0x252200, clazz=0x239900, allocateFlags=<optimized out>) at /root/openj9-openjdk-jdk19/openj9/runtime/gc_modron_startup/mgcalloc.cpp:416
#22 0x00007fffefb6e7bd in slow_jitNewObjectImpl (nonZeroTLH=false, checkClassInit=true, currentThread=0x252200) at /root/openj9-openjdk-jdk19/openj9/runtime/codert_vm/cnathelp.cpp:573
#23 old_slow_jitNewObject (currentThread=0x252200) at /root/openj9-openjdk-jdk19/openj9/runtime/codert_vm/cnathelp.cpp:637
#24 0x00007fffefb83471 in jitNewObject () at /root/openj9-openjdk-jdk19/build/linux-x86_64-server-release/vm/runtime/codert_vm/xnathelp.s:1265
#25 0x0000000000252200 in ?? ()
* 29 Thread 0x7fffc8757700 (LWP 13730) "ForkJoinPool-1-" jitWalkStackFrames (walkState=0x7fffc8753dc0) at /root/openj9-openjdk-jdk19/openj9/runtime/codert_vm/jswalk.c:282
The Java stack of thread 29 in DDR:
!stack 0x0024d700 !j9vmthread 0x0024d700 !j9thread 0x7fff70009788 tid 0x35a1 (13729) // (ForkJoinPool-1-worker-4)
!stack 0x00252200 !j9vmthread 0x00252200 !j9thread 0x7fff70024790 tid 0x35a2 (13730) // (ForkJoinPool-1-worker-5)
!stack 0x00255f00 !j9vmthread 0x00255f00 !j9thread 0x7fff70024d08 tid 0x35a3 (13731) // (ForkJoinPool-1-worker-6)
!stack 0x00259b00 !j9vmthread 0x00259b00 !j9thread 0x7fff5400afc0 tid 0x35a4 (13732) // (ForkJoinPool-1-worker-7)
> !stack 0x252200
<252200> known but unhandled frame type com.ibm.j9ddr.vm29.pointer.U8Pointer @ 0x00000005
FAULT FAULT FAULT FAULT FAULT FAULT FAULT FAULT FAULT FAULT FAULT FAULT FAULT FAULT FAULT FAULT
<252200> !j9method 0x0000000000220338 java/util/concurrent/ForkJoinTask.adapt(Ljava/lang/Runnable;)Ljava/util/concurrent/ForkJoinTask;
<252200> !j9method 0x0000000000094E40 java/lang/VirtualThread.submitRunContinuation(Z)V
<252200> !j9method 0x0000000000094E80 java/lang/VirtualThread.lazySubmitRunContinuation()V
<252200> !j9method 0x0000000000094F20 java/lang/VirtualThread.afterYield()V
<252200> !j9method 0x0000000000094E20 java/lang/VirtualThread.runContinuation()V
Oct 31, 2022 2:55:27 PM com.ibm.j9ddr.vm29.events.DefaultEventListener corruptData
WARNING: CorruptDataException thrown walking stack. walkThread = 0x0000000000252200
com.ibm.j9ddr.AddressedCorruptDataException: Invalid JIT return address
Update
It seems that the PingPong test failure is intermittent. Using TimedGet produced a more consistent failure for triaging.
I noticed that there seem to be two distinct issues with TimedGet:
With the JIT enabled, I would see a failure due to an invalid JIT return address. I will try to see if this can be narrowed down to where the PC is being set on the stack.
When testing with -Xint, there is an issue with a corrupted stack object. This seems to be an issue only with gencon; running the test under -Xgcpolicy:balanced or -Xgcpolicy:optthruput both completed successfully.
Discussed this with @LinHu2016, and he will be investigating this case.
It looks like a similar issue to https://github.com/eclipse-openj9/openj9/issues/16212
@nbhuiyan : please investigate this from a JIT perspective.
The original timeout no longer occurs. Currently, there are two failures associated with this test:
I discussed the JIT invalid return address issue in the PingPong test with @fengxue-IS. Jack reported that running with -Xjit:disableInlining,optLevel=noOpt prevented the crash from happening. I was able to reproduce the test failure locally at around the same failure rate that Jack reported earlier, using the command java --enable-preview -Xmx1g PingPong LTQ 1000000.
-Xjit:disableInlining,optLevel=noOpt preventing the failure from occurring does not necessarily mean that the failure was caused by a bad optimization or something going wrong during inlining. It seems that just adding JIT options significantly reduces the failure rate. I experimented with -Xjit options that are specific to a different platform and would not impact the compilation behaviour on x86, such as -Xjit:TR_DisableZ10, and that was sufficient to reduce the failure rate from 50% to less than 1% (verified using a bash for loop to run the test repeatedly). I needed to obtain verbose logs to get a list of the methods compiled in a failed run, and then try narrowing that down to the smallest set of methods that need to be compiled to reproduce the problem. I had to increase the iteration option of the PingPong test to a very large number, and that seems to bring the failure rate to a reasonable level.
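The repeated-run check mentioned above can be scripted. A minimal sketch; count_failures is a hypothetical helper name, and false is a stand-in for the real test invocation shown in the comment:

```shell
# Hypothetical helper: run a command N times and report the failure rate,
# as used here to estimate the effect of different -Xjit options.
count_failures() {
  n=$1; shift
  fails=0
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1 || fails=$((fails + 1))
    i=$((i + 1))
  done
  echo "$fails/$n runs failed"
}

# Example with a stand-in command; the real invocation would be e.g.
#   count_failures 100 java --enable-preview -Xmx1g PingPong LTQ 1000000
count_failures 5 false
```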
I have obtained a JIT verbose log from a failing run, and it contained some of the methods seen in the stack trace from investigating a core dump. I was trying to narrow down the smallest set of methods that need to be compiled in order for the invalid return address to occur, so that I can focus on what happens during the compilation of those methods. However, using a limit file to try various subsets of methods makes the problem disappear, so I have yet to figure out a way to progress further with that. I have also tried the GC race condition fix in #16290, and still saw the problem happening.
@nbhuiyan Do you think this will be resolved within 2 weeks?
@tajila
Do you think this will be resolved within 2 weeks?
It's possible, depending on what the actual cause is. Examining the JIT logs of the methods leading up to the frame where invalid JIT return address occurs during a failed run does not show anything going wrong there, so I am starting to somewhat suspect that the problem may not be JIT-related.
Some updates with this investigation:
The failing stacks have ForkJoinTask.doExec at the top. With ForkJoinTask.doExec prevented from getting compiled, this time we see a failure elsewhere, and get either a segfault or another JIT invalid return address error.
Segfault case, where we end up in the invalid JIT code cache region:
Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007FC61DCB0F10 Handler2=00007FC61D4B41D0 InaccessibleAddress=0000000000000000
RDI=000000000000001F RSI=0000000000000001 RAX=0000000000000000 RBX=0000000000217B00
RCX=0000000000000020 RDX=00000000003D9BE1 R8=0000000000000001 R9=00000000FFEDEF88
R10=000000000021CA78 R11=00000000C0092938 R12=00000000FFEDEFE8 R13=0000000000092B88
R14=00000000C0000000 R15=0000000000227600
RIP=00007FC5FC4EC0CB GS=0000 FS=0000 RSP=000000000021CB20
EFlags=0000000000010206 CS=0033 RBP=000000000023C000 ERR=0000000000000006
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=0000000000000000
xmm0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm1 00000000003d9be0 (f: 4037600.000000, d: 1.994839e-317)
xmm2 000000000021caf8 (f: 2214648.000000, d: 1.094181e-317)
xmm3 000000000021cc99 (f: 2215065.000000, d: 1.094388e-317)
xmm4 00007fc61de37cca (f: 501447872.000000, d: 6.941073e-310)
xmm5 0000000000000038 (f: 56.000000, d: 2.766768e-322)
xmm6 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm7 00000000c00c7dc0 (f: 3222044160.000000, d: 1.591901e-314)
xmm8 00000000001d35e8 (f: 1914344.000000, d: 9.458116e-318)
xmm9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Compiled_method=unknown (In JIT code segment 00007FC6180C7F78 but no method found)
JIT invalid return address case, where we fail during stackwalk:
*** Invalid JIT return address 00007FD2434CD649 in 00007FD240472440
Looking at how we end up with the invalid JIT return address:
...
#14 0x00007fd264e208f3 in invalidJITReturnAddress (
walkState=<optimized out>)
at /root/builds/openj9-openjdk-jdk19/openj9/runtime/vm/swalk.c:1632
#15 0x00007fd25f4a74b5 in jitWalkStackFrames (
walkState=0x7fd240472440)
at /root/builds/openj9-openjdk-jdk19/openj9/runtime/codert_vm/jswalk.c:283
#16 0x00007fd264e1f6c5 in walkStackFrames (
currentThread=<optimized out>, walkState=0x7fd240472440)
at /root/builds/openj9-openjdk-jdk19/openj9/runtime/vm/swalk.c:384
#17 0x00007fd25ec7f8a1 in jitMethodSampleInterrupt (
vmThread=0x259d00, handlerKey=<optimized out>,
userData=<optimized out>)
at /root/builds/openj9-openjdk-jdk19/openj9/runtime/compiler/control/HookedByTheJit.cpp:1278
#18 0x00007fd264dde507 in dispatchAsyncEvents (
currentThread=0x259d00, asyncEventFlags=1)
at /root/builds/openj9-openjdk-jdk19/openj9/runtime/vm/hookableAsync.c:238
#19 0x00007fd264db5de8 in javaCheckAsyncMessages (
currentThread=0x259d00, throwExceptions=0)
at /root/builds/openj9-openjdk-jdk19/openj9/runtime/vm/AsyncMessageHandler.cpp:61
#20 0x00007fd25f4931e5 in old_slow_jitCheckAsyncMessages (
currentThread=0x259d00)
at /root/builds/openj9-openjdk-jdk19/openj9/runtime/codert_vm/cnathelp.cpp:1431
#21 0x00007fd25f4aa8d6 in jitCheckAsyncMessages ()
at /root/builds/openj9-openjdk-jdk19/build/linux-x86_64-server-release/vm/runtime/codert_vm/xnathelp.s:1780
#22 0x0000000000259d00 in ?? () <--- J9VMThread
#23 0x00007fd240472b50 in ?? () <--- Ptr Obj - {java/util/concurrent/ForkJoinTask$AdaptedRunnableAction} +8
#24 0x00000000ffe52a90 in ?? () <--- Obj - {java/util/concurrent/ForkJoinTask$AdaptedRunnableAction}
#25 0x0000000000217700 in ?? () <--- J9Class - {java/lang/VirtualThread$$Lambda$8/0x0000000000000000}
#26 0x0000000000071c1f in ?? () <--- CP - {java/lang/Module} + 1551
#27 0x0000000000227600 in ?? () <--- J9Class - {java/util/concurrent/ForkJoinTask}
#28 0x0000000000259d00 in ?? () <--- J9VMThread
#29 0x0000000000010000 in ?? () <--- unknown
#30 0x00007fd264f63cca in j9_void_void ()
from /root/builds/openj9-openjdk-jdk19/build/linux-x86_64-server-release/images/jdk/lib/default/libj9vm29.so
#31 0x0000000000000000 in ?? ()
The stack walker is triggered through jitCheckAsyncMessages, and the J9VMThread is 0x259d00. Here is what we have in that J9VMThread at the time of the crash:
J9VMThread at 0x259d00 {
Fields for J9VMThread:
0x0: struct JNINativeInterface_* functions = !jninativeinterface_ 0x00007FD2651A6B60
0x8: struct J9JavaVM* javaVM = !j9javavm 0x00007FD26000FA80
0x10: UDATA* arg0EA = !j9x 0x0000000000259A98
0x18: UDATA* bytecodes = !j9x 0x0000000000000000
0x20: UDATA* sp = !j9x 0x0000000000259A78
0x28: U8* pc = !j9x 0x0000000000000005
0x30: struct J9Method* literals = !j9method 0x0000000000000000
0x38: UDATA jitStackFrameFlags = 0x0000000000000000 (0)
0x40: struct J9Object* jitException = !j9object 0x0000000000000000
0x48: struct J9Object* currentException = !j9object 0x0000000000000000
0x50: UDATA* stackOverflowMark = !j9x 0x0000000000259450
0x58: UDATA* stackOverflowMark2 = !j9x 0x0000000000259450
0x60: U8* heapAlloc = !j9x 0x00000000FFE52AD8 // ""
0x68: U8* heapTop = !j9x 0x00000000FFE53018 // "���"
0x70: IDATA tlhPrefetchFTA = 0x0000000000000000 (0)
0x78: U8* nonZeroHeapAlloc = !j9x 0x0000000000000000
0x80: U8* nonZeroHeapTop = !j9x 0x0000000000000000
0x88: IDATA nonZeroTlhPrefetchFTA = 0x0000000000000000 (0)
0x90: struct J9ThreadMonitor* publicFlagsMutex = !j9threadmonitor 0x00007FD1D8009428
0x98: UDATA publicFlags = 0x0000000000000020 (32)
0xa0: UDATA compressObjectReferences = 0x0000000000000001 (1)
0xa8: struct J9Object* threadObject = !j9object 0x00000000C00A3610 // jdk/internal/misc/CarrierThread
0xb0: void* lowTenureAddress = !j9x 0x00000000C0000000
0xb8: void* highTenureAddress = !j9x 0x00000000C0600000
0xc0: void* heapBaseForActiveCardTable = !j9x 0x0000000000000000
0xc8: void* activeCardTableBase = !j9x 0x00007FD25C93C000
0xd0: UDATA heapSizeForActiveCardTable = 0x0000000000000000 (0)
0xd8: void* heapBaseForBarrierRange0 = !j9x 0x00000000C0000000
0xe0: UDATA heapSizeForBarrierRange0 = 0x0000000000600000 (6291456)
0xe8: UDATA* jniLocalReferences = !j9x 0x0000000000000000
0xf0: UDATA tempSlot = 0x00007FD25F4B4810 (140541518628880)
0xf8: void* jitReturnAddress = !j9x 0x00007FD2434CD649
0x100: void* floatTemp1 = !j9x 0x00007FD2434CD545
0x108: void* floatTemp2 = !j9x 0x00000000FFE90018
0x110: void* floatTemp3 = !j9x 0x00007FD2434BF0D4
0x118: void* floatTemp4 = !j9x 0x0000000000000000
0x120: UDATA returnValue = 0x00007FD280000000 (140542067343360)
0x128: UDATA returnValue2 = 0x0000000000000001 (1)
0x130: UDATA* objectFlagSpinLockAddress = !j9x 0x0000000000000000
0x138: struct J9JavaStack* stackObject = !j9javastack 0x0000000000256C50
0x140: struct J9Thread* osThread = !j9thread 0x00007FD1E4010540
0x148: UDATA inspectionSuspendCount = 0x0000000000000000 (0)
0x150: UDATA inspectorCount = 0x0000000000000000 (0)
0x158: U32 eventFlags = 0x00000010 (16)
0x15c: U32 osrFrameIndex = 0x00000000 (0)
0x160: void* codertTOC = !j9x 0x0000000000000000
0x168: U8* cardTableVirtualStart = !j9x 0x00007FD25C33C000 // "9ddr/vm29/pointer/StructurePointer"
0x170: struct J9Object* stopThrowable = !j9object 0x0000000000000000
0x178: struct J9Object* outOfMemoryError = !j9object 0x00000000C00A3698 // java/lang/OutOfMemoryError
0x180: UDATA* jniCurrentReference = !j9x 0x0000000000000000
0x188: UDATA* jniLimitReference = !j9x 0x0000000000000000
0x190: struct J9VMThread* linkNext = !j9vmthread 0x000000000025DA00
0x198: struct J9VMThread* linkPrevious = !j9vmthread 0x0000000000233500
0x1a0: UDATA privateFlags = 0x0000000000040002 (262146)
0x1a8: UDATA jitTOC = 0x0000000000000000 (0)
0x1b0: UDATA ferReturnType = 0x0000000000000000 (0)
0x1b8: U64 ferReturnValue = 0x0000000000000000 (0)
0x1c0: U64 mgmtBlockedTimeTotal = 0x0000000000000000 (0)
0x1c8: U64 mgmtBlockedTimeStart = 0x0000000000000000 (0)
0x1d0: U64 mgmtWaitedTimeTotal = 0x0000000000000000 (0)
0x1d8: U64 mgmtWaitedTimeStart = 0x0000000000000000 (0)
0x1e0: UDATA jniVMAccessCount = 0x0000000000000000 (0)
0x1e8: UDATA debugEventData1 = 0x0000000000000000 (0)
0x1f0: UDATA debugEventData2 = 0x0000000000000000 (0)
0x1f8: UDATA debugEventData3 = 0x0000000000000000 (0)
0x200: UDATA debugEventData4 = 0x0000000000000000 (0)
0x208: UDATA debugEventData5 = 0x0000000000000000 (0)
0x210: UDATA debugEventData6 = 0x0000000000000000 (0)
0x218: UDATA debugEventData7 = 0x0000000000000000 (0)
0x220: UDATA debugEventData8 = 0x0000000000000000 (0)
0x228: struct J9StackElement* classLoadingStack = !j9stackelement 0x0000000000000000
0x230: UDATA jitTransitionJumpSlot = 0x0000000000000000 (0)
0x238: struct J9ThreadMonitor* gcClassUnloadingMutex = !j9threadmonitor 0x0000000000000000
0x240: struct J9VMThread* gcClassUnloadingThreadPrevious = !j9vmthread 0x0000000000000000
0x248: struct J9VMThread* gcClassUnloadingThreadNext = !j9vmthread 0x0000000000000000
0x250: struct J9StackWalkState* stackWalkState = !j9stackwalkstate 0x000000000025A000
0x258: struct J9VMEntryLocalStorage* entryLocalStorage = !j9vmentrylocalstorage 0x00007FD240472B70
0x260: UDATA gpProtected = 0x0000000000000001 (1)
0x268: struct J9VMGCSublistFragment gcRememberedSet = !j9vmgcsublistfragment 0x0000000000259F68
0x298: struct MM_GCRememberedSetFragment sATBBarrierRememberedSetFragment = !mm_gcrememberedsetfragment 0x0000000000259F98
0x2c8: void* gcTaskListPtr = !j9x 0x0000000000000000
0x2d0: UDATA* dropBP = !j9x 0x0000000000000000
0x2d8: UDATA dropFlags = 0x0000000000000000 (0)
0x2e0: struct J9Pool* monitorEnterRecordPool = !j9pool 0x00007FD1E40116F0
0x2e8: struct J9MonitorEnterRecord* monitorEnterRecords = !j9monitorenterrecord 0x0000000000000000
0x2f0: UDATA* jniArrayCache = !j9x 0x0000000000000000
0x2f8: UDATA* jniArrayCache2 = !j9x 0x0000000000000000
0x300: struct J9StackWalkState inlineStackWalkState = !j9stackwalkstate 0x000000000025A000
0x538: struct J9JITDecompilationInfo* decompilationStack = !j9jitdecompilationinfo 0x0000000000000000
0x540: struct J9ModronThreadLocalHeap allocateThreadLocalHeap = !j9modronthreadlocalheap 0x000000000025A240
0x570: struct J9ModronThreadLocalHeap nonZeroAllocateThreadLocalHeap = !j9modronthreadlocalheap 0x000000000025A270
0x5a0: void* sidecarEvent = !j9x 0x0000000000000000
0x5a8: struct PortlibPTBuffers_struct* ptBuffers = !portlibptbuffers_ 0x0000000000000000
0x5b0: struct J9Object* blockingEnterObject = !j9object 0x0000000000000000
0x5b8: void* gcExtensions = !j9x 0x00007FD1E4012878
0x5c0: void* gpInfo = !j9x 0x0000000000000000
0x5c8: void* jitVMwithThreadInfo = !j9x 0x00007FD1D4003210
0x5d0: U8* profilingBufferEnd = !j9x 0x00007FD1D00031C0 // "!�f"�
0x5d8: U8* profilingBufferCursor = !j9x 0x00007FD1D0002E32 // "���@�"
0x5e0: UDATA* j2iFrame = !j9x 0x0000000000000000
0x5e8: UDATA currentOSStackFree = 0x000000000007CBD0 (510928)
0x5f0: UDATA mgmtBlockedCount = 0x0000000000000000 (0)
0x5f8: UDATA mgmtWaitedCount = 0x0000000000000230 (560)
0x600: UDATA mgmtBlockedStart = 0x0000000000000000 (0)
0x608: UDATA mgmtWaitedStart = 0x0000000000000000 (0)
0x610: UDATA cardTableShiftSize = 0x0000000000000009 (9)
0x618: void* aotVMwithThreadInfo = !j9x 0x0000000000000000
0x620: UDATA asyncEventFlags = 0x0000000000000000 (0)
0x628: struct J9Object* forceEarlyReturnObjectSlot = !j9object 0x0000000000000000
0x630: struct J9MonitorEnterRecord* jniMonitorEnterRecords = !j9monitorenterrecord 0x0000000000000000
0x638: struct J9DLTInformationBlock dltBlock = !j9dltinformationblock 0x000000000025A338
0x7f8: struct J9VMGCSegregatedAllocationCacheEntry* segregatedAllocationCache = !j9vmgcsegregatedallocationcacheentry 0x0000000000000000
0x800: struct J9StackWalkState* activeWalkState = !j9stackwalkstate 0x00007FD240472440
0x808: void* jniCalloutArgs = !j9x 0x0000000000000000
0x810: struct J9VMThread* exclusiveVMAccessQueueNext = !j9vmthread 0x0000000000000000
0x818: struct J9VMThread* exclusiveVMAccessQueuePrevious = !j9vmthread 0x0000000000000000
0x820: struct J9Object* javaLangThreadLocalCache = !j9object 0x0000000000000000
0x828: UDATA jitCountDelta = 0x0000000000000002 (2)
0x830: UDATA maxProfilingCount = 0x0000000000001771 (6001)
0x838: j9objectmonitor_t[] objectMonitorLookupCache = !j9x 0x000000000025A538
0x938: UDATA jniCriticalCopyCount = 0x0000000000000000 (0)
0x940: UDATA jniCriticalDirectCount = 0x0000000000000000 (0)
0x948: struct J9Pool* jniReferenceFrames = !j9pool 0x0000000000000000
0x950: U32 ludclInlineDepth = 0x00000000 (0)
0x954: U32 ludclBPOffset = 0x00000000 (0)
0x958: UDATA* osrJittedFrameCopy = !j9x 0x0000000000000000
0x960: struct J9OSRBuffer* osrBuffer = !j9osrbuffer 0x0000000000000000
0x968: void* osrReturnAddress = !j9x 0x0000000000000000
0x970: void* osrScratchBuffer = !j9x 0x0000000000000000
0x978: void* jitArtifactSearchCache = !j9x 0x00007FD200009B00
0x980: void* jitExceptionHandlerCache = !j9x 0x0000000000000000
0x988: void* jitPrivateData = !j9x 0x00007FD240487EC0
0x990: struct J9Method* jitMethodToBeCompiled = !j9method 0x0000000000000000
0x998: UDATA privateFlags2 = 0x0000000000000000 (0)
0x9a0: struct OMR_VMThread* omrVMThread = !omr_vmthread 0x000000000025A748
0x9a8: void* gpuInfo = !j9x 0x0000000000000000
0x9b0: void* startOfMemoryBlock = !j9x 0x0000000000259CA8
0x9b8: UDATA inNative = 0x0000000000000000 (0)
0x9c0: struct J9JITDecompilationInfo* lastDecompilation = !j9jitdecompilationinfo 0x0000000000000000
0x9c8: struct J9GSParameters gsParameters = !j9gsparameters 0x000000000025A6C8
0x9d8: UDATA readBarrierRangeCheckBase = 0xFFFFFFFFFFFFFFFF (-1)
0x9e0: UDATA readBarrierRangeCheckTop = 0x0000000000000000 (0)
0x9e8: U32 readBarrierRangeCheckBaseCompressed = 0xFFFFFFFF (4294967295)
0x9ec: U32 readBarrierRangeCheckTopCompressed = 0x00000000 (0)
0x9f0: UDATA safePointCount = 0x0000000000000000 (0)
0x9f8: volatile struct J9HashTable* utfCache = !j9hashtable 0x0000000000000000
0xa00: U64* ffiArgs = !j9x 0x0000000000000000
0xa08: UDATA ffiArgCount = 0x0000000000000000 (0)
0xa10: struct J9VMContinuation* currentContinuation = !j9vmcontinuation 0x0000000000000000
0xa18: UDATA continuationPinCount = 0x0000000000000000 (0)
0xa20: UDATA ownedMonitorCount = 0x0000000000000000 (0)
0xa28: UDATA callOutCount = 0x0000000000000000 (0)
0xa30: struct J9Object* carrierThreadObject = !j9object 0x00000000C00A3610 // jdk/internal/misc/CarrierThread
0xa38: struct J9Object* extentLocalCache = !j9object 0x0000000000000000
}
In the J9VMThread, we see the JIT return address set to 0x00007FD2434CD649, which is in a region of the JIT code cache that is invalid. What I am now trying to understand is what sets that field in the J9VMThread to this value, where this value is obtained from, and what it is supposed to be.
While mounting a virtual thread, we copy the J9VMContinuation fields into the J9VMThread and store the carrier thread fields in the J9VMContinuation. While unmounting a virtual thread, we restore the carrier thread state by again swapping values into the J9VMThread and caching the virtual-thread-specific details in the J9VMContinuation. We may need to cache jitReturnAddress and other similar JIT fields in J9VMContinuation, to be swapped during virtual thread mount and unmount. J9VMContinuation contains fields that are thread-specific and cannot be shared between a virtual thread and its carrier thread. See the code associated with the below function for the above operations. Currently, jitReturnAddress is not part of the swap.
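To make the proposed swap concrete, here is a minimal sketch. The struct and field names below are hypothetical stand-ins (the real J9VMThread and J9VMContinuation carry many more fields, and the actual swap happens in the VM's mount/unmount paths); it only illustrates that mount and unmount are the same symmetric exchange, with jitReturnAddress included as proposed:

```c
#include <assert.h>

/* Hypothetical, heavily reduced execution state: a stand-in for the
 * fields exchanged between J9VMThread and J9VMContinuation. */
typedef struct {
    void  *pc;               /* interpreter PC */
    void **sp;               /* stack pointer */
    void  *jitReturnAddress; /* field proposed to join the swap */
} ExecState;

/* Symmetric exchange: calling this on mount loads the continuation's
 * state into the thread and parks the carrier state in the
 * continuation; calling it again on unmount restores the carrier. */
static void swapExecState(ExecState *vmThread, ExecState *continuation)
{
    ExecState tmp = *vmThread;
    *vmThread = *continuation;
    *continuation = tmp;
}
```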
Ah, I see now: jitReturnAddress is used while in JIT helpers (cnathelp.cpp). I do not believe it's relevant to transitions. I've done a search and don't see it used anywhere else.
Even in the helpers, the value would be stored into a resolve frame before any potential calls into java could occur.
You could try swapping it to see if the assertion goes away, but I do not believe it will have any effect.
@nbhuiyan : from the crash where you extracted the info above, can you prove whether the pop uword ptr J9TR_VMThread_jitReturnAddress[_rbp] wrote a correct value for the jitCheckAsyncMessages call? If yes, then we're looking for something else to have stomped on that field.
I think prototyping adding a field to the J9VMContinuation struct and swapping the JIT return address should be easy enough to do. Could you prototype that just to rule that out as a cause?
jitCheckAsyncMessages cannot run Java code, so it is not a swap point for continuations.
I think prototyping adding a field to the J9VMContinuation struct and swapping the JIT return address should be easy enough to do. Could you prototype that just to rule that out as a cause?
@nbhuiyan do you have an update?
@tajila Yes, I have prototyped adding a jitReturnAddress field to the J9VMContinuation struct and using it to swap the JIT return address during virtual thread mounting and unmounting. I confirmed that the J9VMThread had the jitReturnAddress field pointing to a valid address; however, the stackWalkState still contained the invalid address that resulted in the assertion failure.
This is what I expected.
Using -verbose:stackwalk, I was able to track down what the invalid address belonged to: it was the PC for the method ForkJoinPool$WorkQueue::topLevelExec. At the time of the crash, the address was invalid.
Here is a part of the list of compiled methods obtained from a core dump using jdmpview:
start=0x7fba8126d648 end=0x7fba8126e025 java/util/concurrent/ForkJoinPool$WorkQueue::topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V
start=0x7fba81292708 end=0x7fba81293221 java/util/concurrent/ForkJoinPool$WorkQueue::topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V
There are other methods with more than one entry in the list, such as ForkJoinPool.scan. When checking the core dump, the address range of the first entry of the JIT-ed method is invalid, but the second entry's address range is valid for the compiled method, meaning the second entry is due to recompilation, as confirmed by the JIT verbose log:
! (AOT load) java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V Q_SZ=42 Q_SZI=42 QW=180 j9m=00000000001FF940 time=107us compilationRelocationFailure (classByNameValidationFailure) memLimit=262144 KB compThreadID=0
+ (cold) java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V @ 00007FBA8126E088-00007FBA8126E245 OrdinaryMethod - Q_SZ=42 Q_SZI=42 QW=180 j9m=00000000001FF940 bcsz=83 GCR compThreadID=0 CpuLoad=9%(1%avg) JvmCpu=0%
+ (warm) java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V @ 00007FBA81292708-00007FBA81293221 OrdinaryMethod C Q_SZ=0 Q_SZI=0 QW=12 j9m=00000000001FF940 bcsz=83 compThreadID=0 CpuLoad=0%(0%avg) JvmCpu=56%
The warm compilation of topLevelExec was also the final entry in the verbose log, meaning this was the last method to successfully compile prior to the crash.
So for the crash, we have
*** Invalid JIT return address 00007FBA8126E105 in 00007FBA7AA4C6B0
and the JIT return address is immediately after the JIT-ed method address range of the first entry of topLevelExec. In the core dump, that address was invalid, but within the range of the JIT code cache. The stack walk results in an error due to this invalid address in 00007FBA7AA4C6B0, which is the walkState object.
Here is the info we get using KCA on the J9Method:
Method {ClassPath/Name.MethodName}: {java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec}
Signature: (Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V
Access: Final
J9Class/J9Method: 0x00000000001ff500 / 0x00000000001ff940
Compiled Method Start: 0x00007fba81292708 (2909 bytes)
ByteCode Start: 0x00007fba582ce73c (83 bytes)
ROM Constant Pool: 0x00007fba582cdb30 (79 entries)
Constant Pool: 0x00000000001feec0 (79 entries)
As you can see, the compiled method start is the same value as the compiled method start of the second entry of topLevelExec reported by jdmpview. So the walkState object somehow needed to be updated with the new address for topLevelExec, but that is not happening.
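For intuition about why the walk fails, the validation can be pictured as a range lookup over live compiled-method bodies. This is a simplified, hypothetical sketch (not OpenJ9's actual JIT artifact lookup, and the types are invented): once recompilation retires the cold body of topLevelExec, a return address captured inside the old range no longer resolves to any method:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry describing a live compiled-method body. */
typedef struct {
    uintptr_t start;
    uintptr_t end;   /* one past the last byte of the body */
    const char *name;
} CompiledBody;

/* Resolve a JIT return address to the body containing it; NULL is the
 * situation the walker reports as an invalid JIT return address. */
static const CompiledBody *resolvePC(const CompiledBody *table, size_t n,
                                     uintptr_t pc)
{
    for (size_t i = 0; i < n; i++) {
        if (pc >= table[i].start && pc < table[i].end) {
            return &table[i];
        }
    }
    return NULL;
}
```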
The failure is relatively reproducible (I'd say 9/10 times with Nazim's setup). I'll just record an observation here while the investigation proceeds.
Just prior (I don't mean immediately before but very recently before) the failing stack walk we performed a successful stack walk on the same J9VMThread with the same walkState. The J9VMThread is not of the typical 256-byte aligned variety and doesn't show up in the thread list (address is 00007FA366682C40) but other than that appears to be formed correctly. The diffs of those stackwalks are nearly identical until the crash point (except for addresses of some objects). In both cases the method on top of the stack is jdk/internal/vm/Continuation.enterImpl()Z
but the receivers are slightly different. The object in the failing walk is linked via the continuationLink field of the first object.
First walk (successful)
!J9Object 0x00000000C00C03A0 {
struct J9Class* clazz = !j9class 0x216F00 // java/lang/VirtualThread$VThreadContinuation
Object flags = 0x00000010;
I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
J vmRef = 0x00007FA388271090 (offset = 4) (jdk/internal/vm/Continuation)
Ljdk/internal/vm/ContinuationScope; scope = !fj9object 0xc0054be8 (offset = 12) (jdk/internal/vm/Continuation)
Ljava/lang/Runnable; runnable = !fj9object 0xc00c03f0 (offset = 16) (jdk/internal/vm/Continuation)
Ljdk/internal/vm/Continuation; parent = !fj9object 0x0 (offset = 20) (jdk/internal/vm/Continuation)
Z started = 0x00000001 (offset = 24) (jdk/internal/vm/Continuation)
Z finished = 0x00000000 (offset = 28) (jdk/internal/vm/Continuation)
Z isAccessible = 0x00000001 (offset = 32) (jdk/internal/vm/Continuation)
I continuationLink = 0xC00C0350 (offset = 36) (jdk/internal/vm/Continuation) <hidden>
}
Second walk (fail)
!J9Object 0x00000000C00C0350 {
struct J9Class* clazz = !j9class 0x216F00 // java/lang/VirtualThread$VThreadContinuation
Object flags = 0x00000010;
I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
J vmRef = 0x000055EB0968A7D0 (offset = 4) (jdk/internal/vm/Continuation)
Ljdk/internal/vm/ContinuationScope; scope = !fj9object 0xc0054be8 (offset = 12) (jdk/internal/vm/Continuation)
Ljava/lang/Runnable; runnable = !fj9object 0xc00c03e0 (offset = 16) (jdk/internal/vm/Continuation)
Ljdk/internal/vm/Continuation; parent = !fj9object 0x0 (offset = 20) (jdk/internal/vm/Continuation)
Z started = 0x00000001 (offset = 24) (jdk/internal/vm/Continuation)
Z finished = 0x00000000 (offset = 28) (jdk/internal/vm/Continuation)
Z isAccessible = 0x00000000 (offset = 32) (jdk/internal/vm/Continuation)
I continuationLink = 0x00000000 (offset = 36) (jdk/internal/vm/Continuation) <hidden>
}
I'm trying to track down where that savedPC came from. It appears to be pointing to valid JIT code (or at least it was valid at some point) but doesn't correspond to any method.
In terms of the bad PC, it is in the range of a valid method (java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec) compiled at warm. However, in every case that has failed so far I observe we do an unsuccessful "profiled hot" compile of that method followed by a second "profiled hot" compile that succeeds. Barfing begins on the next stack walk after that profiled hot compile succeeds. It isn't clear to me yet why the first profiled compile fails and why the second succeeds, but the pattern is consistent enough that I suspect it is contributing to the problem in some way.
Under what circumstances do you remove/invalidate metadata for a compiled method? The assertion is triggered when metadata for the PC cannot be located.
The problem is triggered by code cache reclamation. Code reclamation occurs after a method is recompiled and the old body is provably not reachable. The JIT proves this by hooking the GC exit events and walking the stacks of all active threads looking for the presence of the old body [1]. If the body isn't found on any thread stack then the method's code can be reclaimed.
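The reclamation test described above can be sketched as follows. This is a self-contained model, not the real implementation (which hooks GC end events and uses the VM stack walker); the types and names here are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative model: a "stack" here is just the set of JIT return
 * addresses found by walking one thread. */
typedef struct ThreadStack {
    const uintptr_t *pcs;
    size_t count;
} ThreadStack;

/* The old body [oldStart, oldEnd) may be reclaimed only if no walked
 * stack still holds a return address into it. The bug discussed in this
 * issue is that unmounted continuation stacks were not among the walked
 * stacks, so this check passed even though a stale PC survived. */
bool canReclaimOldBody(const ThreadStack *stacks, size_t numStacks,
                       uintptr_t oldStart, uintptr_t oldEnd)
{
    for (size_t i = 0; i < numStacks; i++) {
        for (size_t j = 0; j < stacks[i].count; j++) {
            uintptr_t pc = stacks[i].pcs[j];
            if ((pc >= oldStart) && (pc < oldEnd)) {
                return false; /* old body still reachable from a stack */
            }
        }
    }
    return true; /* no live reference: safe to reclaim */
}
```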
The method in question for this problem is java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec, and the JIT does in fact reclaim the original compiled body. I added some instrumentation to the jitReleaseCodeStackWalk function to report the J9VMThreads and stacks being scanned, and the stack where the "Invalid JIT Return Address" is being reported is not among them. Hence, the JIT concludes the original body is not reachable.
Presumably there are thread stacks that are not attached to any J9VMThread that need to be scanned for live JITed method bodies. Can you confirm that statement is true? If so, is there a way to get a handle on those so the JIT can walk them? @fengxue-IS @babsingh @tajila
Presumably there are thread stacks that are not attached to any J9VMThread that need to be scanned for live JITed method bodies. Can you confirm that statement is true?
Yes this is true
if so, is there a way to get a handle on those so the JIT can walk them?
We will need to create a helper that iterates all continuations, or perhaps provide a more general helper to iterate all stacks.
The only way to find all stacks currently is to walk the vmthread list (as is currently being done) and then do another pass with the virtual thread list.
@fengxue-IS I think we need something like:
Iterates over all virtual threads and platform threads, both of which are represented as J9VMThreads. In the case
of virtual threads, the thread state is copied from the J9VMContinuation into the J9VMThread via
`copyFieldsFromContinuation`.
@param currentThread
@param userData to pass to callback
@param callback function
void
iterateAllVirtualAndPlatformThreads(J9VMThread *currentThread, void *userData, void (*callback)(J9VMThread *walkThread, void *userData))
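The shape of that helper can be modeled with a self-contained sketch. The types and list layout below are simplified stand-ins, not the real J9VMThread/J9VMContinuation structures:

```c
#include <stddef.h>

/* Simplified stand-in for a thread record; the real proposal operates
 * on J9VMThreads, with virtual-thread state materialized via
 * copyFieldsFromContinuation before the callback runs. */
typedef struct Thread {
    int id;
    int isVirtual;
    struct Thread *next; /* intrusive list link */
} Thread;

typedef void (*ThreadCallback)(Thread *walkThread, void *userData);

/* Model of the proposed iterator: one pass over the platform-thread
 * list, one pass over the virtual-thread list, invoking the same
 * callback for every thread so no stack is missed. */
void iterateAllThreads(Thread *platformList, Thread *virtualList,
                       void *userData, ThreadCallback callback)
{
    for (Thread *t = platformList; t != NULL; t = t->next) {
        callback(t, userData);
    }
    for (Thread *t = virtualList; t != NULL; t = t->next) {
        callback(t, userData);
    }
}
```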
We can't do that as we are eliminating the fake J9VMThreads for continuations.
We will need to be able to find all continuations in some way - if continuations are mapped 1-1 with vthreads then walking the existing vthread list would do it.
GC maintains a list of all continuation objects allocated. We can ask for an iterator to walk all native stacks associated with continuation objects. @LinHu2016 FYI
Sounds like the short term answer is to do
J9InternalVMFunctions *vmFuncs = vm->internalVMFunctions;
omrthread_monitor_enter(vm->liveVirtualThreadListMutex);
while (vm->inspectingLiveVirtualThreadList) {
    /* Virtual thread list is being inspected, wait. */
    vmFuncs->internalExitVMToJNI(currentThread);
    omrthread_monitor_wait(vm->liveVirtualThreadListMutex);
    vmFuncs->internalEnterVMFromJNI(currentThread);
}
vm->inspectingLiveVirtualThreadList = TRUE;
omrthread_monitor_exit(vm->liveVirtualThreadListMutex);
if (NULL != vm->liveVirtualThreadList) {
    /* Skip the root, which is a dummy virtual thread and global ref. */
    j9object_t root = *(vm->liveVirtualThreadList);
    j9object_t walkVirtualThread = J9OBJECT_OBJECT_LOAD(currentThread, root, vm->virtualThreadLinkNextOffset);
    do {
        J9VMThread stackThread = {0};
        J9VMEntryLocalStorage els = {0};
        j9object_t contObject = (j9object_t)J9VMJAVALANGVIRTUALTHREAD_CONT(currentThread, walkVirtualThread);
        J9VMContinuation *continuation = J9VMJDKINTERNALVMCONTINUATION_VMREF(currentThread, contObject);
        vm->internalVMFunctions->copyFieldsFromContinuation(currentThread, &stackThread, &els, continuation);
        /* add your code here to operate on stackThread */
        walkVirtualThread = J9OBJECT_OBJECT_LOAD(currentThread, walkVirtualThread, vm->virtualThreadLinkNextOffset);
    } while (root != walkVirtualThread);
}
omrthread_monitor_enter(vm->liveVirtualThreadListMutex);
vm->inspectingLiveVirtualThreadList = FALSE;
omrthread_monitor_notify_all(vm->liveVirtualThreadListMutex);
omrthread_monitor_exit(vm->liveVirtualThreadListMutex);
If this is the solution, I don't think fixing it should be held up by the work Gengchen is doing.
GC maintains a list of all continuation objects allocated. We can ask for an iterator to walk all native stacks associated with continuation objects.
Long term this is probably the best approach. I think we need a stack iterator, it doesnt have to operate on J9VMthreads though.
Something like walkAllStackFrames could encapsulate this by walking the thread list and calling walkStackFrames, then walking the continuation list and using walkContinuationStackFrames; the walkState and iterators can (by design) be shared between the two cases.
This could be implemented now and will not be affected by Gengchen's work (because it is not changing the walking APIs, just the implementation).
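That encapsulation can be sketched with a self-contained model. The real walkStackFrames/walkContinuationStackFrames operate on a J9StackWalkState; the structures below are illustrative stand-ins:

```c
#include <stddef.h>

/* Illustrative shared walk state: both passes feed the same counters,
 * mirroring how one J9StackWalkState can be shared between the thread
 * walk and the continuation walk. */
typedef struct WalkState {
    size_t framesWalked;
} WalkState;

typedef struct Stack {
    size_t frameCount;
    struct Stack *next; /* intrusive list link */
} Stack;

static void walkOneStack(WalkState *state, const Stack *stack)
{
    state->framesWalked += stack->frameCount;
}

/* Model of walkAllStackFrames: walk the thread list with the normal
 * walker, then walk the continuation list with the continuation walker,
 * sharing a single walk state across both passes so nothing is missed. */
void walkAllStackFrames(WalkState *state, Stack *threadStacks, Stack *continuationStacks)
{
    for (Stack *s = threadStacks; s != NULL; s = s->next) {
        walkOneStack(state, s); /* stands in for walkStackFrames */
    }
    for (Stack *s = continuationStacks; s != NULL; s = s->next) {
        walkOneStack(state, s); /* stands in for walkContinuationStackFrames */
    }
}
```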
Something like walkAllStackFrames could encapsulate this by walking the thread list and calling walkStackFrames, then walking the continuation list and using walkContinuationStackFrames; the walkState and iterators can (by design) be shared between the two cases. This could be implemented now and will not be affected by Gengchen's work (because it is not changing the walking APIs, just the implementation).
Such an API should always run under exclusive VM access, so we wouldn't need to acquire the vthread-related mutex/counter. We can share the same callback and stack walk flags as input arguments.
I can draft something for testing/review by Monday if we agree on this approach.
Note, I checked the code where jitReleaseCodeStackWalk is being called; it looks like it can also occur under local GC, which may not be holding exclusive access. In such cases, we may not be able to determine the correct continuation stack to scan, since multiple mounts/unmounts can take place on the same carrier thread (and a continuation may be remounted on a different carrier thread).
The GC start/end events are always sent while holding exclusive.
After taking a second look at how jitReleaseCodeStackWalk works, I realized that adding an API like walkAllStackFrames will only work with non-realtime GC. There is special logic used by the JIT under realtime GC/incremental CCR that requires custom actions to be performed between walking different stacks.
For those cases, it will not be possible to provide a generic API to cover the JIT use case. Special code will need to be added to ensure continuation stacks are properly walked, and we will also need to add a new field to the continuation to flag whether it has been walked or not.
I am also not clear on how condYieldFromGCFunctionPtr is supposed to work (does it pause the current thread and allow other threads to run, i.e. release exclusive access and re-acquire?). In such cases, if a continuation is allowed to mount/unmount, the state becomes more complicated and the result may not be what is expected.
@LinHu2016 do you have any insight to the realtime GC case?
I am in the process of prototyping a solution that uses the skeleton code Tobi provided above right in jitReleaseCodeStackWalk, and I noticed the special handling for thread yields for realtime and was curious how your implementation was going to handle that. If my prototype works then it may be an adequate solution for JDK 19 (and even beyond).
and I noticed the special handling for thread yields for realtime and was curious how your implementation was going to handle that.
I think if we are doing this in two passes (one for J9VMThreads, one for virtual threads), then we don't need any special handling for that. Any stack found via the virtual thread list is not currently active (it is in the unmounted/yielded state).
I'll just confirm that my prototype fix appears to resolve the problem.
Local testing with #16374 against the Skynet.java test: 5/5 passes for default GC (gencon); 3/3 passed for metronome (I noticed there may be up to a 30x regression in test performance compared with gencon); 1/1 crashed when running balanced GC.
Not sure if the crash in balanced GC is an intermittent issue or not; I have not seen it on other GC policies (or I haven't run it enough times to trigger it).
Updated commit: https://github.com/eclipse-openj9/openj9/pull/16374/commits/dfebbcd7d84ad5512b40e8fbf7628142077538e1 I modified the realtime GC code to double-scan the vmthread and avoid locking vthread mount/unmount; that seems to have fixed the perf regression issue. 2/2 passed for gencon, 2/2 passed for metronome. The 1/1 crash in balanced GC still exists; will look into the core later.
How shall we proceed with this?
I have a fix that implements the stack walk in place in jitReleaseCodeStackWalk that has passed testing for JDK 11 and 19 on all platforms. I am ready to open a PR for that.
Alternatively, if your universal stack walk feature will be available soon then I can incorporate that instead.
However, given where we are in the release cycle I would prefer the first option because it is less disruptive in terms of code changes. I'm worried about introducing subtle problems that will be hard to track down late in the release.
@fengxue-IS We need a definitive answer on whether all continuations can currently be found via the vthread list.
@fengxue-IS We need a definitive answer on whether all continuations can currently be found via the vthread list.
For JDK19 this is the case.
OK, so the current solution will work. In the future, the code really belongs in a VM helper, but that's complicated by realtime (which is basically the antithesis of Loom anyway).
I've renamed the issue to more accurately represent the problem.
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_Nightly/63/ jdk_lang_1 java/lang/Thread/virtual/stress/GetStackTraceALot.java#id0
23:45:13 STDERR:
23:45:13
23:45:13
23:45:13 *** Invalid JIT return address 0000000000000000 in 000001002312DE00
23:45:13
23:45:13 04:41:21.095 0x1002312db00 j9vm.249 * ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK19_ppc64_aix_Nightly/openj9/runtime/vm/swalk.c:1632: ((0 ))
Just FYI that my workaround fix also crashed with balanced GC with the same backtrace as Jack's PR.
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_s390x_linux_Nightly/67 jdk_lang_1 java/lang/Thread/virtual/stress/GetStackTraceALot.java#id0
22:05:16 *** Invalid JIT return address 000003FF696C6E14 in 000003FFA8A7E190
22:05:16
22:05:16 03:04:47.621 0x3ffa4508600 j9vm.249 * ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK19_s390x_linux_Nightly/openj9/runtime/vm/swalk.c:1632: ((0 ))
Likely due to lack of JIT support.