eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

Loom: jdk19 OpenJDK Invalid JIT return address ASSERTION FAILED swalk.c:1601 #15939

Closed: tajila closed this issue 1 year ago

tajila commented 2 years ago

Likely due to lack of JIT support.

FAILED  java/lang/Thread/virtual/stress/TimedGet.java
            STDOUT:
            Timeout refired 480 times
            34826
            45401
            56520
            67517
            77472
            87823
            98800
            110187
            121007
            131805
            142457
            152339
            162664
            173232
            183909
            194413
            205068
            215806
            226674
            238249
            250000
            STDERR:
            STATUS:Passed.
vij-singh commented 1 year ago

With Loom support now in, are these still an issue?

tajila commented 1 year ago

@fengxue-IS Is this test passing now?

tajila commented 1 year ago

@fengxue-IS any update on this?

fengxue-IS commented 1 year ago

This test passed with -Xint and failed when the JIT was enabled, due to an invalid JIT return address. A similar failure was seen with java/lang/Thread/virtual/stress/PingPong.java on aarch64 and x86.

Failed: main --enable-preview PingPong LTQ 1000000

I am working on triaging this; I may need help from the JIT team to understand the cause.

fengxue-IS commented 1 year ago

I was able to get a core locally by directly compiling the test class java/lang/Thread/virtual/stress/PingPong.java and running it with java --enable-preview PingPong

The invalid JIT return address comes from the GC walking JIT stacks on the carrier thread after the continuation has yielded. Stacktrace:

Thread 29 "ForkJoinPool-1-" hit Breakpoint 1, jitWalkStackFrames (walkState=0x7fffc8753dc0) at /root/openj9-openjdk-jdk19/openj9/runtime/codert_vm/jswalk.c:282
282                 walkState->walkThread->privateFlags |= J9_PRIVATE_FLAGS_STACK_CORRUPT;
(gdb) bt
#0  jitWalkStackFrames (walkState=0x7fffc8753dc0) at /root/openj9-openjdk-jdk19/openj9/runtime/codert_vm/jswalk.c:282
#1  0x00007ffff67a46a3 in walkStackFrames (currentThread=<optimized out>, walkState=0x7fffc8753dc0) at /root/openj9-openjdk-jdk19/openj9/runtime/vm/swalk.c:345
#2  0x00007ffff44a516c in GC_VMThreadStackSlotIterator::scanSlots (vmThread=vmThread@entry=0x252200, walkThread=walkThread@entry=0x252200, userData=userData@entry=0x7fffc87540e0, 
    oSlotIterator=oSlotIterator@entry=0x7ffff449ca70 <stackSlotIterator(J9JavaVM*, J9Object**, void*, J9StackWalkState*, void const*)>, includeStackFrameClassReferences=<optimized out>, 
    trackVisibleFrameDepth=<optimized out>) at /root/openj9-openjdk-jdk19/openj9/runtime/gc_structs/VMThreadStackSlotIterator.cpp:129
#3  0x00007ffff449c532 in MM_RootScanner::scanOneThread (this=0x7fffc8754150, env=0x7fff70026948, walkThread=0x252200, localData=0x7fffc87540e0)
    at /root/openj9-openjdk-jdk19/openj9/runtime/gc_base/RootScanner.cpp:542
#4  0x00007ffff449b23f in MM_RootScanner::scanThreads (this=0x7fffc8754150, env=0x7fff70026948) at /root/openj9-openjdk-jdk19/openj9/runtime/gc_base/RootScanner.cpp:508
#5  0x00007ffff449e172 in MM_RootScanner::scanRoots (this=0x7fffc8754150, env=0x7fff70026948) at /root/openj9-openjdk-jdk19/openj9/runtime/gc_base/RootScanner.cpp:972
#6  0x00007ffff4604eab in MM_MarkingDelegate::scanRoots (this=0x7ffff0047770, env=env@entry=0x7fff70026948, processLists=processLists@entry=true)
    at /root/openj9-openjdk-jdk19/openj9/runtime/gc_glue_java/MarkingDelegate.cpp:310
#7  0x00007ffff457ce6c in MM_MarkingScheme::markLiveObjectsRoots (this=<optimized out>, env=env@entry=0x7fff70026948, processLists=processLists@entry=true)
    at /root/openj9-openjdk-jdk19/omr/gc/base/MarkingScheme.cpp:363
#8  0x00007ffff462b8d9 in MM_ParallelMarkTask::run (this=0x7fffc8754230, env=0x7fff70026948) at /root/openj9-openjdk-jdk19/omr/gc/base/ParallelMarkTask.cpp:52
#9  0x00007ffff45951be in MM_ParallelDispatcher::run (this=0x7ffff00474c0, env=0x7fff70026948, task=0x7fffc8754230, newThreadCount=<optimized out>)
    at /root/openj9-openjdk-jdk19/omr/gc/base/ParallelDispatcher.cpp:588
#10 0x00007ffff45b07d5 in MM_ParallelGlobalGC::markAll (this=this@entry=0x7ffff000b740, env=env@entry=0x7fff70026948, initMarkMap=initMarkMap@entry=true)
    at /root/openj9-openjdk-jdk19/omr/gc/base/standard/ParallelGlobalGC.cpp:955
#11 0x00007ffff45b11a0 in MM_ParallelGlobalGC::mainThreadGarbageCollect (this=0x7ffff000b740, env=0x7fff70026948, allocDescription=0x7fffc8754668, initMarkMap=<optimized out>, rebuildMarkBits=<optimized out>)
    at /root/openj9-openjdk-jdk19/omr/gc/base/standard/ParallelGlobalGC.cpp:461
#12 0x00007ffff45ad552 in MM_ParallelGlobalGC::internalGarbageCollect (this=<optimized out>, env=0x7fff70026948, subSpace=<optimized out>, allocDescription=0x7fffc8754668)
    at /root/openj9-openjdk-jdk19/omr/gc/base/standard/ParallelGlobalGC.cpp:1234
#13 0x00007ffff4571ba6 in MM_Collector::garbageCollect (this=0x7ffff000b740, env=0x7fff70026948, callingSubSpace=0x7ffff0080c40, allocateDescription=0x7fffc8754668, gcCode=<optimized out>, 
    objectAllocationInterface=0x7fff70029460, baseSubSpace=0x7ffff00808b0, context=0x0) at /root/openj9-openjdk-jdk19/omr/gc/base/Collector.cpp:500
#14 0x00007ffff4626e00 in MM_MemorySubSpaceFlat::allocationRequestFailed (this=0x7ffff0080c40, env=0x7fff70026948, allocateDescription=0x7fffc8754668, allocationType=MM_MemorySubSpace::ALLOCATION_TYPE_TLH, 
    objectAllocationInterface=0x7fff70029460, baseSubSpace=0x7ffff00808b0, previousSubSpace=0x7ffff00808b0) at /root/openj9-openjdk-jdk19/omr/gc/base/MemorySubSpaceFlat.cpp:154
#15 0x00007ffff4628f18 in MM_MemorySubSpaceGeneric::allocateTLH (this=0x7ffff00808b0, env=0x7fff70026948, allocDescription=0x7fffc8754668, objectAllocationInterface=0x7fff70029460, baseSubSpace=0x0, 
    previousSubSpace=<optimized out>, shouldCollectOnFailure=true) at /root/openj9-openjdk-jdk19/omr/gc/base/MemorySubSpaceGeneric.cpp:377
#16 0x00007ffff459c9d4 in MM_TLHAllocationSupport::refresh (this=this@entry=0x7fff70029510, env=0x7fff70026948, allocDescription=allocDescription@entry=0x7fffc8754668, shouldCollectOnFailure=<optimized out>)
    at /root/openj9-openjdk-jdk19/omr/gc/base/TLHAllocationSupport.cpp:210
#17 0x00007ffff459cbbe in MM_TLHAllocationSupport::allocateFromTLH (this=0x7fff70029510, env=<optimized out>, allocDescription=0x7fffc8754668, shouldCollectOnFailure=<optimized out>)
    at /root/openj9-openjdk-jdk19/omr/gc/base/TLHAllocationSupport.cpp:279
#18 0x00007ffff459b51a in MM_TLHAllocationInterface::allocateObject (this=0x7fff70029460, env=0x7fff70026948, allocDescription=0x7fffc8754668, memorySpace=0x7ffff0080f90, shouldCollectOnFailure=true)
    at /root/openj9-openjdk-jdk19/omr/gc/base/TLHAllocationInterface.cpp:183
#19 0x00007ffff45a179b in MM_AllocateInitialization::allocateAndInitializeObject (omrVMThread=<optimized out>, this=0x7fffc8754650) at /root/openj9-openjdk-jdk19/omr/gc/base/AllocateInitialization.hpp:201
#20 OMR_GC_AllocateObject (omrVMThread=<optimized out>, allocator=allocator@entry=0x7fffc8754650) at /root/openj9-openjdk-jdk19/omr/gc/startup/omrgcalloc.cpp:39
#21 0x00007ffff44aaee6 in J9AllocateObject (vmThread=0x252200, clazz=0x239900, allocateFlags=<optimized out>) at /root/openj9-openjdk-jdk19/openj9/runtime/gc_modron_startup/mgcalloc.cpp:416
#22 0x00007fffefb6e7bd in slow_jitNewObjectImpl (nonZeroTLH=false, checkClassInit=true, currentThread=0x252200) at /root/openj9-openjdk-jdk19/openj9/runtime/codert_vm/cnathelp.cpp:573
#23 old_slow_jitNewObject (currentThread=0x252200) at /root/openj9-openjdk-jdk19/openj9/runtime/codert_vm/cnathelp.cpp:637
#24 0x00007fffefb83471 in jitNewObject () at /root/openj9-openjdk-jdk19/build/linux-x86_64-server-release/vm/runtime/codert_vm/xnathelp.s:1265
#25 0x0000000000252200 in ?? ()
* 29   Thread 0x7fffc8757700 (LWP 13730) "ForkJoinPool-1-" jitWalkStackFrames (walkState=0x7fffc8753dc0) at /root/openj9-openjdk-jdk19/openj9/runtime/codert_vm/jswalk.c:282

The Java stack of thread 29 in DDR:

    !stack 0x0024d700   !j9vmthread 0x0024d700  !j9thread 0x7fff70009788    tid 0x35a1 (13729) // (ForkJoinPool-1-worker-4)
    !stack 0x00252200   !j9vmthread 0x00252200  !j9thread 0x7fff70024790    tid 0x35a2 (13730) // (ForkJoinPool-1-worker-5)
    !stack 0x00255f00   !j9vmthread 0x00255f00  !j9thread 0x7fff70024d08    tid 0x35a3 (13731) // (ForkJoinPool-1-worker-6)
    !stack 0x00259b00   !j9vmthread 0x00259b00  !j9thread 0x7fff5400afc0    tid 0x35a4 (13732) // (ForkJoinPool-1-worker-7)

> !stack 0x252200
<252200>                            known but unhandled frame type com.ibm.j9ddr.vm29.pointer.U8Pointer @ 0x00000005 

 FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT 

<252200>    !j9method 0x0000000000220338   java/util/concurrent/ForkJoinTask.adapt(Ljava/lang/Runnable;)Ljava/util/concurrent/ForkJoinTask;
<252200>    !j9method 0x0000000000094E40   java/lang/VirtualThread.submitRunContinuation(Z)V
<252200>    !j9method 0x0000000000094E80   java/lang/VirtualThread.lazySubmitRunContinuation()V
<252200>    !j9method 0x0000000000094F20   java/lang/VirtualThread.afterYield()V
<252200>    !j9method 0x0000000000094E20   java/lang/VirtualThread.runContinuation()V
Oct 31, 2022 2:55:27 PM com.ibm.j9ddr.vm29.events.DefaultEventListener corruptData
WARNING: CorruptDataException thrown walking stack. walkThread = 0x0000000000252200
com.ibm.j9ddr.AddressedCorruptDataException: Invalid JIT return address
fengxue-IS commented 1 year ago

Update: the PingPong test failure seems to be intermittent; using TimedGet produced a more consistent failure for triaging. There seem to be two distinct issues with TimedGet. With the JIT enabled, I see a failure due to an invalid JIT return address; I will try to narrow this down to where the PC is being set on the stack.

When testing with -Xint, there is an issue with a corrupted stack object. This seems to be an issue only with gencon; running the test under -Xgcpolicy:balanced and -Xgcpolicy:optthruput both completed successfully. I discussed this with @LinHu2016, and he will be investigating this case.

LinHu2016 commented 1 year ago

It looks like a similar issue to https://github.com/eclipse-openj9/openj9/issues/16212

0xdaryl commented 1 year ago

@nbhuiyan : please investigate this from a JIT perspective.

babsingh commented 1 year ago

The original timeout no longer occurs. Currently, there are two failures associated with this test:

nbhuiyan commented 1 year ago

I discussed the JIT invalid return address issue in the PingPong test with @fengxue-IS. Jack reported that running with -Xjit:disableInlining,optLevel=noOpt prevented the crash from happening. I was able to reproduce the test failure locally at around the same failure rate that Jack reported earlier, using the command java --enable-preview -Xmx1g PingPong LTQ 1000000.

-Xjit:disableInlining,optLevel=noOpt preventing the failure from occurring does not necessarily mean that the failure was caused by a bad optimization or by something going wrong during inlining. It seems that just adding JIT options significantly reduces the failure rate. I experimented with -Xjit options that are specific to a different platform and would not impact compilation behaviour on x86, such as -Xjit:TR_DisableZ10, and that was sufficient to reduce the failure rate from 50% to less than 1% (verified using a bash for loop to run the test repeatedly). I needed to obtain verbose logs to get a list of methods compiled in a failed run, and then try to narrow that down to the smallest set of method(s) that must be compiled to reproduce the problem. I had to increase the iteration option of the PingPong test to a ridiculously large number, and that seems to bring the failure rate to a reasonable level.

nbhuiyan commented 1 year ago

I have obtained a JIT verbose log from a failing run, and it contained some of the methods seen in the stack trace from investigating a core dump. I was trying to narrow down the smallest set of methods that need to be compiled in order for the invalid return address to occur, so that I can focus on what happens during the compilation of those method(s). However, using a limit file to try various subsets of methods causes the problem to hide, so I have yet to figure out a way to progress further with that. I have also tried the GC race condition fix in #16290, and still saw the problem happening.

tajila commented 1 year ago

@nbhuiyan Do you think this will be resolved within 2 weeks?

nbhuiyan commented 1 year ago

@tajila

Do you think this will be resolved within 2 weeks?

It's possible, depending on what the actual cause is. Examining the JIT logs of the methods leading up to the frame where the invalid JIT return address occurs during a failed run does not show anything going wrong there, so I am starting to suspect that the problem may not be JIT-related.

nbhuiyan commented 1 year ago

Some updates with this investigation:

Segfault case, where we end up in the invalid JIT code cache region:

Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007FC61DCB0F10 Handler2=00007FC61D4B41D0 InaccessibleAddress=0000000000000000
RDI=000000000000001F RSI=0000000000000001 RAX=0000000000000000 RBX=0000000000217B00
RCX=0000000000000020 RDX=00000000003D9BE1 R8=0000000000000001 R9=00000000FFEDEF88
R10=000000000021CA78 R11=00000000C0092938 R12=00000000FFEDEFE8 R13=0000000000092B88
R14=00000000C0000000 R15=0000000000227600
RIP=00007FC5FC4EC0CB GS=0000 FS=0000 RSP=000000000021CB20
EFlags=0000000000010206 CS=0033 RBP=000000000023C000 ERR=0000000000000006
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=0000000000000000
xmm0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm1 00000000003d9be0 (f: 4037600.000000, d: 1.994839e-317)
xmm2 000000000021caf8 (f: 2214648.000000, d: 1.094181e-317)
xmm3 000000000021cc99 (f: 2215065.000000, d: 1.094388e-317)
xmm4 00007fc61de37cca (f: 501447872.000000, d: 6.941073e-310)
xmm5 0000000000000038 (f: 56.000000, d: 2.766768e-322)
xmm6 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm7 00000000c00c7dc0 (f: 3222044160.000000, d: 1.591901e-314)
xmm8 00000000001d35e8 (f: 1914344.000000, d: 9.458116e-318)
xmm9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm15 0000000000000000 (f: 0.000000, d: 0.000000e+00)

Compiled_method=unknown (In JIT code segment 00007FC6180C7F78 but no method found)

JIT invalid return address case, where we fail during stackwalk:

*** Invalid JIT return address 00007FD2434CD649 in 00007FD240472440

Looking at how we end up with the invalid JIT return address:

...
#14 0x00007fd264e208f3 in invalidJITReturnAddress (
    walkState=<optimized out>)
    at /root/builds/openj9-openjdk-jdk19/openj9/runtime/vm/swalk.c:1632
#15 0x00007fd25f4a74b5 in jitWalkStackFrames (
    walkState=0x7fd240472440)
    at /root/builds/openj9-openjdk-jdk19/openj9/runtime/codert_vm/jswalk.c:283
#16 0x00007fd264e1f6c5 in walkStackFrames (
    currentThread=<optimized out>, walkState=0x7fd240472440)
    at /root/builds/openj9-openjdk-jdk19/openj9/runtime/vm/swalk.c:384
#17 0x00007fd25ec7f8a1 in jitMethodSampleInterrupt (
    vmThread=0x259d00, handlerKey=<optimized out>,
    userData=<optimized out>)
    at /root/builds/openj9-openjdk-jdk19/openj9/runtime/compiler/control/HookedByTheJit.cpp:1278
#18 0x00007fd264dde507 in dispatchAsyncEvents (
    currentThread=0x259d00, asyncEventFlags=1)
    at /root/builds/openj9-openjdk-jdk19/openj9/runtime/vm/hookableAsync.c:238
#19 0x00007fd264db5de8 in javaCheckAsyncMessages (
    currentThread=0x259d00, throwExceptions=0)
    at /root/builds/openj9-openjdk-jdk19/openj9/runtime/vm/AsyncMessageHandler.cpp:61
#20 0x00007fd25f4931e5 in old_slow_jitCheckAsyncMessages (
    currentThread=0x259d00)
    at /root/builds/openj9-openjdk-jdk19/openj9/runtime/codert_vm/cnathelp.cpp:1431
#21 0x00007fd25f4aa8d6 in jitCheckAsyncMessages ()
    at /root/builds/openj9-openjdk-jdk19/build/linux-x86_64-server-release/vm/runtime/codert_vm/xnathelp.s:1780
#22 0x0000000000259d00 in ?? ()   <--- J9VMThread
#23 0x00007fd240472b50 in ?? ()   <--- Ptr Obj - {java/util/concurrent/ForkJoinTask$AdaptedRunnableAction} +8
#24 0x00000000ffe52a90 in ?? ()   <--- Obj - {java/util/concurrent/ForkJoinTask$AdaptedRunnableAction}
#25 0x0000000000217700 in ?? ()   <--- J9Class - {java/lang/VirtualThread$$Lambda$8/0x0000000000000000}
#26 0x0000000000071c1f in ?? ()   <--- CP - {java/lang/Module} + 1551
#27 0x0000000000227600 in ?? ()   <--- J9Class - {java/util/concurrent/ForkJoinTask}
#28 0x0000000000259d00 in ?? ()   <--- J9VMThread
#29 0x0000000000010000 in ?? ()   <--- unknown
#30 0x00007fd264f63cca in j9_void_void ()
   from /root/builds/openj9-openjdk-jdk19/build/linux-x86_64-server-release/images/jdk/lib/default/libj9vm29.so
#31 0x0000000000000000 in ?? ()

The stack walker is triggered through jitCheckAsyncMessages, and the J9VMThread is 0x259d00. Here is what we have in that J9VMThread at the time of the crash:

J9VMThread at 0x259d00 {
  Fields for J9VMThread:
    0x0: struct JNINativeInterface_* functions = !jninativeinterface_ 0x00007FD2651A6B60
    0x8: struct J9JavaVM* javaVM = !j9javavm 0x00007FD26000FA80
    0x10: UDATA* arg0EA = !j9x 0x0000000000259A98
    0x18: UDATA* bytecodes = !j9x 0x0000000000000000
    0x20: UDATA* sp = !j9x 0x0000000000259A78
    0x28: U8* pc = !j9x 0x0000000000000005
    0x30: struct J9Method* literals = !j9method 0x0000000000000000
    0x38: UDATA jitStackFrameFlags = 0x0000000000000000 (0)
    0x40: struct J9Object* jitException = !j9object 0x0000000000000000
    0x48: struct J9Object* currentException = !j9object 0x0000000000000000
    0x50: UDATA* stackOverflowMark = !j9x 0x0000000000259450
    0x58: UDATA* stackOverflowMark2 = !j9x 0x0000000000259450
    0x60: U8* heapAlloc = !j9x 0x00000000FFE52AD8 // ""
    0x68: U8* heapTop = !j9x 0x00000000FFE53018 // "���"
    0x70: IDATA tlhPrefetchFTA = 0x0000000000000000 (0)
    0x78: U8* nonZeroHeapAlloc = !j9x 0x0000000000000000
    0x80: U8* nonZeroHeapTop = !j9x 0x0000000000000000
    0x88: IDATA nonZeroTlhPrefetchFTA = 0x0000000000000000 (0)
    0x90: struct J9ThreadMonitor* publicFlagsMutex = !j9threadmonitor 0x00007FD1D8009428
    0x98: UDATA publicFlags = 0x0000000000000020 (32)
    0xa0: UDATA compressObjectReferences = 0x0000000000000001 (1)
    0xa8: struct J9Object* threadObject = !j9object 0x00000000C00A3610 // jdk/internal/misc/CarrierThread
    0xb0: void* lowTenureAddress = !j9x 0x00000000C0000000
    0xb8: void* highTenureAddress = !j9x 0x00000000C0600000
    0xc0: void* heapBaseForActiveCardTable = !j9x 0x0000000000000000
    0xc8: void* activeCardTableBase = !j9x 0x00007FD25C93C000
    0xd0: UDATA heapSizeForActiveCardTable = 0x0000000000000000 (0)
    0xd8: void* heapBaseForBarrierRange0 = !j9x 0x00000000C0000000
    0xe0: UDATA heapSizeForBarrierRange0 = 0x0000000000600000 (6291456)
    0xe8: UDATA* jniLocalReferences = !j9x 0x0000000000000000
    0xf0: UDATA tempSlot = 0x00007FD25F4B4810 (140541518628880)
    0xf8: void* jitReturnAddress = !j9x 0x00007FD2434CD649
    0x100: void* floatTemp1 = !j9x 0x00007FD2434CD545
    0x108: void* floatTemp2 = !j9x 0x00000000FFE90018
    0x110: void* floatTemp3 = !j9x 0x00007FD2434BF0D4
    0x118: void* floatTemp4 = !j9x 0x0000000000000000
    0x120: UDATA returnValue = 0x00007FD280000000 (140542067343360)
    0x128: UDATA returnValue2 = 0x0000000000000001 (1)
    0x130: UDATA* objectFlagSpinLockAddress = !j9x 0x0000000000000000
    0x138: struct J9JavaStack* stackObject = !j9javastack 0x0000000000256C50
    0x140: struct J9Thread* osThread = !j9thread 0x00007FD1E4010540
    0x148: UDATA inspectionSuspendCount = 0x0000000000000000 (0)
    0x150: UDATA inspectorCount = 0x0000000000000000 (0)
    0x158: U32 eventFlags = 0x00000010 (16)
    0x15c: U32 osrFrameIndex = 0x00000000 (0)
    0x160: void* codertTOC = !j9x 0x0000000000000000
    0x168: U8* cardTableVirtualStart = !j9x 0x00007FD25C33C000 // "9ddr/vm29/pointer/StructurePointer"
    0x170: struct J9Object* stopThrowable = !j9object 0x0000000000000000
    0x178: struct J9Object* outOfMemoryError = !j9object 0x00000000C00A3698 // java/lang/OutOfMemoryError
    0x180: UDATA* jniCurrentReference = !j9x 0x0000000000000000
    0x188: UDATA* jniLimitReference = !j9x 0x0000000000000000
    0x190: struct J9VMThread* linkNext = !j9vmthread 0x000000000025DA00
    0x198: struct J9VMThread* linkPrevious = !j9vmthread 0x0000000000233500
    0x1a0: UDATA privateFlags = 0x0000000000040002 (262146)
    0x1a8: UDATA jitTOC = 0x0000000000000000 (0)
    0x1b0: UDATA ferReturnType = 0x0000000000000000 (0)
    0x1b8: U64 ferReturnValue = 0x0000000000000000 (0)
    0x1c0: U64 mgmtBlockedTimeTotal = 0x0000000000000000 (0)
    0x1c8: U64 mgmtBlockedTimeStart = 0x0000000000000000 (0)
    0x1d0: U64 mgmtWaitedTimeTotal = 0x0000000000000000 (0)
    0x1d8: U64 mgmtWaitedTimeStart = 0x0000000000000000 (0)
    0x1e0: UDATA jniVMAccessCount = 0x0000000000000000 (0)
    0x1e8: UDATA debugEventData1 = 0x0000000000000000 (0)
    0x1f0: UDATA debugEventData2 = 0x0000000000000000 (0)
    0x1f8: UDATA debugEventData3 = 0x0000000000000000 (0)
    0x200: UDATA debugEventData4 = 0x0000000000000000 (0)
    0x208: UDATA debugEventData5 = 0x0000000000000000 (0)
    0x210: UDATA debugEventData6 = 0x0000000000000000 (0)
    0x218: UDATA debugEventData7 = 0x0000000000000000 (0)
    0x220: UDATA debugEventData8 = 0x0000000000000000 (0)
    0x228: struct J9StackElement* classLoadingStack = !j9stackelement 0x0000000000000000
    0x230: UDATA jitTransitionJumpSlot = 0x0000000000000000 (0)
    0x238: struct J9ThreadMonitor* gcClassUnloadingMutex = !j9threadmonitor 0x0000000000000000
    0x240: struct J9VMThread* gcClassUnloadingThreadPrevious = !j9vmthread 0x0000000000000000
    0x248: struct J9VMThread* gcClassUnloadingThreadNext = !j9vmthread 0x0000000000000000
    0x250: struct J9StackWalkState* stackWalkState = !j9stackwalkstate 0x000000000025A000
    0x258: struct J9VMEntryLocalStorage* entryLocalStorage = !j9vmentrylocalstorage 0x00007FD240472B70
    0x260: UDATA gpProtected = 0x0000000000000001 (1)
    0x268: struct J9VMGCSublistFragment gcRememberedSet = !j9vmgcsublistfragment 0x0000000000259F68
    0x298: struct MM_GCRememberedSetFragment sATBBarrierRememberedSetFragment = !mm_gcrememberedsetfragment 0x0000000000259F98
    0x2c8: void* gcTaskListPtr = !j9x 0x0000000000000000
    0x2d0: UDATA* dropBP = !j9x 0x0000000000000000
    0x2d8: UDATA dropFlags = 0x0000000000000000 (0)
    0x2e0: struct J9Pool* monitorEnterRecordPool = !j9pool 0x00007FD1E40116F0
    0x2e8: struct J9MonitorEnterRecord* monitorEnterRecords = !j9monitorenterrecord 0x0000000000000000
    0x2f0: UDATA* jniArrayCache = !j9x 0x0000000000000000
    0x2f8: UDATA* jniArrayCache2 = !j9x 0x0000000000000000
    0x300: struct J9StackWalkState inlineStackWalkState = !j9stackwalkstate 0x000000000025A000
    0x538: struct J9JITDecompilationInfo* decompilationStack = !j9jitdecompilationinfo 0x0000000000000000
    0x540: struct J9ModronThreadLocalHeap allocateThreadLocalHeap = !j9modronthreadlocalheap 0x000000000025A240
    0x570: struct J9ModronThreadLocalHeap nonZeroAllocateThreadLocalHeap = !j9modronthreadlocalheap 0x000000000025A270
    0x5a0: void* sidecarEvent = !j9x 0x0000000000000000
    0x5a8: struct PortlibPTBuffers_struct* ptBuffers = !portlibptbuffers_ 0x0000000000000000
    0x5b0: struct J9Object* blockingEnterObject = !j9object 0x0000000000000000
    0x5b8: void* gcExtensions = !j9x 0x00007FD1E4012878
    0x5c0: void* gpInfo = !j9x 0x0000000000000000
    0x5c8: void* jitVMwithThreadInfo = !j9x 0x00007FD1D4003210
    0x5d0: U8* profilingBufferEnd = !j9x 0x00007FD1D00031C0 // "!�f"�
    0x5d8: U8* profilingBufferCursor = !j9x 0x00007FD1D0002E32 // "���@�"
    0x5e0: UDATA* j2iFrame = !j9x 0x0000000000000000
    0x5e8: UDATA currentOSStackFree = 0x000000000007CBD0 (510928)
    0x5f0: UDATA mgmtBlockedCount = 0x0000000000000000 (0)
    0x5f8: UDATA mgmtWaitedCount = 0x0000000000000230 (560)
    0x600: UDATA mgmtBlockedStart = 0x0000000000000000 (0)
    0x608: UDATA mgmtWaitedStart = 0x0000000000000000 (0)
    0x610: UDATA cardTableShiftSize = 0x0000000000000009 (9)
    0x618: void* aotVMwithThreadInfo = !j9x 0x0000000000000000
    0x620: UDATA asyncEventFlags = 0x0000000000000000 (0)
    0x628: struct J9Object* forceEarlyReturnObjectSlot = !j9object 0x0000000000000000
    0x630: struct J9MonitorEnterRecord* jniMonitorEnterRecords = !j9monitorenterrecord 0x0000000000000000
    0x638: struct J9DLTInformationBlock dltBlock = !j9dltinformationblock 0x000000000025A338
    0x7f8: struct J9VMGCSegregatedAllocationCacheEntry* segregatedAllocationCache = !j9vmgcsegregatedallocationcacheentry 0x0000000000000000
    0x800: struct J9StackWalkState* activeWalkState = !j9stackwalkstate 0x00007FD240472440
    0x808: void* jniCalloutArgs = !j9x 0x0000000000000000
    0x810: struct J9VMThread* exclusiveVMAccessQueueNext = !j9vmthread 0x0000000000000000
    0x818: struct J9VMThread* exclusiveVMAccessQueuePrevious = !j9vmthread 0x0000000000000000
    0x820: struct J9Object* javaLangThreadLocalCache = !j9object 0x0000000000000000
    0x828: UDATA jitCountDelta = 0x0000000000000002 (2)
    0x830: UDATA maxProfilingCount = 0x0000000000001771 (6001)
    0x838: j9objectmonitor_t[] objectMonitorLookupCache = !j9x 0x000000000025A538
    0x938: UDATA jniCriticalCopyCount = 0x0000000000000000 (0)
    0x940: UDATA jniCriticalDirectCount = 0x0000000000000000 (0)
    0x948: struct J9Pool* jniReferenceFrames = !j9pool 0x0000000000000000
    0x950: U32 ludclInlineDepth = 0x00000000 (0)
    0x954: U32 ludclBPOffset = 0x00000000 (0)
    0x958: UDATA* osrJittedFrameCopy = !j9x 0x0000000000000000
    0x960: struct J9OSRBuffer* osrBuffer = !j9osrbuffer 0x0000000000000000
    0x968: void* osrReturnAddress = !j9x 0x0000000000000000
    0x970: void* osrScratchBuffer = !j9x 0x0000000000000000
    0x978: void* jitArtifactSearchCache = !j9x 0x00007FD200009B00
    0x980: void* jitExceptionHandlerCache = !j9x 0x0000000000000000
    0x988: void* jitPrivateData = !j9x 0x00007FD240487EC0
    0x990: struct J9Method* jitMethodToBeCompiled = !j9method 0x0000000000000000
    0x998: UDATA privateFlags2 = 0x0000000000000000 (0)
    0x9a0: struct OMR_VMThread* omrVMThread = !omr_vmthread 0x000000000025A748
    0x9a8: void* gpuInfo = !j9x 0x0000000000000000
    0x9b0: void* startOfMemoryBlock = !j9x 0x0000000000259CA8
    0x9b8: UDATA inNative = 0x0000000000000000 (0)
    0x9c0: struct J9JITDecompilationInfo* lastDecompilation = !j9jitdecompilationinfo 0x0000000000000000
    0x9c8: struct J9GSParameters gsParameters = !j9gsparameters 0x000000000025A6C8
    0x9d8: UDATA readBarrierRangeCheckBase = 0xFFFFFFFFFFFFFFFF (-1)
    0x9e0: UDATA readBarrierRangeCheckTop = 0x0000000000000000 (0)
    0x9e8: U32 readBarrierRangeCheckBaseCompressed = 0xFFFFFFFF (4294967295)
    0x9ec: U32 readBarrierRangeCheckTopCompressed = 0x00000000 (0)
    0x9f0: UDATA safePointCount = 0x0000000000000000 (0)
    0x9f8: volatile struct J9HashTable* utfCache = !j9hashtable 0x0000000000000000
    0xa00: U64* ffiArgs = !j9x 0x0000000000000000
    0xa08: UDATA ffiArgCount = 0x0000000000000000 (0)
    0xa10: struct J9VMContinuation* currentContinuation = !j9vmcontinuation 0x0000000000000000
    0xa18: UDATA continuationPinCount = 0x0000000000000000 (0)
    0xa20: UDATA ownedMonitorCount = 0x0000000000000000 (0)
    0xa28: UDATA callOutCount = 0x0000000000000000 (0)
    0xa30: struct J9Object* carrierThreadObject = !j9object 0x00000000C00A3610 // jdk/internal/misc/CarrierThread
    0xa38: struct J9Object* extentLocalCache = !j9object 0x0000000000000000
}

In the J9VMThread, we see the JIT return address set to 0x00007FD2434CD649, which is an invalid region of the JIT code cache. What I am now trying to understand is what sets that field in the J9VMThread to this value, where the value is obtained from, and what it is supposed to be.

babsingh commented 1 year ago

What I am now trying to understand is what sets that field in the J9VMThread to this value, where the value is obtained from, and what it is supposed to be.

While mounting a virtual thread, we copy the J9VMContinuation fields into the J9VMThread and store the carrier thread fields in the J9VMContinuation. While unmounting a virtual thread, we restore the carrier thread state by again swapping values into the J9VMThread and caching the virtual-thread-specific details in the J9VMContinuation. We may need to cache jitReturnAddress and other similar JIT fields in J9VMContinuation and swap them during virtual thread mount and unmount. J9VMContinuation contains fields that are thread-specific and cannot be shared between a virtual thread and its carrier thread. See the code associated with the function below for the above operations. Currently, jitReturnAddress is not part of the swap.

https://github.com/eclipse-openj9/openj9/blob/f21a361172f1dddf29ba198e01aa61699368c466/runtime/vm/ContinuationHelpers.hpp#L51-L84
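Roughly, the shape of the proposed swap is the following (a minimal sketch only, not the actual ContinuationHelpers.hpp code; the jitReturnAddress member of J9VMContinuation is hypothetical and does not exist at the time of writing):

/* Sketch: swap thread-specific state between a J9VMThread and its
 * J9VMContinuation at mount/unmount. The field list is illustrative,
 * not exhaustive.
 */
#define SWAP_MEMBER(fieldName, fieldType, thread, continuation) \
    do { \
        fieldType swapTemp = (fieldType)((thread)->fieldName); \
        (thread)->fieldName = (continuation)->fieldName; \
        (continuation)->fieldName = swapTemp; \
    } while (0)

static void
swapFieldsWithContinuation(J9VMThread *vmThread, J9VMContinuation *continuation)
{
    SWAP_MEMBER(arg0EA, UDATA *, vmThread, continuation);
    SWAP_MEMBER(sp, UDATA *, vmThread, continuation);
    SWAP_MEMBER(pc, U_8 *, vmThread, continuation);
    SWAP_MEMBER(literals, J9Method *, vmThread, continuation);
    SWAP_MEMBER(stackObject, J9JavaStack *, vmThread, continuation);
    /* Hypothetical addition: preserve the carrier thread's JIT return
     * address across mount/unmount so a virtual thread cannot clobber it. */
    SWAP_MEMBER(jitReturnAddress, void *, vmThread, continuation);
}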

gacholio commented 1 year ago

Ah, I see now - jitReturnAddress is used while in JIT helpers (cnathelp.cpp). I do not believe it's relevant to transitions. I've done a search and don't see it used anywhere else.

Even in the helpers, the value would be stored into a resolve frame before any potential calls into java could occur.

You could try swapping it to see if the assertion goes away, but I do not believe it will have any effect.

0xdaryl commented 1 year ago

@nbhuiyan : from the crash where you extracted the info above, can you prove whether the pop uword ptr J9TR_VMThread_jitReturnAddress[_rbp] wrote a correct value for the jitCheckAsyncMessages call? If yes then we're looking for something else to have stomped on that field.

I think prototyping adding a field to the J9VMContinuation struct and swapping the JIT return address should be easy enough to do. Could you prototype that just to rule that out as a cause?

gacholio commented 1 year ago

jitCheckAsyncMessages cannot run Java code, so it is not a swap point for continuations.

tajila commented 1 year ago

I think prototyping adding a field to the J9VMContinuation struct and swapping the JIT return address should be easy enough to do. Could you prototype that just to rule that out as a cause?

@nbhuiyan do you have an update

nbhuiyan commented 1 year ago

@tajila Yes, I have prototyped adding a jitReturnAddress field to the J9VMContinuation struct and using it to swap the JIT return address during virtual thread mounting and unmounting. I confirmed that the J9VMThread's jitReturnAddress field then pointed to a valid address; however, the stackWalkState still contained the invalid address that resulted in the assertion failure.

gacholio commented 1 year ago

This is what I expected.

nbhuiyan commented 1 year ago

Using -verbose:stackwalk, I was able to track down what the invalid address belonged to: it was the PC for the method ForkJoinPool$WorkQueue::topLevelExec. At the time of the crash, the address was invalid.

Here is a part of the list of compiled methods obtained from a core dump using jdmpview:

    start=0x7fba8126d648  end=0x7fba8126e025   java/util/concurrent/ForkJoinPool$WorkQueue::topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V
    start=0x7fba81292708  end=0x7fba81293221   java/util/concurrent/ForkJoinPool$WorkQueue::topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V

Other methods also have more than one entry in the list, such as ForkJoinPool.scan. When checking the core dump, the address range of the first entry for this JIT-ed method is invalid, but the second entry's address range contains a valid compiled method, meaning the second entry comes from a recompilation, as confirmed by the JIT verbose log:

! (AOT load) java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V Q_SZ=42 Q_SZI=42 QW=180 j9m=00000000001FF940 time=107us compilationRelocationFailure (classByNameValidationFailure) memLimit=262144 KB compThreadID=0
+ (cold) java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V @ 00007FBA8126E088-00007FBA8126E245 OrdinaryMethod - Q_SZ=42 Q_SZI=42 QW=180 j9m=00000000001FF940 bcsz=83 GCR compThreadID=0 CpuLoad=9%(1%avg) JvmCpu=0%
+ (warm) java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V @ 00007FBA81292708-00007FBA81293221 OrdinaryMethod C Q_SZ=0 Q_SZI=0 QW=12 j9m=00000000001FF940 bcsz=83 compThreadID=0 CpuLoad=0%(0%avg) JvmCpu=56%

The warm compilation of topLevelExec was also the final entry in the verbose log, meaning this was the last method to successfully compile prior to the crash.

So for the crash, we have

*** Invalid JIT return address 00007FBA8126E105 in 00007FBA7AA4C6B0

and the JIT return address is immediately after the JIT-ed method address range of the first topLevelExec entry. In the core dump, that address was invalid but within the range of the JIT code cache. The stack walk fails due to this invalid address in 00007FBA7AA4C6B0, which is the walkState object.

Here is the info we get using KCA on the J9Method:

Method   {ClassPath/Name.MethodName}: {java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec}
                           Signature: (Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V
                              Access: Final
                    J9Class/J9Method: 0x00000000001ff500 / 0x00000000001ff940
               Compiled Method Start: 0x00007fba81292708 (2909 bytes)
                      ByteCode Start: 0x00007fba582ce73c (83 bytes)
                   ROM Constant Pool: 0x00007fba582cdb30 (79 entries)
                       Constant Pool: 0x00000000001feec0 (79 entries)

As you can see, the compiled method start is the same value as the compiled method start of the second topLevelExec entry reported by jdmpview. So the walkState object somehow needed to be updated with the new address for topLevelExec, but that is not happening.

0xdaryl commented 1 year ago

The failure is relatively reproducible (I'd say 9/10 times with Nazim's setup). I'll just record an observation here while the investigation proceeds.

Just prior to the failing stack walk (I don't mean immediately before, but very recently before), we performed a successful stack walk on the same J9VMThread with the same walkState. The J9VMThread is not of the typical 256-byte aligned variety and doesn't show up in the thread list (its address is 00007FA366682C40), but other than that it appears to be formed correctly. The diffs of those stack walks are nearly identical up to the crash point (except for the addresses of some objects). In both cases the method on top of the stack is jdk/internal/vm/Continuation.enterImpl()Z, but the receivers are slightly different. The object in the failing walk is linked from the continuationLink field of the first object.

First walk (successful)

!J9Object 0x00000000C00C03A0 {
        struct J9Class* clazz = !j9class 0x216F00 // java/lang/VirtualThread$VThreadContinuation
        Object flags = 0x00000010;
        I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
        J vmRef = 0x00007FA388271090 (offset = 4) (jdk/internal/vm/Continuation)
        Ljdk/internal/vm/ContinuationScope; scope = !fj9object 0xc0054be8 (offset = 12) (jdk/internal/vm/Continuation)
        Ljava/lang/Runnable; runnable = !fj9object 0xc00c03f0 (offset = 16) (jdk/internal/vm/Continuation)
        Ljdk/internal/vm/Continuation; parent = !fj9object 0x0 (offset = 20) (jdk/internal/vm/Continuation)
        Z started = 0x00000001 (offset = 24) (jdk/internal/vm/Continuation)
        Z finished = 0x00000000 (offset = 28) (jdk/internal/vm/Continuation)
        Z isAccessible = 0x00000001 (offset = 32) (jdk/internal/vm/Continuation)
        I continuationLink = 0xC00C0350 (offset = 36) (jdk/internal/vm/Continuation) <hidden>
}

Second walk (fail)

!J9Object 0x00000000C00C0350 {
        struct J9Class* clazz = !j9class 0x216F00 // java/lang/VirtualThread$VThreadContinuation
        Object flags = 0x00000010;
        I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
        J vmRef = 0x000055EB0968A7D0 (offset = 4) (jdk/internal/vm/Continuation)
        Ljdk/internal/vm/ContinuationScope; scope = !fj9object 0xc0054be8 (offset = 12) (jdk/internal/vm/Continuation)
        Ljava/lang/Runnable; runnable = !fj9object 0xc00c03e0 (offset = 16) (jdk/internal/vm/Continuation)
        Ljdk/internal/vm/Continuation; parent = !fj9object 0x0 (offset = 20) (jdk/internal/vm/Continuation)
        Z started = 0x00000001 (offset = 24) (jdk/internal/vm/Continuation)
        Z finished = 0x00000000 (offset = 28) (jdk/internal/vm/Continuation)
        Z isAccessible = 0x00000000 (offset = 32) (jdk/internal/vm/Continuation)
        I continuationLink = 0x00000000 (offset = 36) (jdk/internal/vm/Continuation) <hidden>
}

I'm trying to track down where that savedPC came from. It appears to point to valid JIT code (or at least code that was valid at some point) but doesn't correspond to any method.

0xdaryl commented 1 year ago

In terms of the bad PC, it is in the range of a valid method (java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec) compiled at warm. However, in every case that has failed so far, I observe that we do an unsuccessful "profiled hot" compile of that method followed by a second "profiled hot" compile that succeeds. Barfing begins on the next stack walk after that profiled hot compile succeeds. It isn't clear to me yet why the first profiled compile fails and the second succeeds, but the pattern is consistent enough that I suspect it is contributing to the problem in some way.

gacholio commented 1 year ago

Under what circumstances do you remove/invalidate metadata for a compiled method? The assertion is triggered when the metadata for the PC cannot be located.

0xdaryl commented 1 year ago

The problem is triggered by code cache reclamation. Code reclamation occurs after a method is recompiled and the old body is provably not reachable. The JIT proves this by hooking the GC exit events and walking the stacks of all active threads looking for the presence of the old body [1]. If the body isn't found on any thread stack then the method's code can be reclaimed.

The method in question for this problem is java/util/concurrent/ForkJoinPool$WorkQueue.topLevelExec, and the JIT does in fact reclaim the original compiled body. I added some instrumentation to the jitReleaseCodeStackWalk function to report the J9VMThreads and stacks being scanned, and the stack where the "Invalid JIT Return Address" is reported is not among them. Hence, the JIT concludes the original body is not reachable.
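In rough pseudocode, the liveness check amounts to something like the following (a sketch only, using the usual J9StackWalkState frame-callback conventions; this is not the actual jitReleaseCodeStackWalk code):

/* userData1 carries the candidate old body; userData2 reports whether any
 * walked stack still references it. */
static UDATA
findOldBodyCallback(J9VMThread *currentThread, J9StackWalkState *walkState)
{
    if (walkState->jitInfo == (J9JITExceptionTable *)walkState->userData1) {
        walkState->userData2 = (void *)(UDATA)TRUE; /* old body still live */
        return J9_STACKWALK_STOP_ITERATING;
    }
    return J9_STACKWALK_KEEP_ITERATING;
}

static BOOLEAN
isOldBodyReachable(J9VMThread *currentThread, J9JITExceptionTable *oldBody)
{
    J9VMThread *walkThread = currentThread;
    do {
        J9StackWalkState walkState = {0};
        walkState.walkThread = walkThread;
        walkState.flags = J9_STACKWALK_ITERATE_FRAMES;
        walkState.frameWalkFunction = findOldBodyCallback;
        walkState.userData1 = oldBody;
        walkState.userData2 = (void *)(UDATA)FALSE;
        currentThread->javaVM->walkStackFrames(currentThread, &walkState);
        if (0 != (UDATA)walkState.userData2) {
            return TRUE;
        }
        walkThread = walkThread->linkNext;
    } while (walkThread != currentThread);
    /* Only J9VMThread stacks are visited here; stacks held by unmounted
     * continuations are never scanned, which is the gap described below. */
    return FALSE;
}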

Presumably there are thread stacks that are not attached to any J9VMThread that need to be scanned for live JITed method bodies. Can you confirm that statement is true? If so, is there a way to get a handle on those so the JIT can walk them? @fengxue-IS @babsingh @tajila

[1] https://github.com/eclipse-openj9/openj9/blob/8206cebaa7a669a73a28a4664298a36032951865/runtime/compiler/control/HookedByTheJit.cpp#L6460

tajila commented 1 year ago

Presumably there are thread stacks that are not attached to any J9VMThread that need to be scanned for live JITed method bodies. Can you confirm that statement is true?

Yes, this is true.

if so, is there a way to get a handle on those so the JIT can walk them?

We will need to create a helper that iterates all continuations, or perhaps provide a more general helper that iterates all stacks.

tajila commented 1 year ago

Currently, the only way to find all stacks is to walk the vmthread list (as is currently being done) and then do another pass with the virtual thread list.

tajila commented 1 year ago

@fengxue-IS I think we need something like:

/**
 * Iterate over all virtual threads and platform threads, both of which are
 * represented as J9VMThreads. In the case of virtual threads, the thread
 * state is copied from the J9VMContinuation into the J9VMThread via
 * `copyFieldsFromContinuation`.
 *
 * @param currentThread the current J9VMThread
 * @param userData opaque data to pass to the callback
 * @param callback function to invoke for each walked thread
 */
void
iterateAllVirtualAndPlatformThreads(J9VMThread *currentThread, void *userData, void (*callback)(J9VMThread *walkThread, void *userData))
gacholio commented 1 year ago

We can't do that as we are eliminating the fake J9VMThreads for continuations.

We will need to be able to find all continuations in some way - if continuations are mapped 1-1 with vthreads then walking the existing vthread list would do it.

fengxue-IS commented 1 year ago

GC maintains a list of all continuation objects allocated. We can ask for an iterator to walk all native stacks associated with continuation objects. @LinHu2016 FYI

tajila commented 1 year ago

Sounds like the short-term answer is to do:

        J9VMContinuation *continuation = NULL;

        omrthread_monitor_enter(vm->liveVirtualThreadListMutex);
        while (vm->inspectingLiveVirtualThreadList) {
            /* Virtual thread list is being inspected, wait. */
            vmFuncs->internalExitVMToJNI(currentThread);
            omrthread_monitor_wait(vm->liveVirtualThreadListMutex);
            vmFuncs->internalEnterVMFromJNI(currentThread);
        }
        vm->inspectingLiveVirtualThreadList = TRUE;
        omrthread_monitor_exit(vm->liveVirtualThreadListMutex);

        if (NULL != vm->liveVirtualThreadList) {
            /* Skip the root, which is a dummy virtual thread and global ref. */
            j9object_t root = *(vm->liveVirtualThreadList);
            j9object_t walkVirtualThread = J9OBJECT_OBJECT_LOAD(currentThread, root, vm->virtualThreadLinkNextOffset);
            do {
                J9VMThread stackThread = {0};
                J9VMEntryLocalStorage els = {0};
                j9object_t contObject = (j9object_t)J9VMJAVALANGVIRTUALTHREAD_CONT(currentThread, walkVirtualThread);
                continuation = J9VMJDKINTERNALVMCONTINUATION_VMREF(currentThread, contObject);
                vm->internalVMFunctions->copyFieldsFromContinuation(currentThread, &stackThread, &els, continuation);

                /* add your code here to operate on stackThread */

                walkVirtualThread = J9OBJECT_OBJECT_LOAD(currentThread, walkVirtualThread, vm->virtualThreadLinkNextOffset);
            } while (root != walkVirtualThread);
        }

        omrthread_monitor_enter(vm->liveVirtualThreadListMutex);
        vm->inspectingLiveVirtualThreadList = FALSE;
        omrthread_monitor_notify_all(vm->liveVirtualThreadListMutex);
        omrthread_monitor_exit(vm->liveVirtualThreadListMutex);
tajila commented 1 year ago

If this is the solution, I don't think fixing it should be held up by the work Gengchen is doing.

GC maintains a list of all continuation objects allocated. We can ask for an iterator to walk all native stacks associated with continuation objects.

Long term, this is probably the best approach. I think we need a stack iterator; it doesn't have to operate on J9VMThreads, though.

gacholio commented 1 year ago

Something like walkAllStackFrames could encapsulate this by walking the thread list and calling walkStackFrames then walking the continuation list and using walkContinuationStackFrames - the walkState and iterators can (by design) be shared between the two cases.

This could be implemented now and will not be affected by Gengchen's work (because it is not changing the walking APIs, just the implementation).
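A minimal sketch of that shape (assuming some iterator over live continuations; the exact signature of walkContinuationStackFrames and the iterator helper nextContinuation are placeholders here):

/* Walk every stack in the VM with one shared walkState: first the platform
 * (carrier) threads, then the unmounted continuations. */
static UDATA
walkAllStackFrames(J9VMThread *currentThread, J9StackWalkState *walkState)
{
    J9JavaVM *vm = currentThread->javaVM;
    UDATA rc = J9_STACKWALK_RC_NONE;
    J9VMThread *walkThread = currentThread;

    /* Pass 1: all J9VMThread stacks. */
    do {
        walkState->walkThread = walkThread;
        rc = vm->walkStackFrames(currentThread, walkState);
        walkThread = walkThread->linkNext;
    } while (walkThread != currentThread);

    /* Pass 2: all unmounted continuation stacks, reusing the same walkState
     * and iterators. nextContinuation() stands in for whatever continuation
     * or vthread-list iteration is eventually settled on. */
    for (J9VMContinuation *continuation = nextContinuation(vm, NULL);
            NULL != continuation;
            continuation = nextContinuation(vm, continuation)) {
        rc = walkContinuationStackFrames(currentThread, continuation, walkState);
    }
    return rc;
}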

fengxue-IS commented 1 year ago

Something like walkAllStackFrames could encapsulate this by walking the thread list and calling walkStackFrames then walking the continuation list and using walkContinuationStackFrames - the walkState and iterators can (by design) be shared between the two cases.

This could be implemented now and will not be affected by Gengchen's work (because it is not changing the walking APIs, just the implementation).

Such an API should always run under exclusive VM access, so we wouldn't need to acquire the vthread-related mutex/counter. We can share the same callback and stack walk flags as input arguments.

I can draft something for testing/review by Monday if we agree on this approach.

fengxue-IS commented 1 year ago

Note: I checked the code where jitReleaseCodeStackWalk is called; it looks like it can also occur under a local GC, which may not be holding exclusive access. In such cases, we may not be able to determine the correct continuation stack to scan, since multiple mounts/unmounts can take place on the same carrier thread (and the continuation may be remounted on a different carrier thread).

gacholio commented 1 year ago

The GC start/end events are always sent while holding exclusive.

fengxue-IS commented 1 year ago

After taking a second look at how jitReleaseCodeStackWalk works, I realized that adding an API like walkAllStackFrames will only work for non-realtime GC. There is special logic used by the JIT under realtime GC/incremental CCR that requires custom actions between walking different stacks.

For those cases, it will not be possible to provide a generic API that covers the JIT use case. Special code will need to be added to ensure continuation stacks are properly walked, and we also need to add a new field to the continuation to flag whether or not it has been walked.

I am also not clear on how condYieldFromGCFunctionPtr is supposed to work (does it pause the current thread and allow other threads to run, i.e. release exclusive access and re-acquire it?). In such cases, if continuations are allowed to mount/unmount, the state becomes more complicated and the result may not be what is expected. @LinHu2016, do you have any insight into the realtime GC case?

0xdaryl commented 1 year ago

I am in the process of prototyping a solution that uses the skeleton code Tobi provided above right in jitReleaseCodeStackWalk, and I noticed the special handling for thread yields for realtime and was curious how your implementation was going to handle that. If my prototype works, it may be an adequate solution for JDK 19 (and even beyond).

tajila commented 1 year ago

and I noticed the special handling for thread yields for realtime and was curious how your implementation was going to handle that.

I think if we do this in two passes, one for J9VMThreads and one for virtual threads, then we don't need any special handling for that. Any stack found via the virtual thread list is not currently active (it is in the unmounted/yielded state).

0xdaryl commented 1 year ago

I'll just confirm that my prototype fix appears to resolve the problem.

fengxue-IS commented 1 year ago

Local testing with #16374 against the Skynet.java test: 5/5 passes for the default GC (gencon), 3/3 passes for metronome (I noticed there may be up to a 30x regression in test performance compared with gencon), and 1/1 crash when running balanced GC.

Not sure whether the crash in balanced GC is an intermittent issue or not; I have not seen it on other GC policies (or I haven't run enough times to trigger it).

Updated commit: https://github.com/eclipse-openj9/openj9/pull/16374/commits/dfebbcd7d84ad5512b40e8fbf7628142077538e1 I modified the realtime GC code to double-scan the vmthreads and avoid locking vthread mount/unmount; that seems to have fixed the performance regression. 2/2 passed for gencon, 2/2 passed for metronome; the 1/1 crash in balanced GC still exists, will look into the core later.

0xdaryl commented 1 year ago

How shall we proceed with this?

I have a fix that implements the stack walk in place in jitReleaseCodeStackWalk that has passed testing for JDK 11 and 19 on all platforms. I am ready to open a PR for that.

Alternatively, if your universal stack walk feature will be available soon then I can incorporate that instead.

However, given where we are in the release cycle I would prefer the first option because it is less disruptive in terms of code changes. I'm worried about introducing subtle problems that will be hard to track down late in the release.

gacholio commented 1 year ago

@fengxue-IS We need a definitive answer on whether all continuations can currently be found via the vthread list.

tajila commented 1 year ago

@fengxue-IS We need a definitive answer on whether all continuations can currently be found via the vthread list.

For JDK19 this is the case.

gacholio commented 1 year ago

OK, so the current solution will work. In the future, the code really belongs in a VM helper, but that's complicated by realtime (which is basically the antithesis of Loom anyway).

tajila commented 1 year ago

I've renamed the issue to more accurately represent the problem.

pshipton commented 1 year ago

https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_Nightly/63/ jdk_lang_1 java/lang/Thread/virtual/stress/GetStackTraceALot.java#id0

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_Nightly/63/openjdk_test_output.tar.gz

23:45:13  STDERR:
23:45:13  
23:45:13  
23:45:13  *** Invalid JIT return address 0000000000000000 in 000001002312DE00
23:45:13  
23:45:13  04:41:21.095 0x1002312db00    j9vm.249    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK19_ppc64_aix_Nightly/openj9/runtime/vm/swalk.c:1632: ((0 ))
0xdaryl commented 1 year ago

Just FYI that my workaround fix also crashed with balanced GC with the same backtrace as Jack's PR.

pshipton commented 1 year ago

https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_s390x_linux_Nightly/67 jdk_lang_1 java/lang/Thread/virtual/stress/GetStackTraceALot.java#id0

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk19_j9_sanity.openjdk_s390x_linux_Nightly/67/openjdk_test_output.tar.gz

22:05:16  *** Invalid JIT return address 000003FF696C6E14 in 000003FFA8A7E190
22:05:16  
22:05:16  03:04:47.621 0x3ffa4508600    j9vm.249    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK19_s390x_linux_Nightly/openj9/runtime/vm/swalk.c:1632: ((0 ))