vmState [0x51dff]: {J9VMSTATE_JIT} {deadTreesElimination}
@hzongaro fyi
Grinder results:
- Failed 7/10 on win11x86-svl-rt8-1
- Passed on win11x86-rtp-rt3-1
- Failed 6/10 on win10x64vs6
- Failed 4/10 on win16x64rt2-8
- Failed 6/10 on win19x86-svl-rt7-1
This looks reminiscent of a series of cases I've been working on where we appear to have corrupted 64-bit pointers with small int values in the low 32 bits. Here InaccessibleReadAddress=0000008A0000001A is exactly R8=0000008A00000001 plus the 0x19 displacement in the faulting instruction shown below, and R8's low 32 bits hold the small value 1.
It looks like it's crashing while walking through longestPaths at the end of TR::DeadTreesElimination::process, deallocating its contents. It crashes in std::_Tree_val::_Erase_tree at 0x7FFC5441CCEF, called from 0x7FFC5477979E:
j9jit29!std::_Tree_val<std::_Tree_simple_types<std::pair<int const ,TR_Stack<TR::SymbolReference *> *> > >::_Erase_tree<TR::typed_allocator<std::_Tree_node<std::pair<int const ,TR_Stack<TR::SymbolReference *> *>,void *>,TR::Region &> >:
00007ffc`5441cce0 48895c2408 mov qword ptr [rsp+8], rbx
00007ffc`5441cce5 4889742410 mov qword ptr [rsp+10h], rsi
00007ffc`5441ccea 57 push rdi
00007ffc`5441cceb 4883ec20 sub rsp, 20h
00007ffc`5441ccef 4180781900 cmp byte ptr [_Rootnode->_Isnil (r8+19h)], 0 <<<<< Crash here
from the end of TR::DeadTreesElimination::process:
00007ffc`5477978a 7538 jne j9jit29!TR::DeadTreesElimination::process+0x11a4 (7ffc547797c4)
00007ffc`5477978c 0f1f4000 nop dword ptr [this{->_optionSets(!!)} (rax)]
00007ffc`54779790 4c8b4310 mov r8, qword ptr [_Rootnode->_Right (rbx+10h)]
00007ffc`54779794 488d542470 lea longestPaths (rdx), [longestPaths{._Mypair._Myval2._Myval1._backingAllocator} (rsp+70h)]
00007ffc`54779799 488d4c2478 lea rcx, [longestPaths._Mypair._Myval2._Myval2{._Myhead} (rsp+78h)]
00007ffc`5477979e e83d35caff call j9jit29!std::_Tree_val<std::_Tree_simple_types<std::pair<int const , TR_Stack<TR::SymbolReference *> *> > >::_Erase_tree<TR::typed_allocator<std::_Tree_node<std::pair<int const , TR_Stack<TR::SymbolReference *> *>, void *>, TR::Region &> > (7ffc5441cce0)
00007ffc`547797a3 488bd3 mov rdx, _Rootnode (rbx)
00007ffc`547797a6 488b1b mov _Rootnode (rbx), qword ptr [_Rootnode (rbx)]
00007ffc`547797a9 41b830000000 mov r8d, 30h
00007ffc`547797af 488b4c2470 mov rcx, qword ptr [longestPaths{._Mypair._Myval2._Myval1._backingAllocator} (rsp+70h)]
00007ffc`547797b4 e86782f9ff call j9jit29!TR::Region::deallocate (7ffc54711a20)
00007ffc`547797b9 807b1900 cmp byte ptr [_Rootnode->_Isnil (rbx+19h)], 0
00007ffc`547797bd 74d1 je j9jit29!TR::DeadTreesElimination::process+0x1170 (7ffc54779790)
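As a sanity check on the faulting address, here is a tiny standalone calculation (values copied from the dump and disassembly above; illustrative only, not OMR code) confirming that the inaccessible address is the corrupted _Rootnode pointer in R8 plus the 0x19 displacement of the _Isnil byte:

```cpp
#include <cstdint>
#include <cstdio>

int main()
   {
   // _Rootnode pointer at the crash: the high 32 bits look like a plausible
   // stack-region base, but the low 32 bits hold the small int value 1.
   uint64_t r8 = 0x0000008A00000001ULL;

   // "cmp byte ptr [r8+19h], 0" reads _Rootnode->_Isnil at offset 0x19.
   uint64_t fault = r8 + 0x19;

   // Prints 0x0000008A0000001A, matching InaccessibleReadAddress above.
   std::printf("faulting address = 0x%016llX\n",
               static_cast<unsigned long long>(fault));
   return 0;
   }
```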
I'm trying to add a small debugging method that will walk through longestPaths periodically throughout DeadTreesElimination::process if tracing is enabled, in hopes of narrowing down where it gets corrupted. Hopefully that doesn't end up masking the problem.
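A minimal sketch of that consistency walk, assuming longestPaths behaves like the std::map visible in the disassembly (the helper name and template shape here are placeholders, not the actual OMR code):

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Touch every node of the map so a corrupted tree pointer faults (or the
// assertion fires) near the point of corruption instead of at teardown.
template <typename MapT>
static void verifyLongestPaths(const MapT &longestPaths)
   {
   std::size_t visited = 0;
   for (typename MapT::const_iterator it = longestPaths.begin();
        it != longestPaths.end(); ++it)
      {
      (void)it->first;   // dereferencing the iterator reads each _Tree node
      ++visited;
      }
   assert(visited == longestPaths.size());
   }
```

Calling this between phases of process whenever tracing is enabled should fault or assert much closer to the corruption point than the teardown in _Erase_tree does.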
I think I have figured out what is happening. In OMR::DeadTreesElimination::process(TR::TreeTop*,TR::TreeTop*), a TR::StackMemoryRegion object is created and used to allocate objects named longestPaths and anchors. The process method calls isSafeToReplaceNode, passing _targetTrees, which is a List<OMR::TreeInfo>, as an argument. In turn, that function calls findOrCreateTreeInfo, passing the _targetTrees object along.

findOrCreateTreeInfo looks for an OMR::TreeInfo object in _targetTrees that refers to the current TR::TreeTop *, or creates a new OMR::TreeInfo object in stack memory. However, _targetTrees is initialized in the constructor of DeadTreesElimination, and its lifetime extends past the call to OMR::DeadTreesElimination::process(TR::TreeTop*,TR::TreeTop*). Thus OMR::TreeInfo instances in the List might have been allocated from stack memory that has already been released and whose storage is now being reused by other data structures, namely the aforementioned longestPaths and anchors. The anchors structure contains TR::TreeTop pointers, so if such a pointer is stored into memory that overlaps the _treeTop field of an OMR::TreeInfo allocated from that now-reused stack memory, findOrCreateTreeInfo will incorrectly return that stale OMR::TreeInfo object for a TR::TreeTop * that happens to match it. A subsequent call that sets the _height field of that OMR::TreeInfo then corrupts whatever data in the anchors structure shares that storage, ultimately leading to a crash.
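To make the mechanism concrete, here is a standalone sketch that models the sequence with a toy bump arena in place of TR::StackMemoryRegion. All names and layouts are simplified stand-ins rather than the real OMR types, and the aliasing it demonstrates is deliberately undefined behaviour; it reproduces the observed corruption pattern of a 64-bit pointer whose low 32 bits become a small int:

```cpp
#include <cstdio>
#include <new>
#include <vector>

// Toy arena standing in for TR::StackMemoryRegion: storage is handed out
// bump-style and "released" simply by resetting the cursor.
struct Arena
   {
   alignas(16) char buf[256];
   std::size_t used = 0;
   void *allocate(std::size_t n) { void *p = buf + used; used += n; return p; }
   void releaseAll() { used = 0; }   // like leaving the StackMemoryRegion scope
   };

// Simplified stand-in for OMR::TreeInfo.
struct TreeInfo
   {
   void *_treeTop;   // the key that findOrCreateTreeInfo matches on
   int   _height;    // writing this clobbers whatever now owns the storage
   };

int main()
   {
   Arena stackRegion;
   std::vector<TreeInfo *> targetTrees;   // long-lived, like _targetTrees

   // First call to process: a TreeInfo is placed in the stack region, but
   // the pointer is retained in the long-lived list.
   targetTrees.push_back(new (stackRegion.allocate(sizeof(TreeInfo)))
                            TreeInfo{nullptr, 0});
   stackRegion.releaseAll();              // region released at end of process

   // Next call: "anchors" reuses the same storage for TreeTop pointers. The
   // first slot lands where the stale _treeTop was, the second where _height was.
   int treeTopA = 0, treeTopB = 0;        // stand-ins for TR::TreeTop objects
   void **slot0 = static_cast<void **>(stackRegion.allocate(sizeof(void *)));
   void **slot1 = static_cast<void **>(stackRegion.allocate(sizeof(void *)));
   *slot0 = &treeTopA;
   *slot1 = &treeTopB;

   // findOrCreateTreeInfo now matches the stale TreeInfo for treeTopA...
   if (targetTrees[0]->_treeTop == &treeTopA)
      targetTrees[0]->_height = 1;        // ...and the write lands in slot1.

   // slot1's pointer now has its low 32 bits replaced by the small int 1,
   // the same corruption pattern seen in the crash (R8=...00000001).
   std::printf("slot1 = %p (was %p)\n", *slot1, static_cast<void *>(&treeTopB));
   return 0;
   }
```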
I believe the fix is to allocate the OMR::TreeInfo objects using the TR::Region that is associated with _targetTrees. I am testing that fix.
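In terms of the toy model above, the fix would look something like this (a sketch of the idea only, reusing the Arena and TreeInfo stand-ins from the previous snippet; the actual change lives in the OMR sources):

```cpp
// Reuses the Arena and TreeInfo stand-ins from the sketch above.
int main()
   {
   Arena listRegion;                      // lifetime tied to _targetTrees itself
   std::vector<TreeInfo *> targetTrees;

   // Allocate each TreeInfo from the region that owns the list, not from the
   // per-call stack region, so entries in targetTrees can never dangle.
   int treeTop = 0;                       // stand-in for a TR::TreeTop
   targetTrees.push_back(new (listRegion.allocate(sizeof(TreeInfo)))
                            TreeInfo{&treeTop, 0});

   // listRegion is only reset once targetTrees is discarded, so later
   // stack-region allocations can no longer alias live TreeInfo objects.
   return 0;
   }
```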
This also occurs in the 0.44 release: https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_extended.functional_x86-64_windows_Release_testList_1/91 (win2019x64-openj9-3a, threadMXBeanTestSuite1_7)
17:59:36 Type=Segmentation error vmState=0x00051dff
17:59:36 Windows_ExceptionCode=c0000005 J9Generic_Signal=00000004 ExceptionAddress=00007FFB3FDFCCDF ContextFlags=0010005f
17:59:36 Handler1=00007FFB3FE3E550 Handler2=00007FFB43ECABA0 InaccessibleReadAddress=0000005E0000001A
17:59:36 RDI=0000005EF4AF90D0 RSI=00007FF4F4E42FB0 RAX=0000000000000000 RBX=0000005EF4AF9200
17:59:36 RCX=0000005EF4AF9070 RDX=0000005EF4AF9068 R8=0000005E00000001 R9=00007FF4F4E60260
17:59:36 R10=0000000000000007 R11=0000000000000001 R12=0000000000000000 R13=00007FF4F4E439B0
17:59:36 R14=0000000000000054 R15=0000000000000001
17:59:36 RIP=00007FFB3FDFCCDF RSP=0000005EF4AF8FC0 RBP=0000005EF4AF90F0 EFLAGS=0000000000010206
17:59:36 FS=0053 ES=002B DS=002B
17:59:36 XMM0 0000005ef4af91d0 (f: 4105146880.000000, d: 2.014958e-312)
17:59:36 XMM1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 XMM2 4564657470757272 (f: 1886745216.000000, d: 1.972610e+26)
17:59:36 XMM3 00007ff4f4f00b78 (f: 4109372160.000000, d: 6.951012e-310)
17:59:36 XMM4 00007ff4f4f00db8 (f: 4109372928.000000, d: 6.951012e-310)
17:59:36 XMM5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 XMM6 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 XMM7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 XMM8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 XMM9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 XMM10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 XMM11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
17:59:36 Module=C:\Users\jenkins\workspace\Test_openjdk11_j9_extended.functional_x86-64_windows_Release_testList_1\jdkbinary\j2sdk-image\bin\default\j9jit29.dll
17:59:36 Module_base_address=00007FFB3FDD0000 Offset_in_DLL=000000000002ccdf
17:59:36
17:59:36 Method_being_compiled=org/openj9/test/java/lang/management/ThreadMXBean/FindDeadlockTest$DSThread.run()V
17:59:36 Target=2_90_20240413_132 (Windows Server 2019 10.0 build 17763)
17:59:36 CPU=amd64 (4 logical CPUs) (0x3fff77000 RAM)
17:59:36 ----------- Stack Backtrace -----------
17:59:36 (0x00007FFB3FDFCCDF [j9jit29+0x2ccdf])
17:59:36 Java_java_lang_invoke_ThunkTuple_initialInvokeExactThunk+0x2bdeb2 (0x00007FFB40159072 [j9jit29+0x389072])
17:59:36 Java_java_lang_invoke_ThunkTuple_initialInvokeExactThunk+0x2bbafc (0x00007FFB40156CBC [j9jit29+0x386cbc])
17:59:36 Java_java_lang_invoke_ThunkTuple_initialInvokeExactThunk+0x3b0530 (0x00007FFB4024B6F0 [j9jit29+0x47b6f0])
17:59:36 Java_java_lang_invoke_ThunkTuple_initialInvokeExactThunk+0x3acf75 (0x00007FFB40248135 [j9jit29+0x478135])
17:59:36 Java_java_lang_invoke_ThunkTuple_initialInvokeExactThunk+0x2494e7 (0x00007FFB400E46A7 [j9jit29+0x3146a7])
17:59:36 (0x00007FFB3FE2BE0A [j9jit29+0x5be0a])
17:59:36 (0x00007FFB3FE2F3F8 [j9jit29+0x5f3f8])
17:59:36 j9port_isCompatible+0x18a46 (0x00007FFB43ECB626 [j9prt29+0x1b626])
17:59:36 j9port_isCompatible+0x1a180 (0x00007FFB43ECCD60 [j9prt29+0x1cd60])
17:59:36 (0x00007FFB3FE2B56E [j9jit29+0x5b56e])
17:59:36 (0x00007FFB3FE318A0 [j9jit29+0x618a0])
17:59:36 (0x00007FFB3FE3128A [j9jit29+0x6128a])
17:59:36 (0x00007FFB3FE3E380 [j9jit29+0x6e380])
17:59:36 j9port_isCompatible+0x1a1bb (0x00007FFB43ECCD9B [j9prt29+0x1cd9b])
17:59:36 (0x00007FFB3FE3E076 [j9jit29+0x6e076])
17:59:36 omrthread_get_category+0xa42 (0x00007FFB48484242 [j9thr29+0x4242])
17:59:36 _o_exp+0x5a (0x00007FFB527E268A [ucrtbase+0x2268a])
17:59:36 BaseThreadInitThunk+0x14 (0x00007FFB54B67AC4 [KERNEL32+0x17ac4])
17:59:36 RtlUserThreadStart+0x21 (0x00007FFB557FA4E1 [ntdll+0x5a4e1])
This also occurs in the 0.44 release.
It's not a new problem; I believe it was first introduced in the 0.9.0 release, but it's highly intermittent.
@pshipton, @JamesKingdon, do you feel we should port the fix for this into the 0.44 release as well, or is it safe to wait for the 0.46 release?
Given this isn't a new problem or a regression, and doesn't have a big customer impact, we should wait for 0.46. We've already done the Release Candidate 1 builds for 0.44, and adding anything is going to delay the release.
Does it make a difference to when it would get into the IBM SDK?
The fix is not in 24_02 or 24_02u1; it will be in 24_03. It could be double-delivered to 24_02u1; there is time.
Fixed by eclipse/omr#7305
Failure link: from an internal build (win19x86-rtp-rt7-1).
50x internal grinder - https://github.com/eclipse-openj9/openj9/issues/19197#issuecomment-2011073517