Closed amanasifkhalid closed 6 months ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
This only failed once, so I'm going to unmark it as blocking (also doesn't repro easily it looks like).
From the dump it definitely looks OSR related. The stack trace looks like the following:
(lldb) clrstack -f
OS Thread Id: 0xf99 (1)
Child SP IP Call Site
000072F82D64AFA0 000072F82D077613 libc.so.6!___lldb_unnamed_symbol3325 + 115
000072F82D64D100 000072F82D0606CA libc.so.6!fprintf + 154
000072F82D64D1E0 000072F82CEB4787 libcoreclr.so!PROCCreateCrashDump(std::vector<char const*, std::allocator<char const*> >&, char*, int, bool) + 1111 at /__w/1/s/src/coreclr/pal/src/thread/process.cpp:2320
000072F82D64D250 000072F82CEB5A8E libcoreclr.so!PROCCreateCrashDumpIfEnabled + 2942 at /__w/1/s/src/coreclr/pal/src/thread/process.cpp:0
000072F82D64D2E0 000072F82CE533A9 libcoreclr.so!invoke_previous_action(sigaction*, int, siginfo_t*, void*, bool) + 377 at /__w/1/s/src/coreclr/pal/src/exception/signal.cpp:397
000072F82D64D330 000072F82CE525F5 libcoreclr.so!sigsegv_handler(int, siginfo_t*, void*) + 341 at /__w/1/s/src/coreclr/pal/src/exception/signal.cpp:631
000072F82D64D380 000072F82D042520 libc.so.6!___lldb_unnamed_symbol3237 + 1
00007FFF40434DC8 0000000000000000
00007FFF40434DD0 000072F7B0BDA404
00007FFF40434E80 000072F7B0BDAEBF
00007FFF40434EC0 000072F7B0BD46D4
00007FFF404350D0 000072F7B0B8BCD2
00007FFF40435170 000072F7B0B7D6E0
(not sure why managed names are missing. Opened https://github.com/dotnet/diagnostics/issues/4637.)
The previous valid return address is:
(lldb) ip2md 72F7B0BDA404
MethodDesc: 000072f7b0a57a90
Method Name: ILCompiler.LazyGenericsSupport+GraphBuilder.<.ctor>g__AddProcessType|0_1(Internal.TypeSystem.TypeDesc, <>c__DisplayClass0_0 ByRef)
Class: 000072f7b0a621e8
MethodTable: 000072f7b0a57bc8
mdToken: 0000000006000DB6
Module: 000072f7af160888
IsJitted: yes
Current CodeAddr: 000072f7b0bda3a0
Version History:
ILCodeVersion: 0000000000000000
ReJIT ID: 0
IL Addr: 000072f7ac0b796c
CodeAddr: 000072f7b0916f40 (OptimizedTier1OSR)
NativeCodeVersion: 00005C2444DC3C00
CodeAddr: 000072f7b0bda3a0 (QuickJitted + Instrumented)
NativeCodeVersion: 000072B6F8589A80
CodeAddr: 000072f7b0bdb4c0 (OptimizedTier1OSR)
NativeCodeVersion: 00005C2444DE84A0
CodeAddr: 0000000000000000 (OptimizedTier1)
NativeCodeVersion: 000072B6F86842E0
CodeAddr: 000072f7b08f0640 (QuickJitted)
NativeCodeVersion: 0000000000000000
...
(lldb) clru 72F7B0BDA404
Normal JIT generated code
ILCompiler.LazyGenericsSupport+GraphBuilder.<.ctor>g__AddProcessType|0_1(Internal.TypeSystem.TypeDesc, <>c__DisplayClass0_0 ByRef)
ilAddr is 000072F7AC0B796C pImport is 000000000427EC30
Begin 000072F7B0BDA3A0, size 26d
000072f7b0bda3a0 55 push rbp
000072f7b0bda3a1 4881eca0000000 sub rsp, 0xa0
000072f7b0bda3a8 488dac24a0000000 lea rbp, [rsp + 0xa0]
000072f7b0bda3b0 c4413857c0 vxorps xmm8, xmm8, xmm8
000072f7b0bda3b5 c57e7f8560ffffff vmovdqu ymmword ptr [rbp - 0xa0], ymm8
000072f7b0bda3bd c57e7f4580 vmovdqu ymmword ptr [rbp - 0x80], ymm8
000072f7b0bda3c2 c57e7f45a0 vmovdqu ymmword ptr [rbp - 0x60], ymm8
000072f7b0bda3c7 33c0 xor eax, eax
000072f7b0bda3c9 488945c0 mov qword ptr [rbp - 0x40], rax
000072f7b0bda3cd 48897dd0 mov qword ptr [rbp - 0x30], rdi
000072f7b0bda3d1 488975c8 mov qword ptr [rbp - 0x38], rsi
000072f7b0bda3d5 c78578ffffff01000000 mov dword ptr [rbp - 0x88], 0x1
000072f7b0bda3df 8b8578ffffff mov eax, dword ptr [rbp - 0x88]
000072f7b0bda3e5 ffc8 dec eax
000072f7b0bda3e7 898578ffffff mov dword ptr [rbp - 0x88], eax
000072f7b0bda3ed 83bd78ffffff00 cmp dword ptr [rbp - 0x88], 0x0
000072f7b0bda3f4 7f0e jg 0x72f7b0bda404
000072f7b0bda3f6 488dbd78ffffff lea rdi, [rbp - 0x88]
000072f7b0bda3fd 33f6 xor esi, esi
000072f7b0bda3ff e82ce1ee7b call 0x72f82cac8530 (JitHelp: CORINFO_HELP_PATCHPOINT)
>>> 000072f7b0bda404 488b7dd0 mov rdi, qword ptr [rbp - 0x30]
000072f7b0bda408 33f6 xor esi, esi
So seems like something went wrong while transitioning from the patchpoint.
@AndyAyersMS do you have any thoughts? I don't suppose there are some diagnostics available to e.g. obtain the particular PerPatchpointInfo
relevant here?
The PerPatchpointInfo
is this one:
(lldb) p *(PerPatchpointInfo*)0x000072f7b0cc0638
(PerPatchpointInfo) $17 = {
m_osrMethodCode = 126408147711168
m_patchpointCount = 2
m_flags = 1
m_patchpointId = 1538
}
That method code corresponds to
CodeAddr: 000072f7b0bdb4c0 (OptimizedTier1OSR)
NativeCodeVersion: 00005C2444DE84A0
seen above. Unclear whether the issue here is OSR related or whether something bad just happened after transitioning.
This is the code for the OSR version:
(lldb) clru 72f7b0bdb4c0
Normal JIT generated code
ILCompiler.LazyGenericsSupport+GraphBuilder.<.ctor>g__AddProcessType|0_1(Internal.TypeSystem.TypeDesc, <>c__DisplayClass0_0 ByRef)
ilAddr is 000072F7AC0B796C pImport is 00007ECC5801F550
Begin 000072F7B0BDB4C0, size 147
>>> 000072f7b0bdb4c0 488b4500 mov rax, qword ptr [rbp]
000072f7b0bdb4c4 50 push rax
000072f7b0bdb4c5 4883ec20 sub rsp, 0x20
000072f7b0bdb4c9 4c89bc24c8000000 mov qword ptr [rsp + 0xc8], r15
000072f7b0bdb4d1 4c89b424c0000000 mov qword ptr [rsp + 0xc0], r14
000072f7b0bdb4d9 48899c24b8000000 mov qword ptr [rsp + 0xb8], rbx
000072f7b0bdb4e1 488d6c2420 lea rbp, [rsp + 0x20]
000072f7b0bdb4e6 33c0 xor eax, eax
000072f7b0bdb4e8 488945e0 mov qword ptr [rbp - 0x20], rax
000072f7b0bdb4ec 488b9d80000000 mov rbx, qword ptr [rbp + 0x80]
000072f7b0bdb4f3 4c8b7d78 mov r15, qword ptr [rbp + 0x78]
000072f7b0bdb4f7 4885db test rbx, rbx
000072f7b0bdb4fa 0f84e7000000 je 0x72f7b0bdb5e7
000072f7b0bdb500 498b3f mov rdi, qword ptr [r15]
000072f7b0bdb503 40383f cmp byte ptr [rdi], dil
000072f7b0bdb506 488d55e0 lea rdx, [rbp - 0x20]
000072f7b0bdb50a 488bf3 mov rsi, rbx
000072f7b0bdb50d ff153d8f61fe call qword ptr [rip - 0x19e70c3]
000072f7b0bdb513 85c0 test eax, eax
000072f7b0bdb515 742b je 0x72f7b0bdb542
000072f7b0bdb517 498b7f08 mov rdi, qword ptr [r15 + 0x8]
000072f7b0bdb51b ff4714 inc dword ptr [rdi + 0x14]
000072f7b0bdb51e 488b5708 mov rdx, qword ptr [rdi + 0x8]
000072f7b0bdb522 8b7710 mov esi, dword ptr [rdi + 0x10]
000072f7b0bdb525 397208 cmp dword ptr [rdx + 0x8], esi
000072f7b0bdb528 0f8693000000 jbe 0x72f7b0bdb5c1
000072f7b0bdb52e 8d4601 lea eax, [rsi + 0x1]
000072f7b0bdb531 894710 mov dword ptr [rdi + 0x10], eax
000072f7b0bdb534 4863f6 movsxd rsi, esi
000072f7b0bdb537 488bfa mov rdi, rdx
000072f7b0bdb53a 488bd3 mov rdx, rbx
000072f7b0bdb53d e8fe8c15fe call 0x72f7aed34240 (System.Runtime.CompilerServices.CastHelpers.StelemRef(System.Object[], IntPtr, System.Object), mdToken: 0000000006006843)
000072f7b0bdb542 4c8bf3 mov r14, rbx
000072f7b0bdb545 48be10a8fcaff7720000 movabs rsi, 0x72f7affca810
000072f7b0bdb54f 493936 cmp qword ptr [r14], rsi
000072f7b0bdb552 7556 jne 0x72f7b0bdb5aa
000072f7b0bdb554 4533f6 xor r14d, r14d
000072f7b0bdb557 4d85f6 test r14, r14
000072f7b0bdb55a 0f8595000000 jne 0x72f7b0bdb5f5
000072f7b0bdb560 48bf10a8fcaff7720000 movabs rdi, 0x72f7affca810
000072f7b0bdb56a 48393b cmp qword ptr [rbx], rdi
000072f7b0bdb56d 7560 jne 0x72f7b0bdb5cf
000072f7b0bdb56f 48837b7000 cmp qword ptr [rbx + 0x70], 0x0
000072f7b0bdb574 740b je 0x72f7b0bdb581
000072f7b0bdb576 4c8b7370 mov r14, qword ptr [rbx + 0x70]
000072f7b0bdb57a bbffffffff mov ebx, 0xffffffff
000072f7b0bdb57f eb1d jmp 0x72f7b0bdb59e
000072f7b0bdb581 488bfb mov rdi, rbx
000072f7b0bdb584 ff15864237ff call qword ptr [rip - 0xc8bd7a]
000072f7b0bdb58a ebea jmp 0x72f7b0bdb576
000072f7b0bdb58c 3bdf cmp ebx, edi
000072f7b0bdb58e 7351 jae 0x72f7b0bdb5e1
000072f7b0bdb590 498b7cde10 mov rdi, qword ptr [r14 + 8*rbx + 0x10]
000072f7b0bdb595 498bf7 mov rsi, r15
000072f7b0bdb598 ff15b2b5e3ff call qword ptr [rip - 0x1c4a4e]
000072f7b0bdb59e ffc3 inc ebx
000072f7b0bdb5a0 418b7e08 mov edi, dword ptr [r14 + 0x8]
000072f7b0bdb5a4 3bfb cmp edi, ebx
000072f7b0bdb5a6 7fe4 jg 0x72f7b0bdb58c
000072f7b0bdb5a8 eb3d jmp 0x72f7b0bdb5e7
000072f7b0bdb5aa 488bf3 mov rsi, rbx
000072f7b0bdb5ad 48bff064cfaff7720000 movabs rdi, 0x72f7afcf64f0
000072f7b0bdb5b7 e8e47513fe call 0x72f7aed12ba0 (System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object), mdToken: 0000000006006838)
000072f7b0bdb5bc 4c8bf0 mov r14, rax
000072f7b0bdb5bf eb96 jmp 0x72f7b0bdb557
000072f7b0bdb5c1 488bf3 mov rsi, rbx
000072f7b0bdb5c4 ff158ead4dfe call qword ptr [rip - 0x1b25272]
000072f7b0bdb5ca e973ffffff jmp 0x72f7b0bdb542
000072f7b0bdb5cf 488bfb mov rdi, rbx
000072f7b0bdb5d2 488b03 mov rax, qword ptr [rbx]
000072f7b0bdb5d5 488b4048 mov rax, qword ptr [rax + 0x48]
000072f7b0bdb5d9 ff5038 call qword ptr [rax + 0x38]
000072f7b0bdb5dc 4c8bf0 mov r14, rax
000072f7b0bdb5df eb99 jmp 0x72f7b0bdb57a
000072f7b0bdb5e1 e8ca6eee7b call 0x72f82cac24b0 (JitHelp: CORINFO_HELP_RNGCHKFAIL)
000072f7b0bdb5e6 cc int3
000072f7b0bdb5e7 4881c4b8000000 add rsp, 0xb8
000072f7b0bdb5ee 5b pop rbx
000072f7b0bdb5ef 415e pop r14
000072f7b0bdb5f1 415f pop r15
000072f7b0bdb5f3 5d pop rbp
000072f7b0bdb5f4 c3 ret
000072f7b0bdb5f5 498b7e28 mov rdi, qword ptr [r14 + 0x28]
000072f7b0bdb5f9 498bf7 mov rsi, r15
000072f7b0bdb5fc ff154eb5e3ff call qword ptr [rip - 0x1c4ab2]
000072f7b0bdb602 e959ffffff jmp 0x72f7b0bdb560
I'm not sure what the mov rax, [rbp+0]
is doing there. Something related to OSR prolog..?
Both frame 7 and frame 8 have the same frame pointer:
(lldb) fr sel 7
frame #7: 0x0000000000000000
error: core file does not contain 0x0
(lldb) register read
General Purpose Registers:
rax = 0x000072f82d6376a0
rbx = 0x000072b799d213b8
rcx = 0x0000000000000602
rdx = 0x000072f82d638160
rdi = 0x00007fff40434170
rsi = 0x000072f7b0bda404
rbp = 0x00007fff40434e70
rsp = 0x00007fff40434dc8
r8 = 0x00007fff40434138
r9 = 0x000072f7ac145008
r10 = 0x00007fff40434170
r11 = 0x0000000000000000
r12 = 0x000072b796044770
r13 = 0x000072b7962a1390
r14 = 0x000072f7afcf4fe0
r15 = 0x000072b799659e40
rip = 0x0000000000000000
...
(lldb) fr sel 8
frame #8: 0x000072f7b0bda404
-> 0x72f7b0bda404: mov rdi, qword ptr [rbp - 0x30]
0x72f7b0bda408: xor esi, esi
0x72f7b0bda40a: call qword ptr [rip - 0xf03158]
0x72f7b0bda410: test eax, eax
(lldb) register read
General Purpose Registers:
rbx = 0x000072b799d213b8
rbp = 0x00007fff40434e70
rsp = 0x00007fff40434dd0
r12 = 0x000072b796044770
r13 = 0x000072b7962a1390
r14 = 0x000072f7afcf4fe0
r15 = 0x000072b799659e40
rip = 0x000072f7b0bda404
16 registers were unavailable.
This probably indicates that we're returning. Indeed, it appears we tried to return to 0:
(lldb) p *(void**)$rsp
(void *) $19 = 0x0000000000000000
However, it's quite odd that RBP appears to be the same as frame 8. That means we should be about to return to the tier0 version of a method we transitioned out of. That shouldn't be possible. The return address of the tier 0 version right after the patchpoint is very close by:
(lldb) p *(void**)($rsp-8)
(void *) $21 = 0x000072f7b0bda404
Another pointer: rsi
still contains the return address passed to JIT_Patchpoint
, so it seems likely that we haven't managed to transition successfully, since the OSR version of the code would trash rsi
if we did transition successfully.
It looks like all others threads are suspended for GC:
DBG ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
1 1 f99 00005C24445275E0 20062 Cooperative 000072B799D74D80:000072B799D75010 00005C244450CFC0 -00001 Ukn
6 2 f9f 00005C244452FB20 21220 Preemptive 000072B796802500:000072B796804298 00005C244450CFC0 -00001 Ukn (Finalizer)
7 3 fa3 00005C244453D240 21220 Preemptive 0000000000000000:0000000000000000 00005C244450CFC0 -00001 Ukn
8 4 fa4 00005C2444C7BCD0 1021220 Preemptive 000072B799DE01C0:000072B799DE01C0 00005C244450CFC0 -00001 Ukn (GC) (Threadpool Worker)
9 5 fa5 00005C2444C7F660 1021220 Preemptive 0000000000000000:0000000000000000 00005C244450CFC0 -00001 Ukn (Threadpool Worker)
10 6 fa6 000072B6F000FDB0 1021220 Preemptive 000072B799DCE608:000072B799DCF0B0 00005C244450CFC0 -00001 Ukn (Threadpool Worker)
11 7 fe6 000072B6F016EA60 1021220 Preemptive 000072B799D7E990:000072B799D7F0B0 00005C244450CFC0 -00001 Ukn (Threadpool Worker)
12 8 fe7 000072B6EC122990 21220 Preemptive 0000000000000000:0000000000000000 00005C244450CFC0 -00001 Ukn
13 9 fea 000072B6EC08C210 1021220 Preemptive 000072B799DDFB60:000072B799DE0178 00005C244450CFC0 -00001 Ukn (Threadpool Worker)
14 10 fec 000072B6E001D420 1021220 Preemptive 000072B799D95008:000072B799D95010 00005C244450CFC0 -00001 Ukn (Threadpool Worker)
and everyone is waiting for this transitioning thread to suspend. Perhaps some edge case around EE suspension/hijacking during OSR transitions?
Yeah I have similar looking crashes where we end up trying to execute at address zero. Haven't gotten very far trying to debug.
As far as I know nothing has changed in the OSR prolog generation and/or the patchpoint helper, so I think the problem lies elsewhere, but am not sure.
We've definitely called ClrRestoreNonvolatileContext
not very long ago. r10
still has the address of the CONTEXT
that was passed to restore. In the dump above that address is 0x7fff40434170
and you arrive at the same value from some careful frame pointer arithmetic.
Sadly the CONTEXT
on the stack has been overwritten by the functions used for the dump generation code, so it's hard to tell exactly what context was resumed with (and whether the problem happens right at resumption or later). Perhaps the other dump has more information.
@janvorli this is similar to the crash dump I was asking you about.
Also seems to be correlated with either unusual patchpoint placement (via DOTNET_JitRandomOnStackReplacement=15
) and/or very frequent OSR transitions (via DOTNET_TC_OnStackReplacement_InitialCounter=1
and DOTNET_OSR_HitLimit=2
). So I suspect some kind of race or rare interaction. In normal usage we very rarely do OSR transitions.
I have never been able to repro this locally.
I was going to try running with both OSR stress and GC stress to see if perhaps that would lead to a more consistent repro.
Actually it looks like the signal handler is called on a separate stack, so I am not sure why the CONTEXT
record on the stack appears to be corrupted.
* thread #1, name = 'corerun', stop reason = signal SIGSEGV
frame #0: sp=0x000072f82d64afa0 fp=0x000072f82d64d240 pc=0x000072f82d077613 libc.so.6`___lldb_unnamed_symbol3325 + 115
frame #1: sp=0x000072f82d64d100 fp=0x000072f82d64d240 pc=0x000072f82d0606ca libc.so.6`fprintf + 154
frame #2: sp=0x000072f82d64d1e0 fp=0x000072f82d64d240 pc=0x000072f82ceb4787 libcoreclr.so`PROCCreateCrashDump(argv=size=16, errorMessageBuffer=0x0000000000000000, cbErrorMessageBuffer=0, serialize=<unavailable>) + 1111 at process.cpp:2318 [opt]
frame #3: sp=0x000072f82d64d250 fp=0x000072f82d64d2d0 pc=0x000072f82ceb5a8e libcoreclr.so`::PROCCreateCrashDumpIfEnabled(signal=11, siginfo=0x000072f82d64d4b0, serialize=true) + 2942 at process.cpp:2524 [opt]
frame #4: sp=0x000072f82d64d2e0 fp=0x000072f82d64d320 pc=0x000072f82ce533a9 libcoreclr.so`invoke_previous_action(action=0x000072f82cfa6e98, code=11, siginfo=0x000072f82d64d4b0, context=0x000072f82d64d380, signalRestarts=true) + 377 at signal.cpp:394 [opt]
frame #5: sp=0x000072f82d64d330 fp=0x000072f82d64d370 pc=0x000072f82ce525f5 libcoreclr.so`sigsegv_handler(code=11, siginfo=0x000072f82d64d4b0, context=0x000072f82d64d380) + 341 at signal.cpp:630 [opt]
frame #6: sp=0x000072f82d64d380 fp=0x00007fff40434e70 pc=0x000072f82d042520 libc.so.6`___lldb_unnamed_symbol3237 + 1
* frame #7: sp=0x00007fff40434dc8 fp=0x00007fff40434e70 pc=0x0000000000000000
frame #8: sp=0x00007fff40434dd0 fp=0x00007fff40434e70 pc=0x000072f7b0bda404
...
(lldb) p (CONTEXT*)$r10
(CONTEXT *) $60 = 0x00007fff40434170
(lldb) p (CONTEXT*)($rbp-0xA8-8-0xC50)
(CONTEXT *) $61 = 0x00007fff40434170
(lldb) p *(CONTEXT*)$r10
(CONTEXT) $62 = {
Rax = 0
Rcx = 71776119061217280
Rdx = 0
Rbx = 4294967295
Rsp = 4294967295
Rbp = 4294967295
Rsi = 4294967295
Rdi = 0
R8 = 0
R9 = 34
R10 = 4294967290
R11 = 4294967299993
R12 = 0
R13 = 0
R14 = 0
R15 = 0
Rip = 0
Maybe we should try adding some debug-only code to get some more context the next time we get a dump... have you only seen this issue on linux-x64 @AndyAyersMS?
One thing I wonder is if there should be a volatile read on PerPatchpointInfo::m_osrMethodCode
. Otherwise I think clang can legally compile
https://github.com/dotnet/runtime/blob/5fe1a569099272cbe3cec4972c52e5531223b58e/src/coreclr/vm/jithelpers.cpp#L5160-L5163
to
PCODE osrMethodCode = ppInfo->m_osrMethodCode;
if (ppInfo->m_osrMethodCode == NULL)
{
...
}
// use osrMethodCode that can still be null
This is more of a theoretical problem though, I do not see this codegen in my dump.
Yeah, I believe it is linux only (though perhaps also on arm64) so changes in clang behavior might be a possibility.
Can you link some of the other reports/dumps?
This is the one I was looking at: https://dev.azure.com/dnceng-public/public/_build/results?buildId=650270&view=ms.vss-test-web.build-test-results-tab
This is likely the same problem but hasn't happened recently: https://github.com/dotnet/runtime/issues/98292
The other dump similarly has all threads but the transitioning one waiting for the EE to be suspended.
This other dump has a 0 entry for the OptimizedTier1 code which seems surprising:
(lldb) ip2md 7fc0410f8f1e
MethodDesc: 00007fc0337a7338
Method Name: Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.SyntaxToken+SyntaxIdentifierExtended.get_ValueText()
Class: 00007fc0337bbea8
MethodTable: 00007fc0337a74d0
mdToken: 0000000006009795
Module: 00007fc031a3a3d8
IsJitted: yes
Current CodeAddr: 00007fc0410f8ef0
Version History:
ILCodeVersion: 0000000000000000
ReJIT ID: 0
IL Addr: 00007f7f79c2f6de
CodeAddr: 00007fc03cd8abe0 (OptimizedTier1OSR)
NativeCodeVersion: 00007F7F5413F840
CodeAddr: 00007fc0410f8ef0 (QuickJitted + Instrumented)
NativeCodeVersion: 00007F7F7C992190
CodeAddr: 00007fc0414e4520 (OptimizedTier1OSR)
NativeCodeVersion: 00007F7F7CCB3450
CodeAddr: 0000000000000000 (OptimizedTier1)
NativeCodeVersion: 00007F7F7C570FA0
CodeAddr: 00007fc03cd8ab40 (QuickJitted)
NativeCodeVersion: 0000000000000000
The PerPatchpointInfo
is this one:
(lldb) p *(PerPatchpointInfo*)(((LoaderAllocator*)((Module*)0x00007fc031a3a3d8)->m_loaderAllocator)->m_onStackRep
lacementManager->m_jitPatchpointTable.m_pVolatileBucketTable.m_val->m_pBuckets[4579])->pNext->pNext->Data
(PerPatchpointInfo) $28 = {
m_osrMethodCode = 140463706096928
m_patchpointCount = 2
m_flags = 1
m_patchpointId = 15635
}
At least m_osrMethodCode
seems valid and corresponds to this one:
CodeAddr: 00007fc0414e4520 (OptimizedTier1OSR)
NativeCodeVersion: 00007F7F7CCB3450
FYI this is the code versioning issue fix: #94542. Roughly corresponds to when I recall this OSR issue starting to pop up.
There's a new dump over in https://dev.azure.com/dnceng-public/public/_build/results?buildId=656632&view=ms.vss-test-web.build-test-results-tab.
The crashing registers look like:
(lldb) register read
General Purpose Registers:
rax = 0x000072bf637fe598
rbx = 0x000072c004ca7e18
rcx = 0x0000000000000272
rdx = 0x00005dd985fcdad0
rdi = 0x000072bf637fc940
rsi = 0x0000730018144402
rbp = 0x000072bf637fd640
rsp = 0x000072bf637fd598
r8 = 0x000072bf637fc908
r9 = 0x0000000000000011
r10 = 0x000072bf637fc940
r11 = 0x0000000000000000
r12 = 0x00005dd985fa4f70
r13 = 0x0000000000000001
r14 = 0x000072c004cb6c00
r15 = 0x000072c002812020
rip = 0x0000000000000000
The stored CONTEXT
in static memory looks like:
Rax = 0x0000000000021222
Rcx = 0x0000000000000272
Rdx = 0x00005dd985fcdad0
Rbx = 0x000072c004ca7e18
Rsp = 0x000072bf637fd598
Rbp = 0x000072bf637fd640
Rsi = 0x0000730018144402
Rdi = 0x000072bf637fc940
R8 = 0x000072bf637fc908
R9 = 0x0000000000000011
R10 = 0x0000000000000004
R11 = 0x0000000000000011
R12 = 0x00005dd985fa4f70
R13 = 0x0000000000000001
R14 = 0x000072c004cb6c00
R15 = 0x000072c002812020
Rip = 0x000073001863df00
Everything is reflected properly except rax
, r10
(these aren't restored by ClrRestoreNonvolatileContext
so that's expected) and rip
that has the 0x0.
In the dump these are the first instructions of that OSR function that is being resumed at:
LOAD:000073001863DF00 OnMessageWithTypesOSR proc near
LOAD:000073001863DF00
LOAD:000073001863DF00 var_8 = byte ptr -8
LOAD:000073001863DF00 arg_78 = qword ptr 80h
LOAD:000073001863DF00 arg_80 = qword ptr 88h
LOAD:000073001863DF00 arg_88 = qword ptr 90h
LOAD:000073001863DF00 arg_90 = qword ptr 98h
LOAD:000073001863DF00 arg_98 = qword ptr 0A0h
LOAD:000073001863DF00
LOAD:000073001863DF00 mov rax, [rbp+0]
LOAD:000073001863DF04 push rax
LOAD:000073001863DF05 sub rsp, 70h
LOAD:000073001863DF09 mov [rsp+78h+arg_98], r15
LOAD:000073001863DF11 mov [rsp+78h+arg_90], r14
LOAD:000073001863DF19 mov [rsp+78h+arg_88], r13
LOAD:000073001863DF21 mov [rsp+78h+arg_80], r12
LOAD:000073001863DF29 mov [rsp+78h+arg_78], rbx
LOAD:000073001863DF31 lea rbp, [rsp+78h+var_8]
LOAD:000073001863DF36 mov r15, [rbp+80h]
LOAD:000073001863DF3D mov r14, [rbp+78h]
LOAD:000073001863DF41 mov rbx, [rbp+70h]
LOAD:000073001863DF45 mov r13d, [rbp+6Ch]
LOAD:000073001863DF49 mov rdi, offset unk_730017DDCED0
Notice that in particular we can see that the last instruction there has no been executed yet since rdi
still has the value stored in the CONTEXT
. The only logical conclusion I can see is some bug in GC suspension that causes it to fail to resume the OSR functions correctly in the prolog when it gives up on a hijack.
Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.
@jakobbotsch which dump did you look at? I cannot make any progress with the LibraryImportGenerator.Unit.Tests crashdumps...
I was looking at the System.IO.Compression.Brotli.Tests
one. I had to copy the dotnet executable into the shared folder before opening the dump (with lldb --core <path to core.1000.57146> <correlation-payload/shared/Microsoft.NETCore.App/9.0.0/dotnet
).
I wonder if the xml test stack overflow is related? It is also using OSR stress: https://[helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-3756f90697b746dfba/System.Runtime.Serialization.Xml.ReflectionOnly.Tests/1/console.f8de9632.log?helixlogtype=result](https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-3756f90697b746dfba/System.Runtime.Serialization.Xml.ReflectionOnly.Tests/1/console.f8de9632.log?helixlogtype=result)
Failed in runtime-jit-experimental on Linux x64.
Error message:
cc @dotnet/jit-contrib.