dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.29k stars 4.74k forks source link

Test failure: readytorun/coreroot_determinism/coreroot_determinism/coreroot_determinism.cmd #101060

Closed amanasifkhalid closed 6 months ago

amanasifkhalid commented 7 months ago

Failed in runtime-jit-experimental on Linux x64.

Error message:

1 / 1 (100%, 1 failed): failed in 18056 msecs, exit code 139, expected 0: corerun /datadisks/disk1/work/AD5609CC/p/crossgen2/crossgen2.dll @/datadisks/disk1/work/AD5609CC/w/BD9609A8/e/readytorun/coreroot_determinism/coreroot_determinism/seed2/CPAOT-ret.out/System.Private.CoreLib.dll.rsp
  [createdump] Gathering state for process 3993 corerun
  [createdump] Crashing thread 0f99 signal 11 (000b)
  [createdump] Writing crash report to file /home/helixbot/dotnetbuild/dumps/coredump.3993.dmp.crashreport.json
  [createdump] Crash report successfully written
  [createdump] Writing minidump with heap to file /home/helixbot/dotnetbuild/dumps/coredump.3993.dmp
  [createdump] Written 202219520 bytes (49370 pages) to core file
  [createdump] Target process is alive
  [createdump] Dump successfully written in 1111ms

Return code:      1
Raw output file:      /datadisks/disk1/work/AD5609CC/w/BD9609A8/uploads/coreroot_determinism/output.txt
Raw output:
BEGIN EXECUTION
/datadisks/disk1/work/AD5609CC/p/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false -p System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true coreroot_determinism.dll ''
================================== Compiling with seed 1 ==================================
Locating output folder /datadisks/disk1/work/AD5609CC/w/BD9609A8/e/readytorun/coreroot_determinism/coreroot_determinism/seed1
Deleting 0 output folders
Successfully deleted 0 output folders in 175 msecs
Using dotnet: /datadisks/disk1/work/AD5609CC/p/corerun
1 / 1 (0%, 0 failed): launching: /datadisks/disk1/work/AD5609CC/p/corerun /datadisks/disk1/work/AD5609CC/p/crossgen2/crossgen2.dll @/datadisks/disk1/work/AD5609CC/w/BD9609A8/e/readytorun/coreroot_determinism/coreroot_determinism/seed1/CPAOT-ret.out/System.Private.CoreLib.dll.rsp
1 / 1 (100%, 0 failed): succeeded in 75047 msecs: corerun /datadisks/disk1/work/AD5609CC/p/crossgen2/crossgen2.dll @/datadisks/disk1/work/AD5609CC/w/BD9609A8/e/readytorun/coreroot_determinism/coreroot_determinism/seed1/CPAOT-ret.out/System.Private.CoreLib.dll.rsp
================================== Compiling with seed 2 ==================================
Locating output folder /datadisks/disk1/work/AD5609CC/w/BD9609A8/e/readytorun/coreroot_determinism/coreroot_determinism/seed2
Deleting 0 output folders
Successfully deleted 0 output folders in 78 msecs
Using dotnet: /datadisks/disk1/work/AD5609CC/p/corerun
1 / 1 (0%, 0 failed): launching: /datadisks/disk1/work/AD5609CC/p/corerun /datadisks/disk1/work/AD5609CC/p/crossgen2/crossgen2.dll @/datadisks/disk1/work/AD5609CC/w/BD9609A8/e/readytorun/coreroot_determinism/coreroot_determinism/seed2/CPAOT-ret.out/System.Private.CoreLib.dll.rsp
Compilation failed. /datadisks/disk1/work/AD5609CC/p/corerun /datadisks/disk1/work/AD5609CC/p/R2RTest/R2RTest.dll compile-directory -cr /datadisks/disk1/work/AD5609CC/p -in TestAssemblies --nojit --noexe --large-bubble --release --nocleanup -out seed2 failed with exit code 1
Xunit.Sdk.EqualException: Assert.Equal() Failure: Values differ
Expected: 100
Actual:   1
   at Xunit.Assert.Equal[T](T expected, T actual, IEqualityComparer`1 comparer) in /_/src/Microsoft.DotNet.XUnitAssert/src/EqualityAsserts.cs:line 168
   at Xunit.Assert.Equal[T](T expected, T actual) in /_/src/Microsoft.DotNet.XUnitAssert/src/EqualityAsserts.cs:line 90
   at __GeneratedMainWrapper.Main()
Expected: 100
Actual: 101
END EXECUTION - FAILED

cc @dotnet/jit-contrib.

dotnet-policy-service[bot] commented 7 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

jakobbotsch commented 6 months ago

This only failed once, so I'm going to unmark it as blocking (also doesn't repro easily it looks like).

jakobbotsch commented 6 months ago

From the dump it definitely looks OSR related. The stack trace looks like the following:

(lldb) clrstack -f
OS Thread Id: 0xf99 (1)
        Child SP               IP Call Site
000072F82D64AFA0 000072F82D077613 libc.so.6!___lldb_unnamed_symbol3325 + 115
000072F82D64D100 000072F82D0606CA libc.so.6!fprintf + 154
000072F82D64D1E0 000072F82CEB4787 libcoreclr.so!PROCCreateCrashDump(std::vector<char const*, std::allocator<char const*> >&, char*, int, bool) + 1111 at /__w/1/s/src/coreclr/pal/src/thread/process.cpp:2320
000072F82D64D250 000072F82CEB5A8E libcoreclr.so!PROCCreateCrashDumpIfEnabled + 2942 at /__w/1/s/src/coreclr/pal/src/thread/process.cpp:0
000072F82D64D2E0 000072F82CE533A9 libcoreclr.so!invoke_previous_action(sigaction*, int, siginfo_t*, void*, bool) + 377 at /__w/1/s/src/coreclr/pal/src/exception/signal.cpp:397
000072F82D64D330 000072F82CE525F5 libcoreclr.so!sigsegv_handler(int, siginfo_t*, void*) + 341 at /__w/1/s/src/coreclr/pal/src/exception/signal.cpp:631
000072F82D64D380 000072F82D042520 libc.so.6!___lldb_unnamed_symbol3237 + 1
00007FFF40434DC8 0000000000000000
00007FFF40434DD0 000072F7B0BDA404
00007FFF40434E80 000072F7B0BDAEBF
00007FFF40434EC0 000072F7B0BD46D4
00007FFF404350D0 000072F7B0B8BCD2
00007FFF40435170 000072F7B0B7D6E0

(not sure why managed names are missing. Opened https://github.com/dotnet/diagnostics/issues/4637.)

The previous valid return address is:

(lldb) ip2md 72F7B0BDA404
MethodDesc:   000072f7b0a57a90
Method Name:          ILCompiler.LazyGenericsSupport+GraphBuilder.<.ctor>g__AddProcessType|0_1(Internal.TypeSystem.TypeDesc, <>c__DisplayClass0_0 ByRef)
Class:                000072f7b0a621e8
MethodTable:          000072f7b0a57bc8
mdToken:              0000000006000DB6
Module:               000072f7af160888
IsJitted:             yes
Current CodeAddr:     000072f7b0bda3a0
Version History:
  ILCodeVersion:      0000000000000000
  ReJIT ID:           0
  IL Addr:            000072f7ac0b796c
     CodeAddr:           000072f7b0916f40  (OptimizedTier1OSR)
     NativeCodeVersion:  00005C2444DC3C00
     CodeAddr:           000072f7b0bda3a0  (QuickJitted + Instrumented)
     NativeCodeVersion:  000072B6F8589A80
     CodeAddr:           000072f7b0bdb4c0  (OptimizedTier1OSR)
     NativeCodeVersion:  00005C2444DE84A0
     CodeAddr:           0000000000000000  (OptimizedTier1)
     NativeCodeVersion:  000072B6F86842E0
     CodeAddr:           000072f7b08f0640  (QuickJitted)
     NativeCodeVersion:  0000000000000000
...
(lldb) clru 72F7B0BDA404
Normal JIT generated code
ILCompiler.LazyGenericsSupport+GraphBuilder.<.ctor>g__AddProcessType|0_1(Internal.TypeSystem.TypeDesc, <>c__DisplayClass0_0 ByRef)
ilAddr is 000072F7AC0B796C pImport is 000000000427EC30
Begin 000072F7B0BDA3A0, size 26d
000072f7b0bda3a0 55                   push    rbp
000072f7b0bda3a1 4881eca0000000       sub     rsp, 0xa0
000072f7b0bda3a8 488dac24a0000000     lea     rbp, [rsp + 0xa0]
000072f7b0bda3b0 c4413857c0           vxorps  xmm8, xmm8, xmm8
000072f7b0bda3b5 c57e7f8560ffffff     vmovdqu ymmword ptr [rbp - 0xa0], ymm8
000072f7b0bda3bd c57e7f4580           vmovdqu ymmword ptr [rbp - 0x80], ymm8
000072f7b0bda3c2 c57e7f45a0           vmovdqu ymmword ptr [rbp - 0x60], ymm8
000072f7b0bda3c7 33c0                 xor     eax, eax
000072f7b0bda3c9 488945c0             mov     qword ptr [rbp - 0x40], rax
000072f7b0bda3cd 48897dd0             mov     qword ptr [rbp - 0x30], rdi
000072f7b0bda3d1 488975c8             mov     qword ptr [rbp - 0x38], rsi
000072f7b0bda3d5 c78578ffffff01000000 mov     dword ptr [rbp - 0x88], 0x1
000072f7b0bda3df 8b8578ffffff         mov     eax, dword ptr [rbp - 0x88]
000072f7b0bda3e5 ffc8                 dec     eax
000072f7b0bda3e7 898578ffffff         mov     dword ptr [rbp - 0x88], eax
000072f7b0bda3ed 83bd78ffffff00       cmp     dword ptr [rbp - 0x88], 0x0
000072f7b0bda3f4 7f0e                 jg      0x72f7b0bda404
000072f7b0bda3f6 488dbd78ffffff       lea     rdi, [rbp - 0x88]
000072f7b0bda3fd 33f6                 xor     esi, esi
000072f7b0bda3ff e82ce1ee7b           call    0x72f82cac8530 (JitHelp: CORINFO_HELP_PATCHPOINT)
>>> 000072f7b0bda404 488b7dd0             mov     rdi, qword ptr [rbp - 0x30]
000072f7b0bda408 33f6                 xor     esi, esi

So seems like something went wrong while transitioning from the patchpoint.

@AndyAyersMS do you have any thoughts? I don't suppose there are some diagnostics available to e.g. obtain the particular PerPatchpointInfo relevant here?

jakobbotsch commented 6 months ago

The PerPatchpointInfo is this one:

(lldb) p *(PerPatchpointInfo*)0x000072f7b0cc0638
(PerPatchpointInfo) $17 = {
  m_osrMethodCode = 126408147711168
  m_patchpointCount = 2
  m_flags = 1
  m_patchpointId = 1538
}

That method code corresponds to

     CodeAddr:           000072f7b0bdb4c0  (OptimizedTier1OSR)
     NativeCodeVersion:  00005C2444DE84A0

seen above. Unclear whether the issue here is OSR related or whether something bad just happened after transitioning.

jakobbotsch commented 6 months ago

This is the code for the OSR version:

(lldb) clru 72f7b0bdb4c0
Normal JIT generated code
ILCompiler.LazyGenericsSupport+GraphBuilder.<.ctor>g__AddProcessType|0_1(Internal.TypeSystem.TypeDesc, <>c__DisplayClass0_0 ByRef)
ilAddr is 000072F7AC0B796C pImport is 00007ECC5801F550
Begin 000072F7B0BDB4C0, size 147
>>> 000072f7b0bdb4c0 488b4500             mov     rax, qword ptr [rbp]
000072f7b0bdb4c4 50                   push    rax
000072f7b0bdb4c5 4883ec20             sub     rsp, 0x20
000072f7b0bdb4c9 4c89bc24c8000000     mov     qword ptr [rsp + 0xc8], r15
000072f7b0bdb4d1 4c89b424c0000000     mov     qword ptr [rsp + 0xc0], r14
000072f7b0bdb4d9 48899c24b8000000     mov     qword ptr [rsp + 0xb8], rbx
000072f7b0bdb4e1 488d6c2420           lea     rbp, [rsp + 0x20]
000072f7b0bdb4e6 33c0                 xor     eax, eax
000072f7b0bdb4e8 488945e0             mov     qword ptr [rbp - 0x20], rax
000072f7b0bdb4ec 488b9d80000000       mov     rbx, qword ptr [rbp + 0x80]
000072f7b0bdb4f3 4c8b7d78             mov     r15, qword ptr [rbp + 0x78]
000072f7b0bdb4f7 4885db               test    rbx, rbx
000072f7b0bdb4fa 0f84e7000000         je      0x72f7b0bdb5e7
000072f7b0bdb500 498b3f               mov     rdi, qword ptr [r15]
000072f7b0bdb503 40383f               cmp     byte ptr [rdi], dil
000072f7b0bdb506 488d55e0             lea     rdx, [rbp - 0x20]
000072f7b0bdb50a 488bf3               mov     rsi, rbx
000072f7b0bdb50d ff153d8f61fe         call    qword ptr [rip - 0x19e70c3]
000072f7b0bdb513 85c0                 test    eax, eax
000072f7b0bdb515 742b                 je      0x72f7b0bdb542
000072f7b0bdb517 498b7f08             mov     rdi, qword ptr [r15 + 0x8]
000072f7b0bdb51b ff4714               inc     dword ptr [rdi + 0x14]
000072f7b0bdb51e 488b5708             mov     rdx, qword ptr [rdi + 0x8]
000072f7b0bdb522 8b7710               mov     esi, dword ptr [rdi + 0x10]
000072f7b0bdb525 397208               cmp     dword ptr [rdx + 0x8], esi
000072f7b0bdb528 0f8693000000         jbe     0x72f7b0bdb5c1
000072f7b0bdb52e 8d4601               lea     eax, [rsi + 0x1]
000072f7b0bdb531 894710               mov     dword ptr [rdi + 0x10], eax
000072f7b0bdb534 4863f6               movsxd  rsi, esi
000072f7b0bdb537 488bfa               mov     rdi, rdx
000072f7b0bdb53a 488bd3               mov     rdx, rbx
000072f7b0bdb53d e8fe8c15fe           call    0x72f7aed34240 (System.Runtime.CompilerServices.CastHelpers.StelemRef(System.Object[], IntPtr, System.Object), mdToken: 0000000006006843)
000072f7b0bdb542 4c8bf3               mov     r14, rbx
000072f7b0bdb545 48be10a8fcaff7720000 movabs  rsi, 0x72f7affca810
000072f7b0bdb54f 493936               cmp     qword ptr [r14], rsi
000072f7b0bdb552 7556                 jne     0x72f7b0bdb5aa
000072f7b0bdb554 4533f6               xor     r14d, r14d
000072f7b0bdb557 4d85f6               test    r14, r14
000072f7b0bdb55a 0f8595000000         jne     0x72f7b0bdb5f5
000072f7b0bdb560 48bf10a8fcaff7720000 movabs  rdi, 0x72f7affca810
000072f7b0bdb56a 48393b               cmp     qword ptr [rbx], rdi
000072f7b0bdb56d 7560                 jne     0x72f7b0bdb5cf
000072f7b0bdb56f 48837b7000           cmp     qword ptr [rbx + 0x70], 0x0
000072f7b0bdb574 740b                 je      0x72f7b0bdb581
000072f7b0bdb576 4c8b7370             mov     r14, qword ptr [rbx + 0x70]
000072f7b0bdb57a bbffffffff           mov     ebx, 0xffffffff
000072f7b0bdb57f eb1d                 jmp     0x72f7b0bdb59e
000072f7b0bdb581 488bfb               mov     rdi, rbx
000072f7b0bdb584 ff15864237ff         call    qword ptr [rip - 0xc8bd7a]
000072f7b0bdb58a ebea                 jmp     0x72f7b0bdb576
000072f7b0bdb58c 3bdf                 cmp     ebx, edi
000072f7b0bdb58e 7351                 jae     0x72f7b0bdb5e1
000072f7b0bdb590 498b7cde10           mov     rdi, qword ptr [r14 + 8*rbx + 0x10]
000072f7b0bdb595 498bf7               mov     rsi, r15
000072f7b0bdb598 ff15b2b5e3ff         call    qword ptr [rip - 0x1c4a4e]
000072f7b0bdb59e ffc3                 inc     ebx
000072f7b0bdb5a0 418b7e08             mov     edi, dword ptr [r14 + 0x8]
000072f7b0bdb5a4 3bfb                 cmp     edi, ebx
000072f7b0bdb5a6 7fe4                 jg      0x72f7b0bdb58c
000072f7b0bdb5a8 eb3d                 jmp     0x72f7b0bdb5e7
000072f7b0bdb5aa 488bf3               mov     rsi, rbx
000072f7b0bdb5ad 48bff064cfaff7720000 movabs  rdi, 0x72f7afcf64f0
000072f7b0bdb5b7 e8e47513fe           call    0x72f7aed12ba0 (System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object), mdToken: 0000000006006838)
000072f7b0bdb5bc 4c8bf0               mov     r14, rax
000072f7b0bdb5bf eb96                 jmp     0x72f7b0bdb557
000072f7b0bdb5c1 488bf3               mov     rsi, rbx
000072f7b0bdb5c4 ff158ead4dfe         call    qword ptr [rip - 0x1b25272]
000072f7b0bdb5ca e973ffffff           jmp     0x72f7b0bdb542
000072f7b0bdb5cf 488bfb               mov     rdi, rbx
000072f7b0bdb5d2 488b03               mov     rax, qword ptr [rbx]
000072f7b0bdb5d5 488b4048             mov     rax, qword ptr [rax + 0x48]
000072f7b0bdb5d9 ff5038               call    qword ptr [rax + 0x38]
000072f7b0bdb5dc 4c8bf0               mov     r14, rax
000072f7b0bdb5df eb99                 jmp     0x72f7b0bdb57a
000072f7b0bdb5e1 e8ca6eee7b           call    0x72f82cac24b0 (JitHelp: CORINFO_HELP_RNGCHKFAIL)
000072f7b0bdb5e6 cc                   int3
000072f7b0bdb5e7 4881c4b8000000       add     rsp, 0xb8
000072f7b0bdb5ee 5b                   pop     rbx
000072f7b0bdb5ef 415e                 pop     r14
000072f7b0bdb5f1 415f                 pop     r15
000072f7b0bdb5f3 5d                   pop     rbp
000072f7b0bdb5f4 c3                   ret
000072f7b0bdb5f5 498b7e28             mov     rdi, qword ptr [r14 + 0x28]
000072f7b0bdb5f9 498bf7               mov     rsi, r15
000072f7b0bdb5fc ff154eb5e3ff         call    qword ptr [rip - 0x1c4ab2]
000072f7b0bdb602 e959ffffff           jmp     0x72f7b0bdb560

I'm not sure what the mov rax, [rbp+0] is doing there. Something related to OSR prolog..?

Both frame 7 and frame 8 have the same frame pointer:

(lldb) fr sel 7
frame #7: 0x0000000000000000
error: core file does not contain 0x0
(lldb) register read
General Purpose Registers:
       rax = 0x000072f82d6376a0
       rbx = 0x000072b799d213b8
       rcx = 0x0000000000000602
       rdx = 0x000072f82d638160
       rdi = 0x00007fff40434170
       rsi = 0x000072f7b0bda404
       rbp = 0x00007fff40434e70
       rsp = 0x00007fff40434dc8
        r8 = 0x00007fff40434138
        r9 = 0x000072f7ac145008
       r10 = 0x00007fff40434170
       r11 = 0x0000000000000000
       r12 = 0x000072b796044770
       r13 = 0x000072b7962a1390
       r14 = 0x000072f7afcf4fe0
       r15 = 0x000072b799659e40
       rip = 0x0000000000000000
...
(lldb) fr sel 8
frame #8: 0x000072f7b0bda404
->  0x72f7b0bda404: mov    rdi, qword ptr [rbp - 0x30]
    0x72f7b0bda408: xor    esi, esi
    0x72f7b0bda40a: call   qword ptr [rip - 0xf03158]
    0x72f7b0bda410: test   eax, eax
(lldb) register read
General Purpose Registers:
       rbx = 0x000072b799d213b8
       rbp = 0x00007fff40434e70
       rsp = 0x00007fff40434dd0
       r12 = 0x000072b796044770
       r13 = 0x000072b7962a1390
       r14 = 0x000072f7afcf4fe0
       r15 = 0x000072b799659e40
       rip = 0x000072f7b0bda404
16 registers were unavailable.

This probably indicates that we're returning. Indeed, it appears we tried to return to 0:

(lldb) p *(void**)$rsp
(void *) $19 = 0x0000000000000000

However, it's quite odd that RBP appears to be the same as frame 8. That means we should be about to return to the tier0 version of a method we transitioned out of. That shouldn't be possible. The return address of the tier 0 version right after the patchpoint is very close by:

(lldb) p *(void**)($rsp-8)
(void *) $21 = 0x000072f7b0bda404
jakobbotsch commented 6 months ago

Another pointer: rsi still contains the return address passed to JIT_Patchpoint, so it seems likely that we haven't managed to transition successfully, since the OSR version of the code would trash rsi if we did transition successfully.

jakobbotsch commented 6 months ago

It looks like all others threads are suspended for GC:

DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   1    1      f99 00005C24445275E0    20062 Cooperative 000072B799D74D80:000072B799D75010 00005C244450CFC0 -00001 Ukn
   6    2      f9f 00005C244452FB20    21220 Preemptive  000072B796802500:000072B796804298 00005C244450CFC0 -00001 Ukn (Finalizer)
   7    3      fa3 00005C244453D240    21220 Preemptive  0000000000000000:0000000000000000 00005C244450CFC0 -00001 Ukn
   8    4      fa4 00005C2444C7BCD0  1021220 Preemptive  000072B799DE01C0:000072B799DE01C0 00005C244450CFC0 -00001 Ukn (GC) (Threadpool Worker)
   9    5      fa5 00005C2444C7F660  1021220 Preemptive  0000000000000000:0000000000000000 00005C244450CFC0 -00001 Ukn (Threadpool Worker)
  10    6      fa6 000072B6F000FDB0  1021220 Preemptive  000072B799DCE608:000072B799DCF0B0 00005C244450CFC0 -00001 Ukn (Threadpool Worker)
  11    7      fe6 000072B6F016EA60  1021220 Preemptive  000072B799D7E990:000072B799D7F0B0 00005C244450CFC0 -00001 Ukn (Threadpool Worker)
  12    8      fe7 000072B6EC122990    21220 Preemptive  0000000000000000:0000000000000000 00005C244450CFC0 -00001 Ukn
  13    9      fea 000072B6EC08C210  1021220 Preemptive  000072B799DDFB60:000072B799DE0178 00005C244450CFC0 -00001 Ukn (Threadpool Worker)
  14   10      fec 000072B6E001D420  1021220 Preemptive  000072B799D95008:000072B799D95010 00005C244450CFC0 -00001 Ukn (Threadpool Worker)

and everyone is waiting for this transitioning thread to suspend. Perhaps some edge case around EE suspension/hijacking during OSR transitions?

AndyAyersMS commented 6 months ago

Yeah I have similar looking crashes where we end up trying to execute at address zero. Haven't gotten very far trying to debug.

As far as I know nothing has changed in the OSR prolog generation and/or the patchpoint helper, so I think the problem lies elsewhere, but am not sure.

jakobbotsch commented 6 months ago

We've definitely called ClrRestoreNonvolatileContext not very long ago. r10 still has the address of the CONTEXT that was passed to restore. In the dump above that address is 0x7fff40434170 and you arrive at the same value from some careful frame pointer arithmetic.

Sadly the CONTEXT on the stack has been overwritten by the functions used for the dump generation code, so it's hard to tell exactly what context was resumed with (and whether the problem happens right at resumption or later). Perhaps the other dump has more information.

AndyAyersMS commented 6 months ago

@janvorli this is similar to the crash dump I was asking you about.

Also seems to be correlated with either unusual patchpoint placement (via DOTNET_JitRandomOnStackReplacement=15) and/or very frequent OSR transitions (via DOTNET_TC_OnStackReplacement_InitialCounter=1 and DOTNET_OSR_HitLimit=2). So I suspect some kind of race or rare interaction. In normal usage we very rarely do OSR transitions.

I have never been able to repro this locally.

I was going to try running with both OSR stress and GC stress to see if perhaps that would lead to a more consistent repro.

jakobbotsch commented 6 months ago

Actually it looks like the signal handler is called on a separate stack, so I am not sure why the CONTEXT record on the stack appears to be corrupted.

* thread #1, name = 'corerun', stop reason = signal SIGSEGV
    frame #0: sp=0x000072f82d64afa0 fp=0x000072f82d64d240 pc=0x000072f82d077613 libc.so.6`___lldb_unnamed_symbol3325 + 115
    frame #1: sp=0x000072f82d64d100 fp=0x000072f82d64d240 pc=0x000072f82d0606ca libc.so.6`fprintf + 154
    frame #2: sp=0x000072f82d64d1e0 fp=0x000072f82d64d240 pc=0x000072f82ceb4787 libcoreclr.so`PROCCreateCrashDump(argv=size=16, errorMessageBuffer=0x0000000000000000, cbErrorMessageBuffer=0, serialize=<unavailable>) + 1111 at process.cpp:2318 [opt]
    frame #3: sp=0x000072f82d64d250 fp=0x000072f82d64d2d0 pc=0x000072f82ceb5a8e libcoreclr.so`::PROCCreateCrashDumpIfEnabled(signal=11, siginfo=0x000072f82d64d4b0, serialize=true) + 2942 at process.cpp:2524 [opt]
    frame #4: sp=0x000072f82d64d2e0 fp=0x000072f82d64d320 pc=0x000072f82ce533a9 libcoreclr.so`invoke_previous_action(action=0x000072f82cfa6e98, code=11, siginfo=0x000072f82d64d4b0, context=0x000072f82d64d380, signalRestarts=true) + 377 at signal.cpp:394 [opt]
    frame #5: sp=0x000072f82d64d330 fp=0x000072f82d64d370 pc=0x000072f82ce525f5 libcoreclr.so`sigsegv_handler(code=11, siginfo=0x000072f82d64d4b0, context=0x000072f82d64d380) + 341 at signal.cpp:630 [opt]
    frame #6: sp=0x000072f82d64d380 fp=0x00007fff40434e70 pc=0x000072f82d042520 libc.so.6`___lldb_unnamed_symbol3237 + 1
  * frame #7: sp=0x00007fff40434dc8 fp=0x00007fff40434e70 pc=0x0000000000000000
    frame #8: sp=0x00007fff40434dd0 fp=0x00007fff40434e70 pc=0x000072f7b0bda404
...
(lldb) p (CONTEXT*)$r10
(CONTEXT *) $60 = 0x00007fff40434170
(lldb) p (CONTEXT*)($rbp-0xA8-8-0xC50)
(CONTEXT *) $61 = 0x00007fff40434170
(lldb) p *(CONTEXT*)$r10
(CONTEXT) $62 = {
  Rax = 0
  Rcx = 71776119061217280
  Rdx = 0
  Rbx = 4294967295
  Rsp = 4294967295
  Rbp = 4294967295
  Rsi = 4294967295
  Rdi = 0
  R8 = 0
  R9 = 34
  R10 = 4294967290
  R11 = 4294967299993
  R12 = 0
  R13 = 0
  R14 = 0
  R15 = 0
  Rip = 0

Maybe we should try adding some debug-only code to get some more context the next time we get a dump... have you only seen this issue on linux-x64 @AndyAyersMS?

jakobbotsch commented 6 months ago

One thing I wonder is if there should be a volatile read on PerPatchpointInfo::m_osrMethodCode. Otherwise I think clang can legally compile https://github.com/dotnet/runtime/blob/5fe1a569099272cbe3cec4972c52e5531223b58e/src/coreclr/vm/jithelpers.cpp#L5160-L5163 to

PCODE osrMethodCode = ppInfo->m_osrMethodCode;
if (ppInfo->m_osrMethodCode == NULL)
{
  ...
}
// use osrMethodCode that can still be null

This is more of a theoretical problem though, I do not see this codegen in my dump.

AndyAyersMS commented 6 months ago

Yeah, I believe it is linux only (though perhaps also on arm64) so changes in clang behavior might be a possibility.

jakobbotsch commented 6 months ago

Can you link some of the other reports/dumps?

AndyAyersMS commented 6 months ago

This is the one I was looking at: https://dev.azure.com/dnceng-public/public/_build/results?buildId=650270&view=ms.vss-test-web.build-test-results-tab

AndyAyersMS commented 6 months ago

This is likely the same problem but hasn't happened recently: https://github.com/dotnet/runtime/issues/98292

jakobbotsch commented 6 months ago

The other dump similarly has all threads but the transitioning one waiting for the EE to be suspended.

jakobbotsch commented 6 months ago

This other dump has a 0 entry for the OptimizedTier1 code which seems surprising:

(lldb) ip2md 7fc0410f8f1e
MethodDesc:   00007fc0337a7338
Method Name:          Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.SyntaxToken+SyntaxIdentifierExtended.get_ValueText()
Class:                00007fc0337bbea8
MethodTable:          00007fc0337a74d0
mdToken:              0000000006009795
Module:               00007fc031a3a3d8
IsJitted:             yes
Current CodeAddr:     00007fc0410f8ef0
Version History:
  ILCodeVersion:      0000000000000000
  ReJIT ID:           0
  IL Addr:            00007f7f79c2f6de
     CodeAddr:           00007fc03cd8abe0  (OptimizedTier1OSR)
     NativeCodeVersion:  00007F7F5413F840
     CodeAddr:           00007fc0410f8ef0  (QuickJitted + Instrumented)
     NativeCodeVersion:  00007F7F7C992190
     CodeAddr:           00007fc0414e4520  (OptimizedTier1OSR)
     NativeCodeVersion:  00007F7F7CCB3450
     CodeAddr:           0000000000000000  (OptimizedTier1)
     NativeCodeVersion:  00007F7F7C570FA0
     CodeAddr:           00007fc03cd8ab40  (QuickJitted)
     NativeCodeVersion:  0000000000000000

The PerPatchpointInfo is this one:

(lldb) p *(PerPatchpointInfo*)(((LoaderAllocator*)((Module*)0x00007fc031a3a3d8)->m_loaderAllocator)->m_onStackRep
lacementManager->m_jitPatchpointTable.m_pVolatileBucketTable.m_val->m_pBuckets[4579])->pNext->pNext->Data
(PerPatchpointInfo) $28 = {
  m_osrMethodCode = 140463706096928
  m_patchpointCount = 2
  m_flags = 1
  m_patchpointId = 15635
}

At least m_osrMethodCode seems valid and corresponds to this one:

     CodeAddr:           00007fc0414e4520  (OptimizedTier1OSR)
     NativeCodeVersion:  00007F7F7CCB3450
AndyAyersMS commented 6 months ago

FYI this is the code versioning issue fix: #94542. Roughly corresponds to when I recall this OSR issue starting to pop up.

jakobbotsch commented 6 months ago

There's a new dump over in https://dev.azure.com/dnceng-public/public/_build/results?buildId=656632&view=ms.vss-test-web.build-test-results-tab.

The crashing registers look like:

(lldb) register read
General Purpose Registers:
       rax = 0x000072bf637fe598
       rbx = 0x000072c004ca7e18
       rcx = 0x0000000000000272
       rdx = 0x00005dd985fcdad0
       rdi = 0x000072bf637fc940
       rsi = 0x0000730018144402
       rbp = 0x000072bf637fd640
       rsp = 0x000072bf637fd598
        r8 = 0x000072bf637fc908
        r9 = 0x0000000000000011
       r10 = 0x000072bf637fc940
       r11 = 0x0000000000000000
       r12 = 0x00005dd985fa4f70
       r13 = 0x0000000000000001
       r14 = 0x000072c004cb6c00
       r15 = 0x000072c002812020
       rip = 0x0000000000000000

The stored CONTEXT in static memory looks like:

  Rax = 0x0000000000021222
  Rcx = 0x0000000000000272
  Rdx = 0x00005dd985fcdad0
  Rbx = 0x000072c004ca7e18
  Rsp = 0x000072bf637fd598
  Rbp = 0x000072bf637fd640
  Rsi = 0x0000730018144402
  Rdi = 0x000072bf637fc940
  R8 = 0x000072bf637fc908
  R9 = 0x0000000000000011
  R10 = 0x0000000000000004
  R11 = 0x0000000000000011
  R12 = 0x00005dd985fa4f70
  R13 = 0x0000000000000001
  R14 = 0x000072c004cb6c00
  R15 = 0x000072c002812020
  Rip = 0x000073001863df00

Everything is reflected properly except rax, r10 (these aren't restored by ClrRestoreNonvolatileContext so that's expected) and rip that has the 0x0.

In the dump these are the first instructions of that OSR function that is being resumed at:

LOAD:000073001863DF00 OnMessageWithTypesOSR proc near
LOAD:000073001863DF00
LOAD:000073001863DF00 var_8           = byte ptr -8
LOAD:000073001863DF00 arg_78          = qword ptr  80h
LOAD:000073001863DF00 arg_80          = qword ptr  88h
LOAD:000073001863DF00 arg_88          = qword ptr  90h
LOAD:000073001863DF00 arg_90          = qword ptr  98h
LOAD:000073001863DF00 arg_98          = qword ptr  0A0h
LOAD:000073001863DF00
LOAD:000073001863DF00                 mov     rax, [rbp+0]
LOAD:000073001863DF04                 push    rax
LOAD:000073001863DF05                 sub     rsp, 70h
LOAD:000073001863DF09                 mov     [rsp+78h+arg_98], r15
LOAD:000073001863DF11                 mov     [rsp+78h+arg_90], r14
LOAD:000073001863DF19                 mov     [rsp+78h+arg_88], r13
LOAD:000073001863DF21                 mov     [rsp+78h+arg_80], r12
LOAD:000073001863DF29                 mov     [rsp+78h+arg_78], rbx
LOAD:000073001863DF31                 lea     rbp, [rsp+78h+var_8]
LOAD:000073001863DF36                 mov     r15, [rbp+80h]
LOAD:000073001863DF3D                 mov     r14, [rbp+78h]
LOAD:000073001863DF41                 mov     rbx, [rbp+70h]
LOAD:000073001863DF45                 mov     r13d, [rbp+6Ch]
LOAD:000073001863DF49                 mov     rdi, offset unk_730017DDCED0

Notice that in particular we can see that the last instruction there has no been executed yet since rdi still has the value stored in the CONTEXT. The only logical conclusion I can see is some bug in GC suspension that causes it to fail to resume the OSR functions correctly in the prolog when it gives up on a hijack.

dotnet-policy-service[bot] commented 6 months ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

AndyAyersMS commented 6 months ago

@jakobbotsch which dump did you look at? I cannot make any progress with the LibraryImportGenerator.Unit.Tests crashdumps...

jakobbotsch commented 6 months ago

I was looking at the System.IO.Compression.Brotli.Tests one. I had to copy the dotnet executable into the shared folder before opening the dump (with lldb --core <path to core.1000.57146> <correlation-payload/shared/Microsoft.NETCore.App/9.0.0/dotnet).

AndyAyersMS commented 6 months ago

I wonder if the xml test stack overflow is related? It is also using OSR stress: https://[helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-3756f90697b746dfba/System.Runtime.Serialization.Xml.ReflectionOnly.Tests/1/console.f8de9632.log?helixlogtype=result](https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-3756f90697b746dfba/System.Runtime.Serialization.Xml.ReflectionOnly.Tests/1/console.f8de9632.log?helixlogtype=result)