dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.25k stars 4.73k forks source link

.NET Core 3.1 - AsmMacros.inc - memory operand not aligned by 16 causing Access violation (0xc0000005) #65787

Open TalAloni opened 2 years ago

TalAloni commented 2 years ago

Description

The issue was observed twice on .NET Core 3.1.10 (out of >1000 repro attempts). We have captured two memory dumps and used them to pinpoint the source of the issue, which seems to be an incorrect usage of movdqa assembly instruction in AsmMacros.inc for operand that is not divisible by 16.

image

image

According to the definition of the assembly instruction movdqa (see here and here), the memory operand should be aligned by 16. As can be seen, in both cases the operand is not divisible by 16, which triggers an Access violation (0xc0000005).

For reference, here is the a link to the source code containing this assembly instruction: https://github.com/dotnet/coreclr/blob/b4f19e3e849044ffe4feb9f7788edda9a129a773/src/vm/amd64/AsmMacros.inc#L78

Reproduction Steps

This issue is very hard to reproduce, only 2 of our 1000+ attempts reproduced the issue.

Expected behavior

We expect that the process would not crash

Actual behavior

The process crash due to Access violation (0xc0000005)

Regression?

We have migrated from .NET Framework 4.7.2 to .NET Core 3.1.10, the issue did not occur under .NET Framework 4.7.2

Known Workarounds

There isn't any workaround as far as I can tell

Configuration

.NET Core 3.1.10, Windows Server 2012 R2, 64-bit process.

Other information

No response

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

jkotas commented 2 years ago

rsp reqister is expected to be aligned at 16 bytes on x64. If you are seeing misaligned access here, the problem is that some code up the stack is creating misaligned rsp. Could you please share the stacktrace of the crash or find the code that is creating the misaligned rsp?

TalAloni commented 2 years ago

Thanks, here is the top of the stack (the rest is user code):

OS Thread Id: 0x1f30 (46)
Current frame: coreclr!ThePreStub + 0x30 [F:\workspace\_work\1\s\src\vm\amd64\ThePreStubAMD64.asm:13]
Child-SP         RetAddr          Caller, Callee
000000426865DA10 00007ffc99965a65 coreclr!MethodTable::CheckRunClassInitThrowing + 0xb5 [f:\workspace\_work\1\s\src\vm\methodtable.cpp:3762], calling coreclr!MethodTable::DoRunClassInitThrowing [f:\workspace\_work\1\s\src\vm\methodtable.cpp:3429]
000000426865DA18 00007ffc3d5e9870 (MethodDesc 00007ffc39ff1790 + 0x30 System.Environment.get_CurrentManagedThreadId()), calling coreclr!ThreadNative::GetManagedThreadId [f:\workspace\_work\1\s\src\vm\comsynchronizable.cpp:687]
000000426865DA28 00007ffc99971ca2 coreclr!JIT_GenericHandle_Framed + 0xca [f:\workspace\_work\1\s\src\vm\jithelpers.cpp:3855], calling coreclr!LazyMachStateCaptureState [F:\workspace\_work\1\s\src\vm\amd64\getstate.asm:62]
000000426865DA40 00007ffc99a2b119 coreclr!JIT_GetSharedNonGCStaticBase_Helper + 0xd9 [f:\workspace\_work\1\s\src\vm\jithelpers.cpp:1414], calling coreclr!__security_check_cookie [d:\agent\_work\2\s\src\vctools\crt\vcstartup\src\gs\amd64\amdsecgs.asm:45]
000000426865DAB0 00007ffc417f6ce6 (MethodDesc 00007ffc41148a50 + 0x286 System.Linq.Enumerable.Select[[System.__Canon, System.Private.CoreLib],[System.UInt64, System.Private.CoreLib]](System.Collections.Generic.IEnumerable`1<System.__Canon>, System.Func`2<System.__Canon,UInt64>)), calling 00007ffc417f1f10 (stub for System.Linq.Enumerable+SelectListIterator`2[[System.__Canon, System.Private.CoreLib],[System.UInt64, System.Private.CoreLib]]..ctor(System.Collections.Generic.List`1, System.Func`2))
OS Thread Id: 0x1284 (27)
Current frame: coreclr!ThePreStub + 0x30 [F:\workspace\_work\1\s\src\vm\amd64\ThePreStubAMD64.asm:13]
Child-SP         RetAddr          Caller, Callee
000000A351BBB878 00007fff97c98638 (MethodDesc 00007fff9187fba8 + 0x58 System.Collections.Generic.List`1[[System.__Canon, System.Private.CoreLib]].System.Collections.Generic.IEnumerable<T>.GetEnumerator()), calling coreclr!JIT_CheckedWriteBarrier [F:\workspace\_work\1\s\src\vm\amd64\JitHelpers_Fast.asm:436]
000000A351BBB888 00007fff977d4d44 (MethodDesc 00007fff91a85fa0 + 0x44 System.Collections.Generic.List`1+Enumerator[[System.__Canon, System.Private.CoreLib]].MoveNext()), calling coreclr!JIT_CheckedWriteBarrier [F:\workspace\_work\1\s\src\vm\amd64\JitHelpers_Fast.asm:436]
000000A351BBB898 00007ffff1295560 coreclr!JIT_GetSharedGCThreadStaticBase + 0x70 [f:\workspace\_work\1\s\src\vm\jithelpers.cpp:1838], calling coreclr!__security_check_cookie [d:\agent\_work\2\s\src\vctools\crt\vcstartup\src\gs\amd64\amdsecgs.asm:45]
000000A351BBB8C8 00007fff94e9b3e4 (MethodDesc 00007fff91ce1828 + 0x194 System.Linq.Enumerable+SelectManySingleSelectorIterator`2[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].MoveNext()), calling coreclr!JIT_WriteBarrier [F:\workspace\_work\1\s\src\vm\amd64\JitHelpers_Fast.asm:466]
000000A351BBB918 00007ffff1bae158 (MethodDesc 00007fff96bba3d8 + 0x128 System.Linq.Enumerable.TryGetLast[[System.__Canon, System.Private.CoreLib]](System.Collections.Generic.IEnumerable`1<System.__Canon>, System.Func`2<System.__Canon,Boolean>, Boolean ByRef))
000000A351BBB9A8 00007ffff1bade5b (MethodDesc 00007fff96bba0d0 + 0x2b System.Linq.Enumerable.LastOrDefault[[System.__Canon, System.Private.CoreLib]](System.Collections.Generic.IEnumerable`1<System.__Canon>, System.Func`2<System.__Canon,Boolean>)), calling 00007fff965f3bc8 (stub for System.Linq.Enumerable.TryGetLast[[System.__Canon, System.Private.CoreLib]](System.Collections.Generic.IEnumerable`1, System.Func`2, Boolean ByRef))
000000A351BBB9B0 00007fff97e01fa2 (MethodDesc 00007fff91cbeb48 + 0x92 System.Linq.Enumerable.SelectMany[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]](System.Collections.Generic.IEnumerable`1<System.__Canon>, System.Func`2<System.__Canon,System.Collections.Generic.IEnumerable`1<System.__Canon>>)), calling coreclr!JIT_WriteBarrier [F:\workspace\_work\1\s\src\vm\amd64\JitHelpers_Fast.asm:466]
000000A351BBB9C0 00007fff970200f8 (MethodDesc 00007fff95b3e1d0 + 0x98 Newtonsoft.Json.Serialization.DefaultContractResolver.GetClassHierarchyForType(System.Type)), calling (MethodDesc 00007fff9187fd28 + 0 System.Collections.Generic.List`1[[System.__Canon, System.Private.CoreLib]].Reverse(Int32, Int32))
000000A351BBB9F8 00007fff965f3bad 00007fff965f3bad, calling coreclr!PrecodeFixupThunk [F:\workspace\_work\1\s\src\vm\amd64\AsmHelpers.asm:251]
000000A351BBBA00 00007fff97026ae9 (MethodDesc 00007fff95b3e068 + 0x159 Newtonsoft.Json.Serialization.DefaultContractResolver.GetExtensionDataMemberForType(System.Type)), calling 00007fff965f3ba8 (stub for System.Linq.Enumerable.LastOrDefault[[System.__Canon, System.Private.CoreLib]](System.Collections.Generic.IEnumerable`1, System.Func`2))
000000A351BBBA90 00007fff9701d4b8 (MethodDesc 00007fff95b3e040 + 0x5d8 Newtonsoft.Json.Serialization.DefaultContractResolver.CreateObjectContract(System.Type)), calling 00007fff957a8d10 (stub for Newtonsoft.Json.Serialization.DefaultContractResolver.GetExtensionDataMemberForType(System.Type))
000000A351BBBC10 00007fff970192cd (MethodDesc 00007fff95b3e258 + 0x37d Newtonsoft.Json.Serialization.DefaultContractResolver.CreateContract(System.Type))
jkotas commented 2 years ago

There are the noisy !eestack stacktraces. It is not possible to tell where things went wrong from them.

Could you please share the accurate stacktraces from !k or !clrstack?

TalAloni commented 2 years ago

!clrstack:

OS Thread Id: 0x1f30 (46)
        Child SP               IP Call Site
000000426865E8C8 00007ffc99a56470 [GCFrame: 000000426865e8c8] 
000000426865EB30 00007ffc99a56470 [DebuggerU2MCatchHandlerFrame: 000000426865eb30] 
OS Thread Id: 0x1284 (27)
        Child SP               IP Call Site
000000A351BBF160 00007ffff12f6470 [DebuggerU2MCatchHandlerFrame: 000000a351bbf160]
TalAloni commented 2 years ago

rsp reqister is expected to be aligned at 16 bytes on x64. If you are seeing misaligned access here, the problem is that some code up the stack is creating misaligned rsp.

Does this mean that 'PWTB_StackAlloc' should be divisible by 16? because it is not. We can see sub rsp,68h in the print-screen in the description, IIUC this is actually `alloc_stack PWTB_StackAlloc` we can see here

jkotas commented 2 years ago

!clrstack:

Looks like clrstack has troubles walking the stack. Could you please also share k?

Does this mean that '__PWTB_StackAlloc' should be divisible by 16? because it is not.

No. The aligned rsp is required at a callsite. It is ok for rsp to temporarily misaligned during method prolog and epilog. The invariant is that the total amount pushed by the method call (ie return address) and the method prolog must be divisible by 16. This invariant is maintained by ThePreStub.

TalAloni commented 2 years ago

!k result in No export k found

kpL100

 # Child-SP          RetAddr           Call Site
00 00000042`6865d9e8 00007ffc`3a842ac0 coreclr!ThePreStub(void)+0x30
01 00000042`6865da98 0000003f`0b0a7170 0x00007ffc`3a842ac0
02 00000042`6865daa0 00000040`0c775c80 0x0000003f`0b0a7170
03 00000042`6865daa8 00000042`570e0200 0x00000040`0c775c80
04 00000042`6865dab0 00000000`00000004 0x00000042`570e0200
05 00000042`6865dab8 00007ffc`417f6ce6 0x4
06 00000042`6865dac0 00000040`0c775c80 0x00007ffc`417f6ce6
07 00000042`6865dac8 00000040`0c59cfe8 0x00000040`0c775c80
08 00000042`6865dad0 00000041`0c7875c8 0x00000040`0c59cfe8
09 00000042`6865dad8 00000042`56278e50 0x00000041`0c7875c8
0a 00000042`6865dae0 00000000`00000000 0x00000042`56278e50
# Child-SP          RetAddr           Call Site
00 000000a3`51bbb878 00007fff`f1bae158 coreclr!ThePreStub(void)+0x30
01 000000a3`51bbb928 00007fff`f1bade5b System_Linq!System.Linq.Enumerable.TryGetLast[System.__Canon]+0x128
02 000000a3`51bbb9b8 00007fff`965f3bad System_Linq!System.Linq.Enumerable.LastOrDefault[System.__Canon]+0x2b
03 000000a3`51bbba08 00007fff`97026ae9 0x00007fff`965f3bad
04 000000a3`51bbba10 00000000`00000000 0x00007fff`97026ae9

I can send links to both dumps to your email address for further analysis.

TalAloni commented 2 years ago

Sent an email with download links

jkotas commented 2 years ago

Thank you for sharing the dumps. The dumps show very significant stack corruption. The content of the registers does not match with the content of the stack memory. For example, the memory at rsp+0x68 should have values of saved registers but it is not the case (this is from the first dump):

0:046> r
rax=00007ffc416e1f10 rbx=000000426865d7b0 rcx=0000003f0b0a7170
rdx=000000400c775c80 rsi=0000000000000000 rdi=000000426865d2c0
rip=00007ffc99a56470 rsp=000000426865d9e8 rbp=000000426865dab0
 r8=00000042570e0200  r9=0000000000000004 r10=00007ffc416e1f20
r11=0000000000000002 r12=000000426865e750 r13=000000426865e910
r14=0000000000000004 r15=000000426865e8f8
iopl=0         nv up ei pl nz ac po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010214
coreclr!ThePreStub+0x30:
00007ffc`99a56470 660f7f442420    movdqa  xmmword ptr [rsp+20h],xmm0 ss:00000042`6865da08=000000426865e7500000739c7710ad64
0:046> dd 000000426865d9e8  L50
00000042`6865d9e8  00000000 00000000 0c5211f0 00000040
00000042`6865d9f8  7710bb4c 0000739c 6865e8f8 00000042
00000042`6865da08  7710ad64 0000739c 6865e750 00000042
00000042`6865da18  99965a65 00007ffc 3d5e9870 00007ffc
00000042`6865da28  00000000 00000000 99971ca2 00007ffc
00000042`6865da38  6865d828 00000042 7710acbc 0000739c
00000042`6865da48  99a2b119 00007ffc 0c775c80 00000040 <- saved rdi - it should be same as the value above
00000042`6865da58  4235daf0 00007ffc 0b0a6ea0 0000003f <- saved rsi, rbx - it should be same as the values above
00000042`6865da68  6865dab0 00000042 6865e750 00000042
00000042`6865da78  6865e910 00000042 00000004 00000000
00000042`6865da88  6865e8f8 00000042 3a842ac0 00007ffc
00000042`6865da98  0b0a7170 0000003f 0c775c80 00000040
00000042`6865daa8  570e0200 00000042 00000004 00000000
00000042`6865dab8  417f6ce6 00007ffc 0c775c80 00000040
00000042`6865dac8  0c59cfe8 00000040 0c7875c8 00000041
00000042`6865dad8  56278e50 00000042 00000000 00000000
00000042`6865dae8  0c520de8 00000040 3e5da5b8 00007ffc
00000042`6865daf8  40731672 00007ffc 00000406 00000000
00000042`6865db08  00000000 00000000 00000000 00000000
00000042`6865db18  00000000 00000000 00000000 00000000
jkotas commented 2 years ago

I am not able to find any clues what might have happened to get into this state. You can try setting COMPlus_StressLog=1 env variable and wait for the crash to occur. COMPlus_StressLog=1 enables flight data recorder that stored information about what happened in the process recently. You can then dump it using !DumpLog command in debugger.

TalAloni commented 2 years ago

Thanks for your analysis. So are you suggesting that I'll try to reproduce with the COMPlus_StressLog=1 environment variable and provide a third process crash dump from an environment that already had this set?

jkotas commented 2 years ago

try to reproduce with the COMPlus_StressLog=1 environment variable and provide a third process crash dump from an environment that already had this set?

Yep

jkotas commented 2 years ago

Thank you for sharing the dump with stress log. I do not see anything interesting happening in the runtime on the thread that crashed in the dump.

I think that the crash is likely caused by some code in the libraries that you are using, the problem is likely in managed/unmanaged interop.

Do you have a hunch what it may be? For example, a library that does a lot managed/unmanaged interop that you had problems with it in the past.

ghost commented 2 years ago

This issue has been marked needs-author-action since it may be missing important information. Please refer to our contribution guidelines for tips on how to report issues effectively.

TalAloni commented 2 years ago

Thank you Jan, No, I don't have any library to suspect, note that we did not have any issue before the migration from .NET Framework to .NET Core 3.1. I'm not seeing what we could be using that would have such effect on registers. Is there a way to get additional information in order to pin-point the source of the issue?

jkotas commented 2 years ago

I'm not seeing what we could be using that would have such effect on registers.

The registers are not the problem. The problem is that the stack memory is overwritten. It is probably getting overwritten by some other thread.

Here is a program that demonstrates what is likely happening. The program is going to crash with fatal errors and no way to tell how we got there. There are no breadcrumbs that would connect the crash with OverwriteStackAsynchronously.

OverwriteStackAsynchronously();

for (;;) Console.WriteLine("Hello world!");

unsafe void OverwriteStackAsynchronously()
{
    int* p = GetStackPointer();

    Task.Run(() => {
        for (;;) for (int i = 0; i < 100; i++) *(p-i) = i;
    });

    static unsafe int* GetStackPointer()
    {
        int x = 0;
        return &x;
    }
}

Is there a way to get additional information in order to pin-point the source of the issue?

One option is to try to catch the crash using TTD https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/time-travel-debugging-overview , but it may be too time consuming given that the issue is hard to reproduce.

Do you have any logging in your app? Can you see any patterns in what the app was doing that preceded this crash?

we did not have any issue before the migration from .NET Framework to .NET Core 3.1.

One potential explanation of why you have not seen it in .NET Framework is that .NET Framework was slower and so the stack was overwritten too late to actually cause damage.

TalAloni commented 2 years ago

Thank you for the clear explanation. The only pattern I was able to see so far (what the app was doing that preceded this crash) is Linq usage.