dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Access Violation on x86 #12113

Closed ayende closed 4 years ago

ayende commented 5 years ago

We have a scenario in which we are getting an access violation exception during a particular load in our system. I have a full dump of the process when the crash happened, available here: https://drive.google.com/file/d/11oqZaegxKcoNT8Xj1u9YDIcH7LcmMBsW/view?usp=sharing

The scenario we have is a few servers running and communicating with one another in the same process. This is part of our test setup. We recently started seeing hard failures, such as this one: (d2c.3390): Access violation - code c0000005 (first chance)

The event log reports:

Application: dotnet.exe
CoreCLR Version: 4.6.27317.3
Description: The process was terminated due to an internal error in the .NET Runtime at IP 5A29AA9B (5A230000) with exit code 80131506.

Faulting application name: dotnet.exe, version: 2.1.27130.1, time stamp: 0x5c007ea0
Faulting module name: coreclr.dll, version: 4.6.27317.3, time stamp: 0x5c40c18e
Exception code: 0xc0000005
Fault offset: 0x0006aa9b
Faulting process id: 0x52d4
Faulting application start time: 0x01d4cc3f7d83fa59
Faulting application path: C:\Program Files (x86)\dotnet\dotnet.exe
Faulting module path: C:\Users\ayende\.nuget\packages\runtime.win-x86.microsoft.netcore.app\2.1.8\runtimes\win-x86\native\coreclr.dll
Report Id: ff99edfe-baf1-431a-9ce3-06e987219e6c
Faulting package full name: 
Faulting package-relative application ID: 

This machine has the following hotfixes applied:

PS C:\Windows\SysWOW64> Get-HotFix

Source        Description      HotFixID      InstalledBy          InstalledOn
------        -----------      --------      -----------          -----------
OREN-PC       Update           KB4100347     NT AUTHORITY\SYSTEM  2/18/2019 12:00:00 AM
OREN-PC       Update           KB4343669     NT AUTHORITY\SYSTEM  7/11/2018 12:00:00 AM
OREN-PC       Update           KB4456655     NT AUTHORITY\SYSTEM  9/13/2018 12:00:00 AM
OREN-PC       Security Update  KB4465663     NT AUTHORITY\SYSTEM  11/14/2018 12:00:00 AM
OREN-PC       Security Update  KB4471331     NT AUTHORITY\SYSTEM  12/6/2018 12:00:00 AM
OREN-PC       Security Update  KB4477137     NT AUTHORITY\SYSTEM  12/12/2018 12:00:00 AM
OREN-PC       Security Update  KB4480979     NT AUTHORITY\SYSTEM  1/9/2019 12:00:00 AM
OREN-PC       Security Update  KB4485449     NT AUTHORITY\SYSTEM  2/19/2019 12:00:00 AM
OREN-PC       Security Update  KB4487038     NT AUTHORITY\SYSTEM  2/19/2019 12:00:00 AM
OREN-PC       Security Update  KB4480966     HRHINOS\Ayende       2/21/2019 12:00:00 AM
OREN-PC       Security Update  KB4487017

The actual stack we are seeing is always something similar to:

0:043> kp
 # ChildEBP RetAddr  
00 (Inline) -------- coreclr!VolatileLoad+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 153] 
01 (Inline) -------- coreclr!Volatile<unsigned long>::Load+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 292] 
02 (Inline) -------- coreclr!Volatile<unsigned long>::operator unsigned long+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 346] 
03 (Inline) -------- coreclr!RelativePointer<Module *>::GetValue+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 68] 
04 (Inline) -------- coreclr!RelativePointer<Module *>::GetValueAtPtr+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 85] 
05 (Inline) -------- coreclr!ReadPointer+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 954] 
06 (Inline) -------- coreclr!ReadPointer+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 954] 
07 (Inline) -------- coreclr!MethodTable::GetLoaderModule+0x3 [e:\a\_work\335\s\src\vm\methodtable.inl @ 176] 
08 (Inline) -------- coreclr!MethodTable::GetLoaderAllocator+0x3 [e:\a\_work\335\s\src\vm\methodtable.inl @ 182] 
09 0e60d0a0 5a299c42 coreclr!VirtualCallStubManager::ResolveWorker(struct StubCallSite * pCallSite = 0x0e60d13c, class Object ** protectedObj = 0x0e60d180, struct DispatchToken token = struct DispatchToken, VirtualCallStubManager::StubKind stubKind = SK_DISPATCH (0n2))+0x6b [e:\a\_work\335\s\src\vm\virtualcallstub.cpp @ 1719] 
0a 0e60d168 5a34a26b coreclr!VSD_ResolveWorker(struct TransitionBlock * pTransitionBlock = 0x0e60d17c, unsigned long siteAddrForRegisterIndirect = 0, unsigned int token = 0xa0000)+0x24f [e:\a\_work\335\s\src\vm\virtualcallstub.cpp @ 1611] 
0b 0e60d190 0e995fca coreclr!ResolveWorkerAsmStub(void)+0x1b [e:\a\_work\335\s\src\vm\i386\virtualcallstubcpu.hpp @ 525] 
WARNING: Frame IP not in any known module. Following frames may be wrong.
0c 0e60d194 5a33afeb 0xe995fca
0d 0e60d1a8 5a47fc7f coreclr!CallJitEHFinallyHelper(void)+0x28 [E:\A\_work\335\s\src\vm\i386\asmhelpers.asm @ 390] 
0e 0e60d200 5a3a2acc coreclr!CallJitEHFinally(class CrawlFrame * pCf = 0x43a07fdc, unsigned char * startPC = <Value unavailable error>, struct EE_ILEXCEPTION_CLAUSE * EHClausePtr = 0x0e60d260, unsigned long nestingLevel = 0)+0xb8 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 3385] 
0f 0e60d2d0 5a249cf5 coreclr!COMPlusUnwindCallback+0x15a70c [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 2996] 
10 (Inline) -------- coreclr!Thread::MakeStackwalkerCallback+0x151 [e:\a\_work\335\s\src\vm\stackwalk.cpp @ 877] 
11 0e60d59c 5a2525c1 coreclr!Thread::StackWalkFramesEx(struct REGDISPLAY * pRD = 0x0e60d5c8, <function> * pCallback = 0x0ddb401c, void * pData = 0x0e60d954, unsigned int flags = 4, class Frame * pStartFrame = 0x00000000)+0x1d4 [e:\a\_work\335\s\src\vm\stackwalk.cpp @ 958] 
12 0e60d8d0 5a251f60 coreclr!Thread::StackWalkFrames(<function> * pCallback = 0x5a2483c0, void * pData = 0x0e60d954, unsigned int flags = 4, class Frame * pStartFrame = 0x00000000)+0xa1 [e:\a\_work\335\s\src\vm\stackwalk.cpp @ 1042] 
13 0e60d8f0 5a252a52 coreclr!UnwindFrames(class Thread * pThread = 0x0ddb3ea8, struct ThrowCallbackType * tct = 0x0e60d954)+0x3e [e:\a\_work\335\s\src\vm\excep.cpp @ 2228] 
14 (Inline) -------- coreclr!COMPlusAfterUnwind+0x98 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 482] 
15 0e60db48 5a2512fa coreclr!CPFH_RealFirstPassHandler(struct _EXCEPTION_RECORD * pExceptionRecord = 0x0e60dcb0, struct _EXCEPTION_REGISTRATION_RECORD * pEstablisherFrame = <Value unavailable error>, struct _CONTEXT * pContext = 0x0e60dd00, int bAsynchronousThreadStop = 0n0)+0x459 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 1263] 
16 0e60db88 5a250ef3 coreclr!CPFH_FirstPassHandler(struct _EXCEPTION_RECORD * pExceptionRecord = 0x0e60dcb0, struct _EXCEPTION_REGISTRATION_RECORD * pEstablisherFrame = 0x0e60e650, struct _CONTEXT * pContext = 0x0e60dd00)+0xc3 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 1401] 
17 0e60dbac 7772f1a2 coreclr!COMPlusFrameHandler(struct _EXCEPTION_RECORD * pExceptionRecord = 0x0e60dcb0, struct _EXCEPTION_REGISTRATION_RECORD * pEstablisherFrame = 0x0e60e650, struct _CONTEXT * pContext = 0x0e60dd00, struct _DISPATCHER_CONTEXT * pDispatcherContext = 0x0e60dc38)+0x83 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 1821] 
18 0e60dbd0 7772f174 ntdll!ExecuteHandler2+0x26
19 0e60dc98 7771cd86 ntdll!ExecuteHandler+0x24
1a 0e60dc98 76f21812 ntdll!KiUserExceptionDispatcher+0x26
1b 0e60e1bc 5a252d94 KERNELBASE!RaiseException+0x62
1c 0e60e264 5a3248fb coreclr!RaiseTheExceptionInternalOnly(class Object * throwable = <Value unavailable error>, int rethrow = <Value unavailable error>, int fForStackOverflow = 0n0)+0x11d [e:\a\_work\335\s\src\vm\excep.cpp @ 3039] 
1d 0e60e32c 2ed2b3e4 coreclr!IL_Throw(class Object * obj = <Value unavailable error>)+0x11b [e:\a\_work\335\s\src\vm\jithelpers.cpp @ 4860] 
1e 0e60e33c 0e3bd604 0x2ed2b3e4
1f 0e60e354 0e99a747 0xe3bd604
20 0e60e384 0e999f92 0xe99a747
21 0e60e3b0 0eecd3c9 0xe999f92
22 0e60e400 0e995ad1 0xeecd3c9
23 0e60e578 0e9934e6 0xe995ad1
24 0e60e5b0 0c3c7770 0xe9934e6
25 0e60e5d8 593868dd 0xc3c7770
26 0e60e5f4 59cfa31d System_Threading_Thread+0x68dd
27 0e60e624 59cfc0cc System_Private_CoreLib+0x4ca31d
28 0e60e638 5a33b0ef System_Private_CoreLib+0x4cc0cc
29 0e60e644 5a26fbf1 coreclr!CallDescrWorkerInternal(unsigned long pParams = 0xe60e6b8)+0x34 [E:\A\_work\335\s\src\vm\i386\asmhelpers.asm @ 618] 
2a (Inline) -------- coreclr!CallDescrWorkerWithHandler+0x52 [e:\a\_work\335\s\src\vm\callhelpers.cpp @ 78] 
2b 0e60e6e4 5a32d4e4 coreclr!MethodDescCallSite::CallTargetWorker(unsigned int64 * pArguments = 0x0e60e730, unsigned int64 * pReturnValue = 0x00000000, int cbReturnValue = 0n0)+0x235 [e:\a\_work\335\s\src\vm\callhelpers.cpp @ 620] 
2c 0e60e7bc 5a342b06 coreclr!ThreadNative::KickOffThread_Worker(void * ptr = 0x0e60e948)+0x104 [e:\a\_work\335\s\src\vm\comsynchronizable.cpp @ 260] 
2d 0e60e7d4 5a26f86a coreclr!ManagedThreadBase_DispatchInner(struct ManagedThreadCallState * pCallState = <Value unavailable error>)+0x70 [e:\a\_work\335\s\src\vm\threads.cpp @ 8852] 
2e 0e60e880 5a26f7bb coreclr!ManagedThreadBase_DispatchMiddle(struct ManagedThreadCallState * pCallState = <Value unavailable error>)+0x65 [e:\a\_work\335\s\src\vm\threads.cpp @ 8902] 
2f 0e60e8e4 5a3352b9 coreclr!ManagedThreadBase_DispatchOuter(struct ManagedThreadCallState * pCallState = 0x0e60e8ec)+0x78 [e:\a\_work\335\s\src\vm\threads.cpp @ 9161] 
30 0e60e908 5a2d9e7c coreclr!ManagedThreadBase_FullTransitionWithAD(struct ADID pAppDomain = struct ADID, <function> * pTarget = <Value unavailable error>, void * args = <Value unavailable error>, UnhandledExceptionLocation filterType = ManagedThread (0n2))+0x2f [e:\a\_work\335\s\src\vm\threads.cpp @ 9200] 
31 (Inline) -------- coreclr!ManagedThreadBase::KickOff+0x15 [e:\a\_work\335\s\src\vm\threads.cpp @ 9234] 
32 0e60e984 5a2d9d90 coreclr!ThreadNative::KickOffThread(void * pass = 0x0ad364d0)+0xcc [e:\a\_work\335\s\src\vm\comsynchronizable.cpp @ 380] 
33 0e60f824 76388484 coreclr!Thread::intermediateThreadProc(void * arg = 0x0abac020)+0x50 [e:\a\_work\335\s\src\vm\threads.cpp @ 2255] 
34 0e60f838 77713ab8 KERNEL32!BaseThreadInitThunk+0x24
35 0e60f880 77713a88 ntdll!__RtlUserThreadStart+0x2f
36 0e60f890 00000000 ntdll!_RtlUserThreadStart+0x1b

The managed stack, FWIW, is:

0:043> !clrstack
OS Thread Id: 0x3390 (43)
Child SP       IP Call Site
0e60d0d8 5a29aa9b [GCFrame: 0e60d0d8] 
0e60d118 5a29aa9b [StubDispatchFrame: 0e60d118] System.IDisposable.Dispose()
0e60d198 0e995fca Raven.Server.Rachis.FollowerAmbassador.Run()
0e60e580 0e9934e6 Raven.Server.Rachis.FollowerAmbassador.b__58_0(System.Object)
0e60e584 0c3c8000 Raven.Server.Utils.PoolOfThreads+PooledThread.DoWork()
0e60e5b8 0c3c7770 Raven.Server.Utils.PoolOfThreads+PooledThread.Run()
0e60e5e0 593868dd System.Threading.Thread.ThreadMain_ThreadStart() [E:\A\_work\321\s\corefx\src\System.Threading.Thread\src\System\Threading\Thread.cs @ 93]
0e60e5e8 59cfc00c System.Threading.ThreadHelper.ThreadStart_Context(System.Object) [E:\A\_work\335\s\src\mscorlib\src\System\Threading\Thread.cs @ 62]
0e60e5fc 59cfa31d System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [E:\A\_work\335\s\src\mscorlib\shared\System\Threading\ExecutionContext.cs @ 167]
0e60e630 59cfc0cc System.Threading.ThreadHelper.ThreadStart() [E:\A\_work\335\s\src\mscorlib\src\System\Threading\Thread.cs @ 91]
0e60e718 5a33b0ef [GCFrame: 0e60e718] 
0e60e8a8 5a33b0ef [DebuggerU2MCatchHandlerFrame: 0e60e8a8] 

We are using unsafe code, but we are pretty sure that we aren't corrupting the heap in any manner (lots of tests cover that) and if we were, I would expect to see the failure in different locations.

From trying to figure out what is going on, a few really strange things seem to be happening here:

Here is the actual failure:

FAULTING_IP: 
KERNELBASE!RaiseException+62
76f21812 8b4c2454        mov     ecx,dword ptr [esp+54h]

And the full register usage is:

(d2c.3390): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000000 ebx=00e527a0 ecx=00000010 edx=15554140 esi=5a339280 edi=00e527a0
eip=5a29aa9b esp=0e60ce44 ebp=0e60d0a0 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246

As you can see, esp has a non-null value, but the memory location at the offset given in the instruction ([esp+54h]) contains just zeros.

While troubleshooting this, we found a NullReferenceException in our code. We fixed it, and that made the problem go away. We suspect that this is some issue related to error handling inside the CoreCLR during JIT generation. We have run into a different issue with KB4487017 (See: https://github.com/dotnet/coreclr/issues/22597), but we are reproducing this on different versions of Windows and without the KB in question.

We aren't able to reproduce this issue in 64 bits.

arekpalinski commented 5 years ago

Note that 00007ffe6d37600c I used is the IP column from the stack of second dump I took (I've modified your commands accordingly).

Okay, I'll set COMPlus_StressLog=1 and reproduce it again.

janvorli commented 5 years ago

Note that 00007ffe6d37600c I used is the IP column from the stack of second dump I took

Oh, I am sorry, I got misled by the address being so similar to the previous one. Actually, the address should be taken from the frame of the Voron.Data.Fixed.FixedSizeTree.AddEmbeddedEntry instead. The x86 WinDBG for some reason reports the return address at the frame, not the ip address in the frame itself. The x64 is different.

arekpalinski commented 5 years ago

I've realized that I already had COMPlus_StressLog=1 set.

00007ffe0ea8cbaf used in the commands below was taken from:

000000707f1bd8d0 00007ffe0ea8cbaf Voron.Data.Fixed.FixedSizeTree.AddEmbeddedEntry(Int64, Boolean ByRef) [C:\Builds\RavenDB-4.1-Nightly\20190228-0530\src\Voron\Data\Fixed\FixedSizeTree.cs @ 666] 

Output of commands:

!u -gcinfo 00007ffe0ea8cbaf https://gist.github.com/arekpalinski/9440a4eff9d3a43d71f9defa786b2287

!gcinfo 00007ffe0ea8cbaf https://gist.github.com/arekpalinski/071e3e9f5c1aea1d6b5683dd25ab8d0a

!ehinfo 00007ffe0ea8cbaf https://gist.github.com/arekpalinski/60d9a7a5784a2d8785a39755bdaed75c

!ip2md 00007ffe0ea8cbaf https://gist.github.com/arekpalinski/8157a2948e5c89e4678006b472c52b51

!dumpil 00007ffe0e6d6380 https://gist.github.com/arekpalinski/f6045b53c2ed65b344a8d01f33db8030

!u didn't output anything so I've tried !u 00007ffe0ea8cbaf https://gist.github.com/arekpalinski/79511a8eb2063ce9de2eafb7374959a0

!u -gcinfo -ehinfo 00007ffe0ea8cbaf https://gist.github.com/arekpalinski/cf5edbe5271816d007346b20099eb71a

!dumplog E:\arek\dump\log.txt https://gist.github.com/arekpalinski/1c938927d7ee7eada5d198951a1546f3

If it would be easier to track this issue with the actual crash dump, please send us instructions to share it privately.

janvorli commented 5 years ago

I think I should be able to get sufficient details this way. But please keep the crash dump, I may ask you to run more sos commands later.

janvorli commented 5 years ago

Can you please also share output of the following commands?

!clrstack -f
.frame /r 0
.frame /r 3
.frame /c 1
dv

arekpalinski commented 5 years ago

Here you go:

!clrstack -f https://gist.github.com/arekpalinski/87c10e7def3763f020055d8ea0794655

.frame /r 0 https://gist.github.com/arekpalinski/ec17f2f6f82a446d5191123bf78a9b46

.frame /r 3 https://gist.github.com/arekpalinski/da10eed085e4aa84bc1fcb7e99a49fda

.frame /c 1 https://gist.github.com/arekpalinski/67271e863f3e82484e322b40a00e2fcd

dv https://gist.github.com/arekpalinski/f02743b1e8555eeba76511cefa83261d

arekpalinski commented 5 years ago

Any news on this?

janvorli commented 5 years ago

@arekpalinski I was focusing on the x86 failure. Just a minute ago, I was able to get a very simple repro of the x86 issue, however it doesn't repro the issue on x64, so the x64 one is likely something different.

As for the simple repro, see this gist: https://gist.github.com/janvorli/c4b69292d1404a5f6a45340e41739c39

The issue is caused by the fact that the GC info for the exception filter code doesn't protect the IDisposable object created in the using statement. When a GC happens during the EH stack walking, it doesn't see that object as alive and collects it. When the Dispose method is called later at the end of the using block, it crashes because the object reference is garbage.
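The shape of code described above can be sketched as follows. This is a minimal illustration, assuming a `using` block inside a `try` whose `catch` carries an exception filter in the same method; it is not the contents of the linked gist, and `Resource`, `ForceGc`, and `FilterRepro` are hypothetical names. The filter forces a collection to stand in for what GC stress would do; whether that alone is enough to trigger the crash on an affected runtime is not confirmed here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical disposable; records when Dispose is called.
public sealed class Resource : IDisposable
{
    private readonly List<string> _log;
    public Resource(List<string> log) { _log = log; }
    public void Dispose() { _log.Add("dispose"); }
}

public static class FilterRepro
{
    // Exception filter body. Filters run in the first EH pass,
    // before the finally generated for the using block executes.
    private static bool ForceGc(List<string> log)
    {
        log.Add("filter");
        GC.Collect();                  // a GC while the filter is running
        GC.WaitForPendingFinalizers();
        return true;                   // accept the exception
    }

    public static List<string> Run()
    {
        var log = new List<string>();
        try
        {
            // 'r' must stay alive until the compiler-generated finally
            // calls r.Dispose() during the second (unwind) pass.
            using (var r = new Resource(log))
            {
                throw new InvalidOperationException("boom");
            }
        }
        catch (InvalidOperationException) when (ForceGc(log))
        {
            log.Add("catch");
        }
        return log;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run()));
    }
}
```

On a runtime without the bug the recorded order is filter, dispose, catch: the filter runs first, so the object created by `using` has to be reported as live to the GC while the filter executes.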

janvorli commented 5 years ago

@dotnet/jit-contrib can someone please take a look at why we don't report the object as alive in the exception filter in the simple repro I've shared in the previous comment (Please run it with COMPlus_GCStress=4 to repro)?

janvorli commented 5 years ago

@dotnet/jit-contrib I've forgotten to mention that it repros on x86 Windows only.

janvorli commented 5 years ago

@arekpalinski can you please also share the output of the !clrthreads command?

AndyAyersMS commented 5 years ago

I will take a look.

AndyAyersMS commented 5 years ago

@janvorli this repros in both 2.1 and 3.0, correct?

janvorli commented 5 years ago

Yes, correct.

arekpalinski commented 5 years ago

@janvorli What do I need to load to have !clrthreads? Currently it returns:

0:036> !clrthreads
No export clrthreads found

arekpalinski commented 5 years ago

Did you mean !threads? If so here's the output: https://gist.github.com/arekpalinski/da238dd8cbbfb24ce9560dd47347b371

janvorli commented 5 years ago

That's strange, in my sos, both the !threads and !clrthreads work and get the same result. Anyways, what you've shared is what I needed.

AndyAyersMS commented 5 years ago

Just a guess, but I think the issue is that since filters are run "early" in the first EH pass, any local that is live into any handler must also be live into the filter, and we may not get this right on non-funclet EH models like x86.

janvorli commented 5 years ago

@arekpalinski can you please also share output of the following commands?

!pe
.frame /c 0
uf 00007ffe6d37600c

arekpalinski commented 5 years ago

0:036> !pe
There is no current managed exception on this thread

I made sure I'm in the right thread by calling:

0:036> ~~[0x30f0]s
coreclr!VirtualCallStubManager::ResolveWorker+0x6c:
00007ffe`6d37600c 488b4618        mov     rax,qword ptr [rsi+18h] ds:00027d3c`d612b6a5=????????????????

0:036> .frame /c 0 https://gist.github.com/arekpalinski/5f09c61bf40eaf86809542cf591ffff2

0:036> uf 00007ffe6d37600c https://gist.github.com/arekpalinski/a3759aca7ce8e4d0bed0f5c384d8226a

janvorli commented 5 years ago

@arekpalinski a few more things, please:

!VerifyHeap
dq 0x000000707f1bd8d0
dq 000000707f1bd960-0x48

arekpalinski commented 5 years ago

@janvorli Here you go

https://gist.github.com/arekpalinski/242ab6bc73c90eb0c6dfc45703d91cf7

janvorli commented 5 years ago

Ok, the VerifyHeap has shown there is quite a large scale heap corruption. Also, from the stress log, I can see that the last GC scan of the current thread was executed when the Voron.Data.Fixed.FixedSizeTree.AddEmbeddedEntry was not on the stack. So that rules out the GC hole in its GC info. My guess is that something unsafe is corrupting memory.

We can get some more detail on that by running the app with env var COMPlus_HeapVerify=1. That results in checking the GC heap consistency before and after each GC. When it asserts due to the heap corruption, please share the stack trace and keep the dump or windbg session so that we can check whether the corruption happened between GCs or during the GC.
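One way to apply such a setting to a single run, without changing the machine-wide environment, is to start the app as a child process with the variable set. The sketch below is an assumption-laden illustration: `StressLauncher` and `BuildStartInfo` are hypothetical names, and the app path is whatever you pass on the command line. Only the `COMPlus_` variables named in this thread (`HeapVerify`, `GCStress`, `StressLog`) are taken from the discussion.

```csharp
using System;
using System.Diagnostics;

public static class StressLauncher
{
    // Builds start info for "dotnet <appDll>" with COMPlus_<knob>=<value>
    // set only for the child process.
    public static ProcessStartInfo BuildStartInfo(string appDll, string knob, string value)
    {
        var psi = new ProcessStartInfo("dotnet", "\"" + appDll + "\"")
        {
            // Required for the Environment collection to be applied.
            UseShellExecute = false
        };
        psi.Environment["COMPlus_" + knob] = value;
        return psi;
    }

    public static int Main(string[] args)
    {
        if (args.Length < 1)
        {
            Console.Error.WriteLine("usage: StressLauncher <path-to-app.dll>");
            return 2;
        }

        // HeapVerify=1 is the value suggested in the comment above.
        var psi = BuildStartInfo(args[0], "HeapVerify", "1");
        using (var child = Process.Start(psi))
        {
            child.WaitForExit();
            return child.ExitCode;
        }
    }
}
```

Swapping the knob and value (for example `"GCStress", "4"` or `"StressLog", "1"`) gives the other configurations mentioned in this thread.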

ayende commented 5 years ago

My main concern here is that this is running on Win 2019 server, and it crashes consistently. The same code on a Win 2016 server works fine. There is this issue: https://github.com/dotnet/coreclr/issues/22597 , which I believe might be another manifestation of the problem.

AndyAyersMS commented 5 years ago

My understanding is that dotnet/runtime#12038 will result in crashes but not heap corruption.

AndyAyersMS commented 5 years ago

I'm evaluating a candidate fix now. It is not x86 specific, and while I haven't looked in depth yet, I think the issue it addresses would apply to all architectures... will update when I know more.

janvorli commented 5 years ago

It is not x86 specific

The issue is x86 Windows specific. We handle that differently on other architectures. The reason is that on other architectures / platforms, when we are running in the filter, we have the try block frame on the call stack too, so we walk it and report the locals from there. Thus we don't need to report them from the filter frame. That is not the case on x86 Windows due to the way SEH works.

AndyAyersMS commented 5 years ago

The issue is x86 windows specific.

Yes, I see. From the jit's internal standpoint the liveness computation was still wrong for filters on other architectures, but fixing that didn't impact jit codegen or gc info.

PR for the filter liveness fix is up: dotnet/coreclr#23044.

AndyAyersMS commented 5 years ago

@janvorli @ayende @arekpalinski would be great if you could verify the fix if possible...

janvorli commented 5 years ago

@AndyAyersMS I will.

arekpalinski commented 5 years ago

@AndyAyersMS I've compiled CoreCLR 2.1.8 with dotnet/coreclr#23044 fix and tried it on our x64 instance (as I noticed this comment in your PR: https://github.com/dotnet/coreclr/pull/23044#issuecomment-469923643). It's still throwing AVE so it looks like what we experience on x64 is something different.

I've tried the same (CoreCLR 2.1.8 with dotnet/coreclr#23044) on x86 and run the original repro reported by @ayende here - no failure. Great!

@janvorli The usage of COMPlus_HeapVerify=1 didn't make any difference - no assertion due to heap corruption. I got AVE - same stacktrace as before. Heap is corrupted. Any advice how to investigate further on x64?

janvorli commented 5 years ago

@arekpalinski can you please try with COMPlus_HeapVerify=3?

arekpalinski commented 5 years ago

@janvorli We have a lead that heap corruption might be caused by change in our code. We're investigating that.

AndyAyersMS commented 5 years ago

x86 issue seems to be fixed.

Am going to re-open this until we've got more clarity on what is happening for x64.

ayende commented 5 years ago

We are pretty sure that the x64 stuff is our fault and not related to this issue. It just manifested in pretty much the same way and stack trace.

arekpalinski commented 5 years ago

We can confirm that we no longer experience the issue on x64. @janvorli Thanks for help in narrowing it down.

AndyAyersMS commented 5 years ago

Thanks. Now keeping this open to track the (proposed) porting of this fix to 2.1.

BruceForstall commented 5 years ago

@AndyAyersMS Should the milestone be changed to 2.1/2.2?

AndyAyersMS commented 5 years ago

Makes sense, yes.

RussKeldorph commented 5 years ago

@AndyAyersMS @BruceForstall Fixed by dotnet/coreclr#23138?

AndyAyersMS commented 5 years ago

Yes. Also merged to 2.2 via dotnet/coreclr#23256.