Note that the 00007ffe6d37600c address I used is from the IP column of the stack in the second dump I took (I've modified your commands accordingly).
Okay, I'll set COMPlus_StressLog=1 and reproduce it again.
> Note that the 00007ffe6d37600c address I used is from the IP column of the stack in the second dump I took
Oh, I am sorry, I got misled by the address being so similar to the previous one. Actually, the address should be taken from the frame of Voron.Data.Fixed.FixedSizeTree.AddEmbeddedEntry instead. The x86 WinDBG for some reason reports the return address at the frame, not the IP address in the frame itself. The x64 one is different.
I've realized that I already had COMPlus_StressLog=1 set.
The address 00007ffe0ea8cbaf used in the commands below was taken from:
000000707f1bd8d0 00007ffe0ea8cbaf Voron.Data.Fixed.FixedSizeTree.AddEmbeddedEntry(Int64, Boolean ByRef) [C:\Builds\RavenDB-4.1-Nightly\20190228-0530\src\Voron\Data\Fixed\FixedSizeTree.cs @ 666]
Output of commands:
!u -gcinfo 00007ffe0ea8cbaf
https://gist.github.com/arekpalinski/9440a4eff9d3a43d71f9defa786b2287
!gcinfo 00007ffe0ea8cbaf
https://gist.github.com/arekpalinski/071e3e9f5c1aea1d6b5683dd25ab8d0a
!ehinfo 00007ffe0ea8cbaf
https://gist.github.com/arekpalinski/60d9a7a5784a2d8785a39755bdaed75c
!ip2md 00007ffe0ea8cbaf
https://gist.github.com/arekpalinski/8157a2948e5c89e4678006b472c52b51
!dumpil 00007ffe0e6d6380
https://gist.github.com/arekpalinski/f6045b53c2ed65b344a8d01f33db8030
!u didn't output anything, so I've tried !u 00007ffe0ea8cbaf
https://gist.github.com/arekpalinski/79511a8eb2063ce9de2eafb7374959a0
!u -gcinfo -ehinfo 00007ffe0ea8cbaf
https://gist.github.com/arekpalinski/cf5edbe5271816d007346b20099eb71a
!dumplog E:\arek\dump\log.txt
https://gist.github.com/arekpalinski/1c938927d7ee7eada5d198951a1546f3
If it would be easier to track this issue with the actual crash dump, please send us instructions on how to share it privately.
I think I should be able to get sufficient details this way. But please keep the crash dump; I may ask you to run more SOS commands later.
Can you please also share output of the following commands?
!clrstack -f
.frame /r 0
.frame /r 3
.frame /c 1
dv
Here you go:
!clrstack -f
https://gist.github.com/arekpalinski/87c10e7def3763f020055d8ea0794655
.frame /r 0
https://gist.github.com/arekpalinski/ec17f2f6f82a446d5191123bf78a9b46
.frame /r 3
https://gist.github.com/arekpalinski/da10eed085e4aa84bc1fcb7e99a49fda
.frame /c 1
https://gist.github.com/arekpalinski/67271e863f3e82484e322b40a00e2fcd
dv
https://gist.github.com/arekpalinski/f02743b1e8555eeba76511cefa83261d
Any news on this?
@arekpalinski I was focusing on the x86 failure. Just a minute ago, I was able to get a very simple repro of the x86 issue, however it doesn't repro the issue on x64, so the x64 one is likely something different.
As for the simple repro, see this gist: https://gist.github.com/janvorli/c4b69292d1404a5f6a45340e41739c39
The issue is caused by the fact that the GC info for the exception filter code doesn't protect the IDisposable object created in the using statement. When a GC happens during the EH stack walk, it doesn't see that object as alive and collects it. When the Dispose method is called later at the end of the using block, it crashes because the object reference is garbage.
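For anyone reading along without opening the gist, here is a minimal sketch of the failing shape as described above (the type and member names are illustrative, not taken from the actual repro):

```csharp
using System;

class Resource : IDisposable
{
    public bool ShouldHandle(Exception e) => e is InvalidOperationException;
    public void Dispose() => Console.WriteLine("disposed");
}

class Program
{
    static void Main()
    {
        // `r` has to stay reported as live until its Dispose call at the end of the using block.
        using (var r = new Resource())
        {
            try
            {
                throw new InvalidOperationException();
            }
            // The filter runs during the first EH pass. If the GC info for the filter
            // doesn't report `r` as live, a GC triggered while the filter executes
            // (e.g. under COMPlus_GCStress=4) can collect it.
            catch (Exception e) when (r.ShouldHandle(e))
            {
            }
        }
        // Leaving the using block calls Dispose on a collected/garbage reference -> access violation.
    }
}
```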
@dotnet/jit-contrib can someone please take a look at why we don't report the object as alive in the exception filter in the simple repro I've shared in the previous comment? (Please run it with COMPlus_GCStress=4 to repro.)
@dotnet/jit-contrib I've forgotten to mention that it repros on x86 Windows only.
@arekpalinski can you please also share the output of the !clrthreads command?
I will take a look.
@janvorli this repros in both 2.1 and 3.0, correct?
Yes, correct.
@janvorli What do I need to load to have !clrthreads? Currently it returns:
0:036> !clrthreads
No export clrthreads found
Did you mean !threads? If so, here's the output:
https://gist.github.com/arekpalinski/da238dd8cbbfb24ce9560dd47347b371
That's strange; in my SOS, both !threads and !clrthreads work and give the same result. Anyway, what you've shared is what I needed.
Just a guess, but I think the issue is that since filters are run "early" in the first EH pass, any local that is live into any handler must also be live into the filter, and we may not get this right on non-funclet EH models like x86.
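To make that concrete, here is a hypothetical illustration of the constraint (not taken from the repro): a local referenced only in the catch body still has to be reported live while the filter runs, because the filter executes before the handler body:

```csharp
using System;

class Example
{
    static bool Filter(Exception e) => true;

    static void Main()
    {
        object state = new object();  // referenced only inside the handler body below
        try
        {
            throw new Exception();
        }
        // First EH pass: the filter runs here, before the handler body,
        // so `state` must already be reported as live at this point.
        catch (Exception e) when (Filter(e))
        {
            Console.WriteLine(state);
        }
    }
}
```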
@arekpalinski can you please also share output of the following commands?
!pe
.frame /c 0
uf 00007ffe6d37600c
0:036> !pe
There is no current managed exception on this thread
I made sure I'm in the right thread by calling:
0:036> ~~[0x30f0]s
coreclr!VirtualCallStubManager::ResolveWorker+0x6c:
00007ffe`6d37600c 488b4618 mov rax,qword ptr [rsi+18h] ds:00027d3c`d612b6a5=????????????????
0:036> .frame /c 0
https://gist.github.com/arekpalinski/5f09c61bf40eaf86809542cf591ffff2
0:036> uf 00007ffe6d37600c
https://gist.github.com/arekpalinski/a3759aca7ce8e4d0bed0f5c384d8226a
@arekpalinski a few more things please:
!VerifyHeap
dq 0x000000707f1bd8d0
dq 000000707f1bd960-0x48
@janvorli Here you go
https://gist.github.com/arekpalinski/242ab6bc73c90eb0c6dfc45703d91cf7
Ok, VerifyHeap has shown that there is quite large-scale heap corruption. Also, from the stress log, I can see that the last GC scan of the current thread was executed when Voron.Data.Fixed.FixedSizeTree.AddEmbeddedEntry was not on the stack, so that rules out a GC hole in its GC info. My guess is that something unsafe is corrupting memory.
We can get some more detail on that by running the app with the env var COMPlus_HeapVerify=1. That results in checking GC heap consistency before and after each GC. When it asserts due to heap corruption, please share the stack trace and keep the dump or WinDBG session so that we can check whether the corruption happened between GCs or during a GC.
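As an aside, here is a hypothetical illustration (not code from the project) of the kind of unsafe bug that produces this picture: an out-of-bounds write through a pinned pointer tramples whatever object happens to follow the buffer on the heap, and the damage only surfaces later, at the next GC or when the neighboring object is used:

```csharp
using System;

class Neighbor
{
    public string Name = "still intact?";
}

class Program
{
    // Requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks> in the project file.
    static unsafe void Main()
    {
        var buffer = new byte[16];
        var neighbor = new Neighbor();  // may end up allocated right after the array

        fixed (byte* p = buffer)
        {
            // Bug: writes 64 bytes into a 16-byte array, overwriting adjacent heap memory
            // (object headers, method table pointers, references...).
            for (int i = 0; i < 64; i++)
                p[i] = 0xCC;
        }

        // Nothing fails here; the corruption is typically detected much later,
        // e.g. by COMPlus_HeapVerify around a GC, or when the victim object is touched.
        GC.Collect();
        Console.WriteLine(neighbor.Name);
    }
}
```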
My main concern here is that this is running on a Windows Server 2019 machine, and it crashes consistently. The same code on a Windows Server 2016 machine works fine. There is this issue: https://github.com/dotnet/coreclr/issues/22597, which I believe might be another manifestation of the problem.
My understanding is that dotnet/runtime#12038 will result in crashes but not heap corruption.
I'm evaluating a candidate fix now. It is not x86 specific, and while I haven't looked in depth yet, I think the issue it addresses would apply to all architectures... will update when I know more.
> It is not x86 specific
The issue is x86 Windows specific. We handle that differently on other architectures. The reason is that on other architectures / platforms, when we are running in the filter, we have the try block frame on the call stack too, so we walk it and report the locals from there. Thus we don't need to report them from the filter frame. That is not the case on x86 Windows due to the way SEH works.
> The issue is x86 Windows specific.
Yes, I see. From the JIT's internal standpoint, the liveness computation was still wrong for filters on other architectures, but fixing that didn't impact JIT codegen or GC info.
PR for the filter liveness fix is up: dotnet/coreclr#23044.
@janvorli @ayende @arekpalinski would be great if you could verify the fix if possible...
@AndyAyersMS I will.
@AndyAyersMS I've compiled CoreCLR 2.1.8 with the dotnet/coreclr#23044 fix and tried it on our x64 instance (as I noticed this comment in your PR: https://github.com/dotnet/coreclr/pull/23044#issuecomment-469923643). It's still throwing an AVE, so it looks like what we experience on x64 is something different.
I've tried the same (CoreCLR 2.1.8 with dotnet/coreclr#23044) on x86 and ran the original repro reported by @ayende here - no failure. Great!
@janvorli The usage of COMPlus_HeapVerify=1 didn't make any difference - no assertion due to heap corruption. I got an AVE with the same stack trace as before, and the heap is corrupted. Any advice on how to investigate further on x64?
@arekpalinski can you please try with COMPlus_HeapVerify=3?
@janvorli We have a lead that heap corruption might be caused by change in our code. We're investigating that.
The x86 issue seems to be fixed.
I am going to re-open this until we've got more clarity on what is happening for x64.
We are pretty sure that the x64 issue is our fault and not related to this one. It just manifested in pretty much the same way, with the same stack trace.
We can confirm that we no longer experience the issue on x64. @janvorli Thanks for your help in narrowing it down.
Thanks. Now keeping this open to track the (proposed) porting of this fix to 2.1.
@AndyAyersMS Should the milestone be changed to 2.1/2.2?
Makes sense, yes.
@AndyAyersMS @BruceForstall Fixed by dotnet/coreclr#23138?
Yes. Also merged to 2.2 via dotnet/coreclr#23256.
We have a scenario in which we are getting an access violation exception during a particular load in our system. I have a full dump of the process when the crash happened, available here: https://drive.google.com/file/d/11oqZaegxKcoNT8Xj1u9YDIcH7LcmMBsW/view?usp=sharing
The scenario we have is a few servers running and communicating with one another in the same process. This is part of our test setup. We recently started seeing hard failures, such as this one:
(d2c.3390): Access violation - code c0000005 (first chance)
The event log reports:
This machine has the following hotfixes applied:
The actual stack we are seeing is always something similar to:
The managed stack, FWIW, is:
We are using unsafe code, but we are pretty sure that we aren't corrupting the heap in any manner (lots of tests cover that) and if we were, I would expect to see the failure in different locations.
From trying to figure out what is going on, a few really strange things seem to be happening here:
Here is the actual failure:
And the full register usage is:
As you can see, the esp register has a non-null value, but checking the memory location at the offset provided to the instruction shows just zeros.

While troubleshooting this, we found a NullReferenceException in our code. We fixed it, but that made the problem go away. We suspect that this is some issue related to error handling inside CoreCLR during JIT generation. We have run into a different issue with KB4487017 (see: https://github.com/dotnet/coreclr/issues/22597), but we are reproducing this on different versions of Windows and without the KB in question.

We aren't able to reproduce this issue on 64 bits.