Hard crash & AccessViolationException on gc_heap::mark_object_simple

ayende commented 8 years ago

Sometimes, when running our tests, we get the following error:

Unhandled exception at 0x00007FFF65DC86BA (coreclr.dll) in dotnet.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.

We write code that handles unmanaged memory, so it might be what we are doing, but this only happens occasionally, and only ever during test startup (before we actually get to actually start running our tests).

The crashing stack is:

>   coreclr.dll!WKS::gc_heap::mark_object_simple(unsigned char * *) Unknown
    coreclr.dll!WKS::gc_heap::mark_through_cards_for_large_objects(void (*)(unsigned char * *),int) Unknown
    coreclr.dll!WKS::gc_heap::mark_phase(int,int)   Unknown
    coreclr.dll!WKS::gc_heap::gc1(void) Unknown
    coreclr.dll!WKS::gc_heap::garbage_collect(int)  Unknown
    coreclr.dll!WKS::GCHeap::GarbageCollectGeneration(unsigned int,enum WKS::gc_reason) Unknown
    coreclr.dll!WKS::gc_heap::allocate_more_space(struct alloc_context *,unsigned __int64,int)  Unknown
    coreclr.dll!WKS::GCHeap::Alloc(struct alloc_context *,unsigned __int64,unsigned int)    Unknown
    coreclr.dll!Alloc(unsigned __int64,int,int) Unknown
    coreclr.dll!AllocateObject(class MethodTable *,bool)    Unknown
    coreclr.dll!JIT_New(struct CORINFO_CLASS_STRUCT_ *) Unknown
    00007fff078ddb59()  Unknown
    System.Private.CoreLib.ni.dll!00007fff659f6a98()    Unknown
    000001ae1e6f1ef0()  Unknown
    000001ae1ef35db0()  Unknown
    000001ae1e6f1ef0()  Unknown

Is there any additional that I can provide?

It is sometimes reproducible when using this repository's commit here:

https://github.com/ayende/ravendb/commit/592a307920a2a154731fc3447fcf3b9637d35d68

Run dotnet restore and the in the test/FastTests run dotnet test. It seems that it happens more often if it needs to compile the code, but I'm not sure if this is the case.

jkotas commented 8 years ago

cc @swgillespie

ayende commented 8 years ago

Okay, I managed to take a crash dump for this, same stack trace. You can get it here: https://drive.google.com/file/d/0B-GYDT6Flp-sa1g1amNMakdqb0E/view?usp=sharing

jkotas commented 8 years ago

This looks like classic managed heap corruption caused by unsafe or unmanaged code.

To root cause these, it is best to start with VerifyHeap - it will often give you hints where the corruption happened.

0:013> !VerifyHeap
Could not request method table data for object 000001E75A4FC368 (MethodTable: 0102020302010200).
Last good object: 000001E75A4FC328.
object 000001e76a226a40: bad member 000001E75A4FC368 at 000001E76A227300
Last good object: 000001E76A226A20.
Failed to request SyncBlk at index 29.

Let's see what the last good object is (VerifyHeap scans the heap from low addresses to high addresses). Last good object is often what got corrupted:

0:013> !do 000001e75a4fc328
Name:        System.Byte[]
MethodTable: 00007fff659bc3e8
EEClass:     00007fff65234aa8
Size:        64(0x40) bytes
Array:       Rank 1, Number of elements 40, Type Byte (Print Array)

0:013> db  000001e75a4fc328 
000001e7`5a4fc328  e8 c3 9b 65 ff 7f 00 00-28 00 00 00 00 00 00 00  ...e....(.......
000001e7`5a4fc338  03 03 02 02 03 00 02 00-02 01 00 01 03 01 03 00  ................
000001e7`5a4fc348  03 00 02 00 02 01 03 01-00 03 00 00 02 00 01 02  ................
000001e7`5a4fc358  02 02 03 02 00 00 02 01-02 02 00 03 00 00 03 02  ................
000001e7`5a4fc368  01 02 01 02 03 02 02 01-03 00 01 00 03 01 01 02  ................
000001e7`5a4fc378  00 02 02 01 02 01 00 03-02 00 02 00 01 03 01 03  ................
000001e7`5a4fc388  03 02 03 00 02 02 02 00-03 02 02 00 02 02 02 00  ................
000001e7`5a4fc398  00 01 03 01 65 00 64 00-2c 00 20 00 6f 00 72 00  ....e.d.,. .o.r.

The byte[40] array ends at 000001e7`5a4fc360, but looking at the pattern - somebody actually wrote 100 bytes into it.

jkotas commented 8 years ago

Also interesting is that the next object after the corrupted array was likely error message: "...ed or invalid length has been given.". It may be useful to look for a place where this error message is used for the first time - the corruption likely happened nearby. Hopefully, this should give you enough clues where to look.

ayende commented 8 years ago

Thanks, we'll look at that

ayende commented 8 years ago

Thanks, we can confirm that this is very likely something that we are doing (still investigating what exactly is causing it), thanks for pointing us in the right direction.

ayende commented 8 years ago

Hm... Still probably our fault, but I would love to have your input on that. We removed the code that we suspect cause this, and then stress tested it, it finally died in:

>   coreclr.dll!WKS::gc_heap::plan_phase(int)   Unknown
    coreclr.dll!WKS::gc_heap::gc1(void) Unknown
    coreclr.dll!WKS::gc_heap::garbage_collect(int)  Unknown
    coreclr.dll!WKS::GCHeap::GarbageCollectGeneration(unsigned int,enum WKS::gc_reason) Unknown
    coreclr.dll!WKS::GCHeap::GarbageCollect(int,int,int)    Unknown
    coreclr.dll!GCInterface::Collect(int,int)   Unknown

At this point, VerifyHeap / DumpHeap are not going to produce good output, because it is in plan phase.

Any suggestions as to how to approach it?

If this helps, the dump file for that is: https://drive.google.com/file/d/0B-GYDT6Flp-sZEx3RC13NE45UkE/view?usp=sharing

jkotas commented 8 years ago

00007ff8`910bcf08 mov     rcx,qword ptr [rdi]
00007ff8`910bcf0b and     rcx,0FFFFFFFFFFFFFFFEh
00007ff8`910bcf0f mov     qword ptr [rbp-28h],rdi
00007ff8`910bcf13 test    dword ptr [rcx],80000000h ds:00020202`00030202=????????

0002020200030202` is the same byte pattern in the target address of the crash as in the earlier dump. It suggest that the problem is still there. If you poke at the dump a bit, you can find similar byte array that seems to be written to beyond the end, like in the previous dump:

0:012> !do 00000216`cbc34740
Name:        System.Byte[]
MethodTable: 00007ff890cac3e8
EEClass:     00007ff890524aa8
Size:        64(0x40) bytes
Array:       Rank 1, Number of elements 40, Type Byte (Print Array)
Content:     ........................................
Fields:
None
0:012> db 00000216`cbc34740
00000216`cbc34740  e8 c3 ca 90 f8 7f 00 00-28 00 00 00 00 00 00 00  ........(.......
00000216`cbc34750  03 03 02 02 03 00 02 00-02 01 00 01 03 01 03 00  ................
00000216`cbc34760  03 00 02 00 02 01 03 01-00 03 00 00 02 00 01 02  ................
00000216`cbc34770  02 02 03 02 00 00 02 01-02 02 00 03 00 00 03 02  ................
00000216`cbc34780  10 04 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................
00000216`cbc34790  00 00 00 00 02 01 00 03-02 00 02 00 01 03 01 03  ................
00000216`cbc347a0  02 02 03 00 02 02 02 00-03 02 02 00 02 02 02 00  ................
00000216`cbc347b0  00 01 03 01 16 02 00 00-00 10 00 00 00 00 00 00  ................

jkotas commented 8 years ago

We removed the code that we suspect cause this

What was the code that you have removed?

ayende commented 8 years ago

It is an optimized version of LZ4. The relevant commits doing the revert are: https://github.com/ayende/ravendb/commit/22518f150b3c8ada2d68bf9599b0643c9d99a766 https://github.com/ayende/ravendb/commit/c1af384e295c95a3b030e4aa71c8bf4e7fb798d5

But I think that this issue still exists, before the revert, this would reproduce in about 10 runs of our tests. Without this, it would reproduce in one in 80 or so.

Still suspect that this is the native LZ4 doing something not quite right.

ayende commented 8 years ago

Okay, this change fixed the problem: https://github.com/ayende/ravendb/commit/f876c42ae9df54f12bf94ec835cc9a1b64b5d8df

And it make no sense. This is a change in a test, not in the actual code, and the only thing it did was give us a bit more space to put a marker to see if we write past the end of the buffer. We don't do that.

What is stranger is that there is absolutely no threading involved at all in the code / test, so this shouldn't be something that is racy.

What really worries me is: https://github.com/ayende/ravendb/commit/f876c42ae9df54f12bf94ec835cc9a1b64b5d8df#diff-fe95ddc8683ab5424025dfc30d89b743L60

This error could happen in the way it does if the fixed statement wasn't respected, and the memory was moved while the code was modifying the memory.

AndyAyersMS commented 8 years ago

In the Compress test, is there a mismatch between the size variable and the input size? It looks like the input array is larger than 40 bytes...

ayende commented 8 years ago

Um... yes, and that explains everything.

I'll go away and be stupid by myself now.

Thank you so very much for the help.

dotnet / runtime

Hard crash & AccessViolationException on gc_heap::mark_object_simple #6272