Open derekbruening opened 9 years ago
Locally, running the whole shard 10x in a loop => no repro. Running individual tests 28x => no repro. On the bot, running individual tests 20x => no repro.
Lowering priority to medium b/c we can't reproduce it.
Could DR's cache consistency handling have made this page read-only? Xref #1354
Hit on Chromium bots: https://code.google.com/p/chromium/issues/detail?id=514921
Crash info:
Also crashed on TaskManagerTest.RefreshCalled:
These tests themselves are not new and have not been changed recently.
This crash is non-deterministic: it went away and came back on bot #3, where builds 7377-7380 are green (and the other 2 bots that shard unit_tests don't have the crash in that range either).
This bot has been purple a lot. I actually see this crash further back: builds 7330 and 7332 on bot #3 from July 23. The crashes could go back even further than that.
Logging in to the bot and running just this test or all 3 TaskManagerTest.* using the same args as the scripts: the tests run just fine with no crash.
Is it symbol cache corruption (https://github.com/DynamoRIO/drmemory/issues/1465), which can cause weird crashes? On the bot, in the AppData/LocalLow/drmemory.symcache directory:
$ grep 157db *
msvcrt.dll.txt:_CrtDbgReport,0x157db
msvcrt.dll.txt:_CrtDbgReportW,0x157db
msvcrt.dll.txt:_CrtDbgReportV,0x157db
msvcrt.dll.txt:_CrtDbgReportWV,0x157db
msvcrt.dll.txt:_CrtSetDbgFlag,0x157db
msvcrt.dll.txt:_crtDbgFlag,0x157db
So doesn't look like it. I made a copy of the original symcache dir on the bot (bug_514921/) and cleared out the old one just in case.
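For repeating this check on other bots, here is a minimal Python sketch of the same lookup. It assumes each symcache entry is a single `name,0xOFFSET` line, which is inferred only from the grep output above; `symbols_at_offset` and `scan_symcache` are hypothetical helper names, not part of Dr. Memory. Unlike `grep 157db`, it matches the offset exactly rather than as a substring.

```python
from pathlib import Path


def symbols_at_offset(text, offset):
    """Return the symbol names in `text` whose recorded offset equals `offset`.

    Assumes one `name,0xOFFSET` entry per line (inferred from the grep
    output in this thread); lines that do not fit are skipped.
    """
    names = []
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(",")
        if not sep:
            continue  # no comma: not a symbol entry
        try:
            if int(value, 16) == offset:
                names.append(name)
        except ValueError:
            continue  # trailing field is not a hex number
    return names


def scan_symcache(symcache_dir, offset):
    """Map each *.txt file in `symcache_dir` to its symbols at `offset`."""
    results = {}
    for path in sorted(Path(symcache_dir).glob("*.txt")):
        names = symbols_at_offset(path.read_text(errors="replace"), offset)
        if names:
            results[path.name] = names
    return results
```

Calling `scan_symcache(<symcache dir>, 0x157db)` should list the same msvcrt.dll.txt entries as the grep, provided the file format assumption holds.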
Symbolizing the crash call stack:
Passes ptr in ecx, size in edx.
So it's this line:
Failed to write to a new page (exception info 0x00000001 0x54130000: a write fault at page-aligned address 0x54130000). Still has eax=4 * 16 = 64 bytes (+ more if non-16-aligned) left to zero. Original size edx=0x60? But that's <0x80. edx could have been modified before the crash if the original ptr was not aligned to 16.
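The arithmetic above can be checked with a small Python sketch. It assumes 4 KiB pages and a fill routine that first zeroes an unaligned head and then proceeds in 16-byte chunks; that model of the routine is a guess from the register values, and `split_fill` is a hypothetical helper, not code from the crashing function. The sample pointers passed to it are made up for illustration.

```python
PAGE_SIZE = 0x1000  # assumed 4 KiB pages
CHUNK = 16          # assumed bytes zeroed per aligned store


def is_page_start(addr):
    """True if `addr` is the first byte of a page."""
    return addr % PAGE_SIZE == 0


def split_fill(ptr, size):
    """Split a `size`-byte fill at `ptr` into (head, chunks, tail).

    head:   bytes needed to reach 16-byte alignment
    chunks: number of full 16-byte stores
    tail:   bytes left over after the last full store
    """
    head = min((-ptr) % CHUNK, size)
    chunks, tail = divmod(size - head, CHUNK)
    return head, chunks, tail


print(is_page_start(0x54130000))  # fault address from the crash: True
print(4 * CHUNK)                  # eax=4 chunks still to zero: 64
print(split_fill(0x1000, 0x60))   # aligned pointer: (0, 6, 0)
print(split_fill(0x1008, 0x60))   # unaligned pointer: (8, 5, 8)
```

The last two lines show why edx may not match the original size at the crash: with an unaligned pointer, some bytes are consumed by the head before the chunked loop starts.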
Very strange: if the allocator really messes up this badly and has an unwritable page in the middle of a new alloc, wouldn't we see a lot more problems? Is there really some crazy free list corner case that's this rare? We haven't updated DrMem in a while, so did some change in Cr alter its alloc pattern and suddenly trigger this weird bug?
I'm still trying to reproduce locally: running individual tests in a loop had no success so I am now running this same set of unit_tests subtests sharded in a loop. No repro so far.
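A minimal Python sketch of the repro loop, for anyone who wants to run the same experiment. `run_until_failure` is a hypothetical helper, and the command passed to it is a placeholder: the real runs use the bot scripts and their arguments, which are not reproduced here.

```python
import subprocess
import sys


def run_until_failure(cmd, iterations):
    """Run `cmd` up to `iterations` times.

    Returns the 1-based index of the first run that exits non-zero,
    or None if every run succeeded (no repro).
    """
    for i in range(1, iterations + 1):
        returncode = subprocess.run(cmd).returncode
        print(f"run {i}/{iterations}: exit code {returncode}", file=sys.stderr)
        if returncode != 0:
            return i
    return None


# Usage (placeholder command; substitute the actual test invocation):
#   run_until_failure(["unit_tests.exe", "--gtest_filter=TaskManagerTest.*"], 28)
```

Stopping at the first failure keeps the crashing run's logs and any dump files from being overwritten by later iterations.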