corngood / kill-ryzen-win


Notes on crypt32.dll crash #1

Open corngood opened 6 years ago

corngood commented 6 years ago

We've got two Ryzen systems in our office, and both of them have been having intermittent crashes in cmd.exe. Sometimes this will take down an interactive shell, sometimes a subshell, but you'll usually get an error like this. There are also occasional random crashes in other places, such as internal errors in Microsoft's C compiler (cl.exe), but most of the time it's cmd.exe crashing.

I've eliminated everything I could think of in software, so my next step is to replace the CPU. Both machines get segfaults using kill-ryzen.sh, so I'm going to try to find a replacement that's more stable and see if it also fixes my problems in Windows.

The call-stack is usually something like:

crypt32.dll!VerifyStackAvailable() Unknown
crypt32.dll!EnumRegFuncCallback(unsigned long,char const *,char const *,unsigned long,unsigned long const * const,unsigned short const * const * const,unsigned char const * const * const,unsigned long const * const,void *)  Unknown
crypt32.dll!CryptEnumOIDFunction() Unknown
crypt32.dll!LoadRegFunc(struct _FUNC_SET *) Unknown
crypt32.dll!CryptGetOIDFunctionAddress()   Unknown
crypt32.dll!CryptSIPVerifyIndirectData()   Unknown
wintrust.dll!SoftpubLoadMessage()  Unknown
wintrust.dll!WinVerifyTrust()  Unknown
wintrust.dll!WinVerifyTrust()  Unknown
advapi32.dll!__CodeAuthzpIdentifyOneCodeAuthzLevel()   Unknown
advapi32.dll!SaferIdentifyLevel()  Unknown
kernel32.dll!BasepCheckWinSaferRestrictions()   Unknown
KernelBase.dll!CreateProcessInternalW() Unknown
KernelBase.dll!CreateProcessW()    Unknown
kernel32.dll!CreateProcessWStub()  Unknown
cmd.exe!ExecPgm(struct cmdnode *,unsigned int,unsigned int,unsigned short const *,unsigned short const *,unsigned short const *)    Unknown
cmd.exe!ECWork(struct cmdnode *,unsigned int,unsigned int)  Unknown
cmd.exe!FindFixAndRun(struct cmdnode *) Unknown
cmd.exe!Dispatch(int,struct node *) Unknown
cmd.exe!BatLoop(struct batdata *,struct cmdnode *)  Unknown
cmd.exe!BatProc(struct cmdnode *,unsigned short *,int,int)  Unknown
cmd.exe!ECWork(struct cmdnode *,unsigned int,unsigned int)  Unknown
cmd.exe!FindFixAndRun(struct cmdnode *) Unknown
cmd.exe!Dispatch(int,struct node *) Unknown
cmd.exe!main() Unknown
cmd.exe!wil::details_abi::ProcessLocalStorage<struct wil::details_abi::ProcessLocalData>::~ProcessLocalStorage<struct wil::details_abi::ProcessLocalData>(void) Unknown
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown

I've seen it both in the 32-bit and 64-bit cmd.exe, though it seems more stable in 32-bit.

crypt32.dll is trying to allocate a ~1MiB buffer on the stack using alloca, according to the g_ulMaxStackAllocSize variable. Interestingly, g_ulMaxStackAllocSize seems to be uninitialised by the crypt32.dll!DllMain. If you launch cmd.exe in a debugger, it gets initialised to zero, and the stack allocation never takes place.

So it's trying to allocate 1MB from the stack, which is set to 1MB (reserve, commit is less) for cmd.exe. _chkstk (VC/crt/src/i386/chkstk.asm) gets called with 0xfc000 bytes, and it walks down the stack, touching each page. If the page is in the stack reserve, but not committed, the kernel commits it, growing the stack. If it goes past the bottom of the stack, it's supposed to throw a stack overflow exception.
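To make the arithmetic concrete, here is a rough Python model of that probe walk. It is only a sketch: the 4 KiB page size, the function name `probe_stack`, and the `StackOverflow` class are assumptions for illustration, and it ignores the guard page and the other bookkeeping the real kernel does.

```python
PAGE_SIZE = 0x1000        # assumed 4 KiB pages
STACK_RESERVE = 0x100000  # 1 MiB reserve, as described for cmd.exe
ALLOC_SIZE = 0xFC000      # size passed to _chkstk in the report


class StackOverflow(Exception):
    """Stands in for the stack overflow exception the probe should raise."""


def probe_stack(used_bytes, alloc_bytes, committed_pages):
    """Touch every page the allocation covers, walking down from the stack pointer.

    Pages are counted from the top of the stack (page 0 is the highest).
    Returns the committed page count after the walk, or raises StackOverflow
    if the walk goes past the bottom of the reserve.
    """
    total_pages = STACK_RESERVE // PAGE_SIZE
    first_page = used_bytes // PAGE_SIZE
    last_page = (used_bytes + alloc_bytes - 1) // PAGE_SIZE
    for page in range(first_page, last_page + 1):
        if page >= total_pages:
            raise StackOverflow(f"probe left the reserve at page {page}")
        if page >= committed_pages:
            # Touching a reserved-but-uncommitted page commits it (stack growth).
            committed_pages = page + 1
    return committed_pages


# The allocation alone covers 252 of the 256 pages in the reserve.
pages_needed = ALLOC_SIZE // PAGE_SIZE
pages_total = STACK_RESERVE // PAGE_SIZE
```

In this model the request only fits if less than 16 KiB of stack is already in use, so whether the probe succeeds or overflows depends on how deep the call stack is at that point.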

This is where things seem to go weird on Ryzen. Occasionally the process will get an access violation from the stack probe in _chkstk, when it should be getting a stack overflow (which crypt32.dll handles and falls back to heap allocation).
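The difference between the two outcomes can be sketched like this. It is a hypothetical Python model, not crypt32.dll's actual code: `allocate_buffer`, `StackOverflow`, and `AccessViolation` are made-up names, and the only behaviour taken from the description above is that a stack overflow is handled by falling back to the heap while an access violation is not.

```python
class StackOverflow(Exception):
    """The exception the stack probe is supposed to raise."""


class AccessViolation(Exception):
    """The exception that is occasionally seen instead."""


def allocate_buffer(size, probe):
    """Try a stack allocation first; fall back to the heap on stack overflow.

    `probe` is a hypothetical callable standing in for the _chkstk walk.
    """
    try:
        probe(size)
        return "stack"
    except StackOverflow:
        # Expected failure mode: handled, so the caller never notices.
        return "heap"
    # AccessViolation is deliberately not caught here, so it propagates
    # and takes the process down.
```

Under that assumption, the same failed probe is harmless when it surfaces as a stack overflow and fatal when it surfaces as an access violation.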

This all seems very similar to what's happening to gcc in kill-ryzen.sh.

corngood commented 6 years ago

We've found that authenticode needs to be enabled for this crash to happen, which makes sense based on the call stack.

ghost commented 6 years ago

Well, tested on one Ryzen from 2217 (pre-fix). Crashed almost immediately on test.

If anyone is looking for a solution that doesn't require an RMA, there is one: I've disabled the CPU micro-op cache (called uOpCache/Op Cache/...). On e.g. ASUS X370 boards this requires a modded BIOS that enables the AMD CBS menu; some other boards may allow it on a stock BIOS.

After disabling uOpCache, the tests (both Linux kill-ryzen and this one) stopped crashing and could run for hours until terminated manually. The performance loss is negligible (around 3% on 7zip and compilation times; WinRAR seemingly even benefits slightly (~1%) in multi-threaded mode). Some people have noticed slight performance drops on 'fixed' Ryzens, so these probably just come with uOpCache internally disabled or limited.

corngood commented 6 years ago

@alexat Thanks for posting.

My experience is that the post-RMA chip (week 39) no longer crashes in the Linux test, but still crashes with this one.

AMD also confirmed they could reproduce this crash (about six weeks ago?) and promised to send out a new (???) chip for us to test, but have gone completely silent on our support ticket since then.

I couldn't disable uOpCache on our original boards, but we recently got an X370, so I'll see if I can find a way to test that.

ghost commented 6 years ago

For ASUS, the modding process to open the CBS menu is described here: https://puissanceled.com/vrac/Bios_modding/EN.html

For other boards, the process may differ depending on BIOS vendor, version, etc., but overall it should be similar.