[Arm32/Linux] GS cookie corruption when running some corefx tests.

janvorli commented 5 years ago

While running release build of corefx tests on checked build of coreclr on ARM32 (tested on my RPi3 with Raspbian), I have found that couple of corefx test suites fail due to GS cookie corruption detected at GC stack walk time. This happens:

in System.Globalization.Extensions.Tests for GS cookie in the frame of System.Globalization.IdnMapping.GetAsciiCore
in System.Collections.Specialized.Tests and System.Data.Common.Tests for GS cookie in the frame of System.Globalization.CompareInfo.GetHashCodeOfStringCore

The issue reproes in 80..100% runs of the test suites. I was trying to debug both cases, but the functions with corrupted GS cookies are called many times before the issue reproes, so I cannot use something as simple as memory watchpoint to find who's corrupting the cookie.

Unfortunately, LLDB / sos plugin on this platform are quite unstable together, so e.g. the clrstack sos command kills LLDB. At least the ip2md works so that I can see what's on the managed stack.

Here is a call stack of the thread with the System.Globalization.IdnMapping.GetAsciiCore on the stack when another thread runs GC and finds the corrupted cookie:

* thread dotnet/runtime#3859, name = 'dotnet'
  * frame #0: 0x76f9b9a4 libpthread.so.0`__pthread_cond_wait(cond=0x001317a0, mutex=0x00131788) at pthread_cond_wait.c:188
    frame dotnet/coreclr#1: 0x768692a8 libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(ptnwdNativeWaitData=0x00131788, dwTimeout=4294967295, ptwrWakeupReason=0x674ffac8, pdwSignaledObject=0x674ffac4) at synchmanager.cpp:479:28
    frame dotnet/coreclr#2: 0x76868814 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(this=0x0005e4c8, pthrCurrent=0x001315e8, dwTimeout=4294967295, fAlertable=false, fIsSleep=false, ptwrWakeupReason=0x674ffd68, pdwSignaledObject=0x674ffd90) at synchmanager.cpp:302:22
    frame dotnet/coreclr#3: 0x7687580c libcoreclr.so`CorUnix::InternalWaitForMultipleObjectsEx(pThread=0x001315e8, nCount=1, lpHandles=0x674ffe90, bWaitAll=NO, dwMilliseconds=4294967295, bAlertable=NO, bPrioritize=NO) at wait.cpp:636:45
    frame dotnet/coreclr#4: 0x768761be libcoreclr.so`::WaitForSingleObjectEx(hHandle=0x00000084, dwMilliseconds=4294967295, bAlertable=NO) at wait.cpp:139:13
    frame dotnet/coreclr#5: 0x764b1a34 libcoreclr.so`CLREventWaitHelper2(handle=0x00000084, dwMilliseconds=4294967295, alertable=NO) at synch.cpp:377:12
    frame dotnet/coreclr#6: 0x764b1924 libcoreclr.so`CLREventWaitHelper(this=0x674fff3c, pParam=0x674fff44)::$_1::operator()(CLREventWaitHelper(void*, unsigned int, int)::Param*) const at synch.cpp:402:26
    frame dotnet/coreclr#7: 0x764b073e libcoreclr.so`CLREventWaitHelper(handle=0x00000084, dwMilliseconds=4294967295, alertable=NO) at synch.cpp:404:5
    frame dotnet/coreclr#8: 0x764b06a6 libcoreclr.so`CLREventBase::WaitEx(this=0x000bbb68, dwMilliseconds=4294967295, mode=WaitMode_None, syncState=0x00000000) at synch.cpp:471:20
    frame dotnet/coreclr#9: 0x764b0550 libcoreclr.so`CLREventBase::Wait(this=0x000bbb68, dwMilliseconds=4294967295, alertable=NO, syncState=0x00000000) at synch.cpp:417:12
    frame dotnet/coreclr#10: 0x765d080a libcoreclr.so`GCEvent::Impl::Wait(this=0x000bbb68, timeout=4294967295, alertable=false) at gcenv.os.cpp:1153:24
    frame dotnet/coreclr#11: 0x765cff6c libcoreclr.so`GCEvent::Wait(this=0x000bbb50, timeout=4294967295, alertable=false) at gcenv.os.cpp:1231:20
    frame dotnet/coreclr#12: 0x766893b6 libcoreclr.so`WKS::GCHeap::WaitUntilGCComplete(this=0x00060c60, bConsiderGCStart=false) at gcee.cpp:309:40
    frame dotnet/coreclr#13: 0x764b53e0 libcoreclr.so`Thread::RareDisablePreemptiveGC(this=0x00130860) at threadsuspend.cpp:2576:60
    frame dotnet/coreclr#14: 0x7642109c libcoreclr.so`::JIT_PInvokeEndRarePath() at jithelpers.cpp:5456:13
    frame dotnet/coreclr#15: 0x6fff238c - this is System.Globalization.IdnMapping.GetAsciiCore(System.String, Char*, Int32)
    frame dotnet/coreclr#16: 0x6ffeed3e - this is System.Globalization.IdnMapping.GetAscii(System.String, Int32, Int32)
    frame dotnet/coreclr#17: 0x6ffeec2e - this is System.Globalization.IdnMapping.GetAscii(System.String, Int32)
    frame dotnet/coreclr#18: 0x6ffeec00 - this is System.Globalization.IdnMapping.GetAscii(System.String)
    frame dotnet/coreclr#19: 0x66c4dc8c - this is Xunit.Assert.All[[System.__Canon, System.Private.CoreLib]](System.Collections.Generic.IEnumerable`1<System.__Canon>, System.Action`1<System.__Canon>)
    frame dotnet/coreclr#20: 0x66c4c4c2 - this is System.Globalization.Tests.IdnMappingIdnaConformanceTests.GetAscii_Success()
    frame dotnet/coreclr#21: 0x764fab22 libcoreclr.so`CallDescrWorkerInternal at asmhelpers.S:79

Disassembling the System.Globalization.IdnMapping.GetAsciiCore, I can see that the GS cookie location matches what the stack walker expects. But instead of having 0x12345678 in the cookie, there is a "random" value at the point of failure.

The same is true for the System.Globalization.CompareInfo.GetHashCodeOfStringCore.

The failure in both of the test suites and the stack traces (at least the frame with corrupted GS cookie and all other frames towards the TOS) is always the same.

janvorli commented 5 years ago

cc: @BruceForstall

RussKeldorph commented 5 years ago

@wfurt Could you take a look at this?

Maoni0 commented 5 years ago

I don't think this belongs to area-gc... @RussKeldorph which area should this goto?

RussKeldorph commented 5 years ago

@tarekgh Could someone in CoreFX look at this since it appears to be specific to this System.Globalization code, which is doing stackallocs and P/Invoking to native code, among other things?

RussKeldorph commented 5 years ago

@danmosemsft

danmoseley commented 5 years ago

@adamsitnik

adamsitnik commented 5 years ago

I have taken a look at both methods:

In GetHashCodeOfStringCore we allocate a block of memory of a const size on the stack:

https://github.com/dotnet/coreclr/blob/9b0dcaa4bbf1440132cdbb3dfcc950afaef04656/src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.Unix.cs#L855-L856

pin it (the span can be also an array from ArrayPool) and then send it to a native method providing a proper length:

https://github.com/dotnet/coreclr/blob/9b0dcaa4bbf1440132cdbb3dfcc950afaef04656/src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.Unix.cs#L863

In GetAsciiCore we allocate a block of memory of a variable size on the stack:

https://github.com/dotnet/coreclr/blob/9773db1e7b1acb3ec75c9cc0e36bd62dcbacd6d5/src/System.Private.CoreLib/shared/System/Globalization/IdnMapping.Unix.cs#L19-L26

and also send it to a native method providing a proper length:

https://github.com/dotnet/coreclr/blob/9773db1e7b1acb3ec75c9cc0e36bd62dcbacd6d5/src/System.Private.CoreLib/shared/System/Globalization/IdnMapping.Unix.cs#L27

So it looks like from the managed code perspective we don't have any buffer overrun.

I am not an expert (I had to find out what GS Cookie really is) but is it not more a code-gen issue? Similar to https://github.com/dotnet/coreclr/pull/15087 ?

tarekgh commented 5 years ago

@RussKeldorph I believe this is something more related to the runtime and specific to arm-32 . This is not specific to Globalization tests only but it also Collections, and System.Data. I have changed the tag to code-gen which would be reasonable to have them take a look.

wfurt commented 5 years ago

I'm out of office until 7/25 @RussKeldorph. It seems like @adamsitnik has good handle on this.

BruceForstall commented 5 years ago

I tried to reproduce the GetAsciiCore problem, but was unable to. Actually, when running the GetAscii_Success test, it never even JITs GetAscii or GetAscii core, so it seems my build doesn't return anything in the "Where" clause here:

 public void GetAscii_Success()
 {
     Assert.All(Factory.GetDataset().Where(e => e.ASCIIResult.Success), entry =>

BruceForstall commented 5 years ago

@tarekgh @adamsitnik It appears the System.Globalization.IdnMapping.GetAsciiCore api ends up calling a platform-specific install of libicuuc. Is this just expected to exist on the platform? This could presumably differ by architecture/OS, right? So potentially this could be a platform-specific native code bug in this external library?

BruceForstall commented 5 years ago

fwiw, I also don't see GetHashCodeOfStringCore JIT in runs of System.Collections.Specialized.Tests and System.Data.Common.Tests, so I'm not sure what I'm doing wrong here.

tarekgh commented 5 years ago

It appears the System.Globalization.IdnMapping.GetAsciiCore api ends up calling a platform-specific install of libicuuc.

The whole Globalization depends on ICU library. This is not different than other dependencies like openssl or crypto. Do you mean the native ICU library call can result in writing on the stack? this seams unlikely but I cannot tell for sure this is happening. I guess if this happen, we would see the problem in many other places as the IDNA code is used by the networking stack for instance. If it is possible we can break on the memory when the GS cookie got overwritten that can make it easier to know who is doing that. is this something can easily be done?

janvorli commented 5 years ago

I am back from my vacation, so I can continue looking into the issue.

BruceForstall commented 5 years ago

@tarekgh Hopefully Jan can make progress investigating this, but he said above:

but the functions with corrupted GS cookies are called many times before the issue reproes, so I cannot use something as simple as memory watchpoint to find who's corrupting the cookie.

so it might not be easy/possible to just set a data breakpoint on the GS cookie and see who modifies it.

In the tests identified above, a local buffer is allocated, pinned, and passed to the native code. If the code (or the p/invoke stub layer?) writes beyond the end of the allocated buffer, that's when the GS cookie would get overwritten. Note that there is only one GS cookie per frame (that requires one), and it's not precisely immediately following the "unsafe buffer" (e.g., 'stackalloc' buffer), so code can overwrite the buffer "a little" and not necessarily hit the GS cookie (but still corrupt part of the frame).

I agree that it seems unlikely that the native code has an issue, but the fact that a platform-specific bug might be due to platform-specific component (and even OS distribution specific component?) makes it more likely in my mind.

tarekgh commented 5 years ago

but the fact that a platform-specific bug might be due to platform-specific component (and even OS distribution specific component?) makes it more likely in my mind.

That is possible. I hope @janvorli will have news soon.

danmoseley commented 5 years ago

cc @jkoritzinsky @JeremyKuhne in case one of them sees any issue with the interop being done here.

janvorli commented 5 years ago

I have found the culprit. First I have found that the cookie location that we compute is in the middle of the stackalloc-ed buffer in both the System.Globalization.IdnMapping.GetAsciiCore and System.Globalization.CompareInfo.GetHashCodeOfStringCore cases. The GS cookie offset is decoded in EECodeManager::GetGSCookieAddr relative to the caller SP. For methods with stackalloc, the R9 is used to save the SP value at the end of the prolog. That way, the unwinder can compute the caller SP based on the R9. When we start stack walking at InlinedCallFrame though, we don't have the R9 stored in it and when we extract the REGDISPLAY from the InlinedCallFrame in InlinedCallFrame::UpdateRegDisplay, we set R9 to the InlinedCallFrame::m_pCallSiteSP, the same value we set the SP to. That value though is the SP at the call site of the pinvoke, which is a wrong value for R9 for functions with stackalloc, as the SP was already updated by the stackalloc at that point. We then use unwinder to get the caller's SP as a base for getting the GS cookie address. The unwinder starts from the wrong R9 and so it obtains wrong caller's SP.

It seems that we will need to add a new field and save the R9 to the InlinedCallFrame for arm32 and update the arm32 version of InlinedCallFrame::UpdateRegDisplay accordingly.

janvorli commented 5 years ago

@jkotas this is the issue I've mentioned to you today.

BruceForstall commented 5 years ago

@janvorli There is a comment in the JIT in lower.cpp, InsertPInvokeMethodProlog(), that says:

non-x86: method prolog (SP remains
constant in function, after prolog: no
localloc and PInvoke in same function)

So apparently we don't expect to see InlinedCallFrame when there is also localloc?

BruceForstall commented 5 years ago

The use of R9 shouldn't be hard-coded by the VM; it is passed in the unwind codes to specify it as the frame pointer.

janvorli commented 5 years ago

R9 is actually not used as a frame pointer here. The R11 is a frame pointer. The R9 just saves the SP after prolog and based on the comments in the source, it is hardcoded. In JIT, it is represented by REG_SAVED_LOCALLOC_SP define. The code that extracts REGDISPLAY from the InlinedCallFrame also mentiones it: https://github.com/dotnet/coreclr/blob/48ff0937552e540f21835391b693daf47ffabece/src/vm/arm/stubs.cpp#L2375-L2378

BruceForstall commented 5 years ago

We need to make sure target.h in the JIT gets a similar comment about keeping REG_SAVED_LOCALLOC_SP in sync with the VM, to point at src/vm/arm/stubs.cpp.

janvorli commented 5 years ago

So apparently we don't expect to see InlinedCallFrame when there is also localloc

Ah, that would make sense, so it seems we could fix that by making sure we don't generate InlinedCallFrame for functions with stackalloc, right?

jkotas commented 5 years ago

we don't generate InlinedCallFrame for functions with stackalloc, right?

Interop marshaling stubs are using stackalloc and we have to be able to generate InlinedCallFrame in them.

Why is it not a problem on other platforms? Is it because of we are encoding the GSCookie offset relative to FP on other platforms?

janvorli commented 5 years ago

No, we always encode the GSCookie offset relative to the caller SP. The difference is in the unwinding. For ARM, frames with stackalloc start unwinding from the R9. Here is an example of the unwind info of the System.Globalization.CompareInfo:GetHashCodeOfStringCore:

Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0x000634 (not in unwind data)
  Code Words        : 3
  Epilog Count      : 1
  F bit             : 0
  E bit             : 0
  X bit             : 0
  Vers              : 0
  Function Length   : 794 (0x0031a) Actual length = 1588 (0x000634)
  ---- Epilog scopes ----
  ---- Scope 0
  Epilog Start Offset        : 645 (0x00285) Actual offset = 1290 (0x00050a) Offset from main function begin = 1290 (0x00050a)
  Condition                  : 14 (0xe) (always)
  Epilog Start Index         : 6 (0x06)
  ---- Unwind codes ----
    C9          mov sp, r9                          ; opsize 16
    27          add sp, sp, dotnet/runtime#3929                    ; opsize 16
    DF          pop {r4,r5,r6,r7,r8,r9,r10,r11,lr}  ; opsize 32
    EC 06       pop {r1,r2}                         ; opsize 16
    FF          end
    ---- Epilog start at index 6 ----
    C9          mov sp, r9                          ; opsize 16
    27          add sp, sp, dotnet/runtime#3929                    ; opsize 16
    DF          pop {r4,r5,r6,r7,r8,r9,r10,r11,lr}  ; opsize 32
    02          add sp, sp, dotnet/coreclr#8                      ; opsize 16
    FD          end + nop                           ; opsize 16
    FF          end

I don't have an example of the unwind code from the x64 version, but I guess it unwinds using RBP. @BruceForstall do you know why don't we unwind using frame pointer on ARM too?

BruceForstall commented 5 years ago

R9 is actually not used as a frame pointer here. The R11 is a frame pointer.

For the purposes of the ARM unwind codes, R9 is the frame pointer. At least in the sense of being used for unwinding. The unwind codes don't know anything about R11. We maintain an R11 chain just for ETW, I think. And we use either R9 or R11 for locals access, based on local variable offset from them. (Although we only use R11 in EH funclets.) It's possible we could only use R11 and not R9, but I think there are instruction encoding benefits to use positive offsets from R9 versus negative offsets from R11. Note that on ARM64, we never implemented this, so there are various comments like "// TODO-ARM64-CQ: with compLocallocUsed, should we use REG_SAVED_LOCALLOC_SP instead?" So for function with localloc on arm64, locals access almost always uses an extra register and instruction for offset calculation.

janvorli commented 5 years ago

The JitDump log for the System.Globalization.CompareInfo:GetHashCodeOfStringCore says in the gc info: Set stack base register to r11 That's why I've thought the r11 is the frame pointer here.

janvorli commented 5 years ago

Also, the InlinedCallFrame::m_pCalleeSavedFP contains the R11

janvorli commented 5 years ago

And the runtime assumes the m_pCalleeSavedFP is R11: https://github.com/dotnet/coreclr/blob/48ff0937552e540f21835391b693daf47ffabece/src/vm/arm/stubs.cpp#L2372-L2373

BruceForstall commented 5 years ago

Yes, all of those are true. I think it's only in the unwind codes that R9 is considered the frame pointer.

BruceForstall commented 5 years ago

In an offline conversation with @jkotas and @davidwrighton, I learned that we are using the JIT_PInvokeBegin/JIT_PInvokeEnd helpers more now than before, for R2R. Namely, the VM is passing CORJIT_FLAG_USE_PINVOKE_HELPERS more than before.

Previously, the JIT code would set up some data for the InlinedCallFrame in the prolog, and for ARM that meant m_pCallSiteSP was set up by the helper call to CORINFO_HELP_INIT_PINVOKE_FRAME. This value would be SP before any localloc, so on unwind, setting R9 to this would allow unwind to proceed.

Now, JIT_PInvokeBegin is called at the call site, and it does (from src\vm\arm\pinvokestubs.S):

str     r1, [r4, #InlinedCallFrame__m_pCallSiteSP]

so this value is after any localloc. Thus, it can't be used to restore R9.

@jkotas suggests, as you do above, that we should add another field in InlinedCallFrame to store R9 for ARM32. (Apparently there's an issue of R2R file format that needs care, as InlinedCallFrames are part of that format, but perhaps with some extra space already available for expansion?)

jkotas commented 5 years ago

InlinedCallFrames are part of that format, but perhaps with some extra space already available for expansion?

Yes, see comment in CEEInfo::getEEInfo

BruceForstall commented 5 years ago

In case anyone is curious... the reason we require R9 for unwinding on ARM, but no equivalent on ARM64, is due to a detail of the unwind codes. We need to unwind using a frame pointer since the stack pointer will vary with localloc. We need to establish the frame pointer after saving all callee-saved registers (so they can get their own unwind codes). We also want frame pointer chaining (FP points to saved FP). On ARM, we also want to use the "push mask" instruction to save LR, FP, etc., in one instruction. Thus, we would need to use an instruction like "add r11, sp, 0x50" to establish the frame pointer. But there is no ARM unwind code for this instruction. On ARM64, there is, so we can do that. So on ARM, we use R9 specifically so we can establish the frame pointer at the correct time, and also make the frame unwindable, if SP will change later.

BruceForstall commented 5 years ago

At first I thought there might be JIT work for the non-CORJIT_FLAG_USE_PINVOKE_HELPERS case, but that case would be handled by adding the saving of R9 in src\vm\arm\stubs.cpp, GenerateInitPInvokeFrameHelper().

And the other changes would also be in the VM, in the definition of InlinedCallFrame, and JIT_PInvokeBegin (Windows and non-Windows).

So I don't see any required JIT work here, except for perhaps adding some comments about R9 use in target.h and updating the comments in Lowering::InsertPInvokeMethodProlog().

dotnet / runtime

[Arm32/Linux] GS cookie corruption when running some corefx tests. #13000