Closed janvorli closed 4 years ago
cc: @BruceForstall
@wfurt Could you take a look at this?
I don't think this belongs to area-gc... @RussKeldorph which area should this go to?
@tarekgh Could someone in CoreFX look at this since it appears to be specific to this System.Globalization code, which is doing stackallocs and P/Invoking to native code, among other things?
@danmosemsft
@adamsitnik
I have taken a look at both methods:
In GetHashCodeOfStringCore we allocate a block of memory of a constant size on the stack:
pin it (the span can also be an array from ArrayPool) and then send it to a native method, providing the proper length:
In GetAsciiCore we allocate a block of memory of a variable size on the stack:
and also send it to a native method, providing the proper length:
So it looks like from the managed code perspective we don't have any buffer overrun.
I am not an expert (I had to find out what GS Cookie really is) but is it not more a code-gen issue? Similar to https://github.com/dotnet/coreclr/pull/15087 ?
@RussKeldorph I believe this is more related to the runtime and specific to arm-32. It is not specific to the Globalization tests; it also affects Collections and System.Data. I have changed the tag to code-gen, since it seems reasonable to have them take a look.
I'm out of office until 7/25 @RussKeldorph. It seems like @adamsitnik has good handle on this.
I tried to reproduce the GetAsciiCore problem, but was unable to. Actually, when running the GetAscii_Success test, it never even JITs GetAscii or GetAsciiCore, so it seems my build doesn't return anything in the "Where" clause here:
public void GetAscii_Success()
{
Assert.All(Factory.GetDataset().Where(e => e.ASCIIResult.Success), entry =>
@tarekgh @adamsitnik It appears the System.Globalization.IdnMapping.GetAsciiCore api ends up calling a platform-specific install of libicuuc. Is this just expected to exist on the platform? This could presumably differ by architecture/OS, right? So potentially this could be a platform-specific native code bug in this external library?
fwiw, I also don't see GetHashCodeOfStringCore JIT in runs of System.Collections.Specialized.Tests and System.Data.Common.Tests, so I'm not sure what I'm doing wrong here.
It appears the System.Globalization.IdnMapping.GetAsciiCore api ends up calling a platform-specific install of libicuuc.
The whole of Globalization depends on the ICU library. This is no different from other dependencies like openssl or crypto. Do you mean the native ICU library call can result in writing on the stack? This seems unlikely, but I cannot tell for sure whether it is happening. I guess if it were happening, we would see the problem in many other places, as the IDNA code is used by the networking stack, for instance. If possible, we could break on the memory write when the GS cookie gets overwritten; that would make it easier to know who is doing it. Is this something that can easily be done?
I am back from my vacation, so I can continue looking into the issue.
@tarekgh Hopefully Jan can make progress investigating this, but he said above:
but the functions with corrupted GS cookies are called many times before the issue reproes, so I cannot use something as simple as memory watchpoint to find who's corrupting the cookie.
so it might not be easy/possible to just set a data breakpoint on the GS cookie and see who modifies it.
In the tests identified above, a local buffer is allocated, pinned, and passed to the native code. If the code (or the p/invoke stub layer?) writes beyond the end of the allocated buffer, that's when the GS cookie would get overwritten. Note that there is only one GS cookie per frame (that requires one), and it's not precisely immediately following the "unsafe buffer" (e.g., 'stackalloc' buffer), so code can overwrite the buffer "a little" and not necessarily hit the GS cookie (but still corrupt part of the frame).
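To make the "overwrite a little without hitting the cookie" point concrete, here is a minimal sketch of a toy frame. The layout (buffer, then two other locals, then the cookie) and all function names are assumptions made for illustration; the real JIT frame layout is different and is not described in this thread. The cookie value is the one reported in the original issue description.

```python
GS_COOKIE = 0x12345678  # cookie value reported in the issue description


def make_frame(buffer_words=8, gap_words=2):
    """Toy frame, lowest address first: [unsafe buffer][other locals][GS cookie].

    The sizes and ordering are hypothetical, chosen only to show that the
    cookie does not sit immediately after the buffer.
    """
    return [0] * (buffer_words + gap_words) + [GS_COOKIE]


def native_write(frame, count, value=0xDEADBEEF):
    """Pretend native callee that writes `count` words starting at the buffer."""
    for i in range(count):
        frame[i] = value


def cookie_intact(frame):
    """The check a stack walker would do: is the last slot still the cookie?"""
    return frame[-1] == GS_COOKIE


in_bounds = make_frame()
native_write(in_bounds, 8)        # exactly fills the 8-word buffer

small_overrun = make_frame()
native_write(small_overrun, 9)    # one word too many: corrupts a local only

large_overrun = make_frame()
native_write(large_overrun, 11)   # reaches the cookie slot
```

In this model the one-word overrun leaves the cookie untouched while still damaging the frame, which is why a clean cookie check would not prove the callee stayed in bounds.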
I agree that it seems unlikely that the native code has an issue, but the fact that a platform-specific bug might be due to platform-specific component (and even OS distribution specific component?) makes it more likely in my mind.
but the fact that a platform-specific bug might be due to platform-specific component (and even OS distribution specific component?) makes it more likely in my mind.
That is possible. I hope @janvorli will have news soon.
cc @jkoritzinsky @JeremyKuhne in case one of them sees any issue with the interop being done here.
I have found the culprit. First, I found that the cookie location that we compute is in the middle of the stackalloc-ed buffer in both the System.Globalization.IdnMapping.GetAsciiCore and System.Globalization.CompareInfo.GetHashCodeOfStringCore cases.
The GS cookie offset is decoded in EECodeManager::GetGSCookieAddr relative to the caller SP. For methods with stackalloc, R9 is used to save the SP value at the end of the prolog. That way, the unwinder can compute the caller SP based on R9.
When we start stack walking at InlinedCallFrame, though, we don't have R9 stored in it, and when we extract the REGDISPLAY from the InlinedCallFrame in InlinedCallFrame::UpdateRegDisplay, we set R9 to InlinedCallFrame::m_pCallSiteSP, the same value we set the SP to. That value, though, is the SP at the call site of the pinvoke, which is the wrong value for R9 for functions with stackalloc, as the SP was already updated by the stackalloc at that point.
We then use the unwinder to get the caller's SP as a base for getting the GS cookie address. The unwinder starts from the wrong R9 and so it obtains the wrong caller's SP.
It seems that we will need to add a new field to the InlinedCallFrame and save R9 in it for arm32, and update the arm32 version of InlinedCallFrame::UpdateRegDisplay accordingly.
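The arithmetic behind this diagnosis can be sketched with a toy model. Everything here is an assumption for illustration: the class and function names only mirror the VM names loosely, and the frame size, cookie offset, stackalloc size, and addresses are made-up numbers. The `saved_r9` field stands in for the proposed new field; leaving it as `None` models the current behavior.

```python
from dataclasses import dataclass
from typing import Optional

FRAME_SIZE = 200     # hypothetical: bytes between R9 (SP after prolog) and caller SP
COOKIE_OFFSET = -48  # hypothetical: GS cookie offset relative to caller SP


@dataclass
class InlinedCallFrameModel:
    call_site_sp: int               # models m_pCallSiteSP
    saved_r9: Optional[int] = None  # models the proposed new field


def update_reg_display(frame):
    """Return (sp, r9) as a stack walker would recover them from the frame."""
    if frame.saved_r9 is not None:
        r9 = frame.saved_r9         # proposed fix: use the real R9
    else:
        r9 = frame.call_site_sp     # current behavior: R9 := call-site SP
    return frame.call_site_sp, r9


def gs_cookie_addr(frame):
    """Unwind from R9 to the caller SP, then apply the cookie offset."""
    _, r9 = update_reg_display(frame)
    caller_sp = r9 + FRAME_SIZE
    return caller_sp + COOKIE_OFFSET


caller_sp = 0x7FFF0000           # hypothetical address
r9 = caller_sp - FRAME_SIZE      # SP at the end of the prolog
call_site_sp = r9 - 256          # SP after a hypothetical 256-byte stackalloc
true_cookie = caller_sp + COOKIE_OFFSET

buggy = gs_cookie_addr(InlinedCallFrameModel(call_site_sp))
fixed = gs_cookie_addr(InlinedCallFrameModel(call_site_sp, saved_r9=r9))
```

With these numbers the buggy address is off by exactly the stackalloc size and falls inside the stackalloc-ed region `[call_site_sp, r9)`, which matches the observation that the computed cookie location was in the middle of the buffer.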
@jkotas this is the issue I've mentioned to you today.
@janvorli There is a comment in the JIT in lower.cpp, InsertPInvokeMethodProlog(), that says:
non-x86: method prolog (SP remains
constant in function, after prolog: no
localloc and PInvoke in same function)
So apparently we don't expect to see InlinedCallFrame when there is also localloc?
The use of R9 shouldn't be hard-coded by the VM; it is passed in the unwind codes to specify it as the frame pointer.
R9 is actually not used as a frame pointer here. The R11 is a frame pointer. R9 just saves the SP after the prolog and, based on the comments in the source, it is hardcoded. In the JIT, it is represented by the REG_SAVED_LOCALLOC_SP define. The code that extracts REGDISPLAY from the InlinedCallFrame also mentions it: https://github.com/dotnet/coreclr/blob/48ff0937552e540f21835391b693daf47ffabece/src/vm/arm/stubs.cpp#L2375-L2378
We need to make sure target.h in the JIT gets a similar comment about keeping REG_SAVED_LOCALLOC_SP in sync with the VM, to point at src/vm/arm/stubs.cpp.
So apparently we don't expect to see InlinedCallFrame when there is also localloc
Ah, that would make sense, so it seems we could fix that by making sure we don't generate InlinedCallFrame for functions with stackalloc, right?
we don't generate InlinedCallFrame for functions with stackalloc, right?
Interop marshaling stubs are using stackalloc and we have to be able to generate InlinedCallFrame in them.
Why is it not a problem on other platforms? Is it because we are encoding the GSCookie offset relative to FP on other platforms?
No, we always encode the GSCookie offset relative to the caller SP. The difference is in the unwinding. For ARM, frames with stackalloc start unwinding from the R9. Here is an example of the unwind info of the System.Globalization.CompareInfo:GetHashCodeOfStringCore:
Unwind Info:
>> Start offset : 0x000000 (not in unwind data)
>> End offset : 0x000634 (not in unwind data)
Code Words : 3
Epilog Count : 1
F bit : 0
E bit : 0
X bit : 0
Vers : 0
Function Length : 794 (0x0031a) Actual length = 1588 (0x000634)
---- Epilog scopes ----
---- Scope 0
Epilog Start Offset : 645 (0x00285) Actual offset = 1290 (0x00050a) Offset from main function begin = 1290 (0x00050a)
Condition : 14 (0xe) (always)
Epilog Start Index : 6 (0x06)
---- Unwind codes ----
C9 mov sp, r9 ; opsize 16
27 add sp, sp, #156 ; opsize 16
DF pop {r4,r5,r6,r7,r8,r9,r10,r11,lr} ; opsize 32
EC 06 pop {r1,r2} ; opsize 16
FF end
---- Epilog start at index 6 ----
C9 mov sp, r9 ; opsize 16
27 add sp, sp, #156 ; opsize 16
DF pop {r4,r5,r6,r7,r8,r9,r10,r11,lr} ; opsize 32
02 add sp, sp, #8 ; opsize 16
FD end + nop ; opsize 16
FF end
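As a rough illustration of how these codes turn an R9 value into a caller SP, here is a small interpreter that models only the codes appearing in this dump (`00`-`7F`, `C9`, `D8`-`DF`, `EC xx`, `FD`, `FF`), with the decodings taken from the annotations in the dump itself. The function name, the starting addresses, and the 256-byte stackalloc are assumptions, and this is not the VM's or the OS's unwinder.

```python
def unwind_sp(codes, sp, r9):
    """Apply a tiny subset of ARM unwind codes and return the resulting SP."""
    i = 0
    while i < len(codes):
        code = codes[i]
        if code <= 0x7F:                   # add sp, sp, #(code * 4)
            sp += code * 4
        elif 0xC0 <= code <= 0xCF:         # mov sp, rX (only r9 is modeled)
            if code & 0x0F != 9:
                raise ValueError("only 'mov sp, r9' is modeled")
            sp = r9
        elif 0xD8 <= code <= 0xDF:         # pop {r4-rX, lr}: X = (code & 3) + 8
            last_reg = (code & 0x03) + 8
            count = (last_reg - 4 + 1) + ((code >> 2) & 1)  # bit 2 adds lr
            sp += 4 * count
        elif code == 0xEC:                 # pop of low registers, mask in next byte
            i += 1
            sp += 4 * bin(codes[i]).count("1")
        elif code in (0xFD, 0xFF):         # end (+ nop)
            break
        else:
            raise ValueError(f"unwind code {code:#x} is not modeled")
        i += 1
    return sp


PROLOG_CODES = [0xC9, 0x27, 0xDF, 0xEC, 0x06, 0xFF]
EPILOG_CODES = [0xC9, 0x27, 0xDF, 0x02, 0xFD, 0xFF]

true_r9 = 0x7FFF0000           # hypothetical SP at the end of the prolog
call_site_sp = true_r9 - 256   # hypothetical SP after a 256-byte stackalloc

good = unwind_sp(PROLOG_CODES, sp=call_site_sp, r9=true_r9)
bad = unwind_sp(PROLOG_CODES, sp=call_site_sp, r9=call_site_sp)
```

Because the first code replaces SP with R9, the result depends only on the R9 value supplied: 156 + 36 + 8 = 200 bytes above it. Feeding in the call-site SP as R9 therefore shifts the computed caller SP down by the size of the stackalloc.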
I don't have an example of the unwind code from the x64 version, but I guess it unwinds using RBP. @BruceForstall do you know why we don't unwind using the frame pointer on ARM too?
R9 is actually not used as a frame pointer here. The R11 is a frame pointer.
For the purposes of the ARM unwind codes, R9 is the frame pointer. At least in the sense of being used for unwinding. The unwind codes don't know anything about R11. We maintain an R11 chain just for ETW, I think. And we use either R9 or R11 for locals access, based on local variable offset from them. (Although we only use R11 in EH funclets.) It's possible we could only use R11 and not R9, but I think there are instruction encoding benefits to using positive offsets from R9 versus negative offsets from R11. Note that on ARM64, we never implemented this, so there are various comments like "// TODO-ARM64-CQ: with compLocallocUsed, should we use REG_SAVED_LOCALLOC_SP instead?" So for functions with localloc on arm64, locals access almost always uses an extra register and instruction for offset calculation.
The JitDump log for the System.Globalization.CompareInfo:GetHashCodeOfStringCore says in the gc info:
Set stack base register to r11
That's why I thought r11 is the frame pointer here.
Also, the InlinedCallFrame::m_pCalleeSavedFP contains the R11
And the runtime assumes the m_pCalleeSavedFP is R11: https://github.com/dotnet/coreclr/blob/48ff0937552e540f21835391b693daf47ffabece/src/vm/arm/stubs.cpp#L2372-L2373
Yes, all of those are true. I think it's only in the unwind codes that R9 is considered the frame pointer.
In an offline conversation with @jkotas and @davidwrighton, I learned that we are using the JIT_PInvokeBegin/JIT_PInvokeEnd helpers more now than before, for R2R. Namely, the VM is passing CORJIT_FLAG_USE_PINVOKE_HELPERS more than before.
Previously, the JIT code would set up some data for the InlinedCallFrame in the prolog, and for ARM that meant m_pCallSiteSP was set up by the helper call to CORINFO_HELP_INIT_PINVOKE_FRAME. This value would be SP before any localloc, so on unwind, setting R9 to this would allow unwind to proceed.
Now, JIT_PInvokeBegin is called at the call site, and it does (from src\vm\arm\pinvokestubs.S):
str r1, [r4, #InlinedCallFrame__m_pCallSiteSP]
so this value is after any localloc. Thus, it can't be used to restore R9.
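The difference between the two paths comes down to when the SP is captured relative to the localloc. A minimal sketch of that ordering, assuming a toy method that runs its prolog, does one localloc, and then makes a P/Invoke call (the function name, parameters, and addresses are hypothetical, and the two branches only model the behavior described above):

```python
def simulate(use_pinvoke_helpers, localloc_bytes, sp_after_prolog=0x8000):
    """Return (true_r9, recorded_call_site_sp) for the toy method."""
    sp = sp_after_prolog
    r9 = sp                    # R9 saves SP at the end of the prolog
    call_site_sp = None

    if not use_pinvoke_helpers:
        # Modeled older path: the value is recorded in the prolog,
        # before any localloc has moved SP.
        call_site_sp = sp

    sp -= localloc_bytes       # localloc / stackalloc moves SP down

    if use_pinvoke_helpers:
        # Modeled helper path: the value is stored at the call site,
        # after the localloc has already adjusted SP.
        call_site_sp = sp

    return r9, call_site_sp


old_r9, old_recorded = simulate(use_pinvoke_helpers=False, localloc_bytes=256)
new_r9, new_recorded = simulate(use_pinvoke_helpers=True, localloc_bytes=256)
```

In the first case the recorded value equals R9 and can stand in for it during unwinding. In the second it is lower by the localloc size, so R9 has to be saved separately.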
@jkotas suggests, as you do above, that we should add another field in InlinedCallFrame to store R9 for ARM32. (Apparently there's an issue of R2R file format that needs care, as InlinedCallFrames are part of that format, but perhaps with some extra space already available for expansion?)
InlinedCallFrames are part of that format, but perhaps with some extra space already available for expansion?
Yes, see comment in CEEInfo::getEEInfo
In case anyone is curious... the reason we require R9 for unwinding on ARM, but no equivalent on ARM64, is due to a detail of the unwind codes. We need to unwind using a frame pointer since the stack pointer will vary with localloc. We need to establish the frame pointer after saving all callee-saved registers (so they can get their own unwind codes). We also want frame pointer chaining (FP points to saved FP). On ARM, we also want to use the "push mask" instruction to save LR, FP, etc., in one instruction. Thus, we would need to use an instruction like "add r11, sp, 0x50" to establish the frame pointer. But there is no ARM unwind code for this instruction. On ARM64, there is, so we can do that. So on ARM, we use R9 specifically so we can establish the frame pointer at the correct time, and also make the frame unwindable, if SP will change later.
At first I thought there might be JIT work for the non-CORJIT_FLAG_USE_PINVOKE_HELPERS case, but that case would be handled by adding the saving of R9 in src\vm\arm\stubs.cpp, GenerateInitPInvokeFrameHelper().
And the other changes would also be in the VM, in the definition of InlinedCallFrame, and JIT_PInvokeBegin (Windows and non-Windows).
So I don't see any required JIT work here, except for perhaps adding some comments about R9 use in target.h and updating the comments in Lowering::InsertPInvokeMethodProlog().
While running a release build of the corefx tests on a checked build of coreclr on ARM32 (tested on my RPi3 with Raspbian), I found that a couple of corefx test suites fail due to GS cookie corruption detected at GC stack walk time. This happens:
The issue reproes in 80-100% of runs of the test suites. I was trying to debug both cases, but the functions with corrupted GS cookies are called many times before the issue reproes, so I cannot use something as simple as memory watchpoint to find who's corrupting the cookie.
Unfortunately, LLDB / sos plugin on this platform are quite unstable together, so e.g. the clrstack sos command kills LLDB. At least the ip2md works so that I can see what's on the managed stack.
Here is a call stack of the thread with the System.Globalization.IdnMapping.GetAsciiCore on the stack when another thread runs GC and finds the corrupted cookie:
Disassembling the System.Globalization.IdnMapping.GetAsciiCore, I can see that the GS cookie location matches what the stack walker expects. But instead of having 0x12345678 in the cookie, there is a "random" value at the point of failure.
The same is true for the System.Globalization.CompareInfo.GetHashCodeOfStringCore.
The failure in both of the test suites and the stack traces (at least the frame with corrupted GS cookie and all other frames towards the TOS) is always the same.