Closed BruceForstall closed 4 years ago
The issue is not stable, it happens during different tests every time.
As we found with @jkotas there are several requirements for this issue:
1. `complus_TailcallStress=1`;
2. caller that calls callee with TailCallHelperStub;
3. callee throws a hardware exception;
4. another thread that calls `GC.Collect()`.

So if the GC tries to stop the first thread while it unwinds the stack for the exception, the issue happens.
I have tried to make a small repro, but it has not hit the issue.
The previous description was wrong.
Now we understand what is happening in this issue. dotnet/coreclr#17920 adds an IL test that doesn't require stress mode to hit this. A C# example is here.
To hit this issue we need:
1. `foo()` that creates Dispatch Stub;
2. `foo()` with `this == null`, to cause AV in the dispatch stub;
3. `foo()`, to create an unmanaged frame on the top of the stack.

We create the arm dispatch stub in src\vm\arm\stubs.cpp
https://github.com/dotnet/coreclr/blob/6f0bb947138c6f75a1721fef7f6c54d4b01282dc/src/vm/arm/stubs.cpp#L1005
This stub checks that the current method table is equal to the cached one. It expects that `this` may be null and has this comment:
https://github.com/dotnet/coreclr/blob/6f0bb947138c6f75a1721fef7f6c54d4b01282dc/src/vm/arm/stubs.cpp#L1027-L1032
But the VM's personality routine can't handle this AV if the first frame on the stack is not from managed code. This happens when we do tail call optimization and the frame on the stack is JIT_TailCallHelperStub_ReturnAddress.
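The failing combination described above can be modeled in a few lines. This is only a sketch under stated assumptions: `MethodTable`, `Object`, `Frame`, `dispatchStubCheck` and `canHandleStubFault` are hypothetical names made up for illustration, not the real CoreCLR types or functions, and the access violation is represented as a return value instead of an actual fault.

```cpp
#include <vector>

// Hypothetical stand-ins, not the real CoreCLR definitions.
struct MethodTable {};
struct Object { const MethodTable* methodTable; };

enum class StubResult { Hit, Miss, AccessViolation };

// Models the dispatch stub check: read the method table through `thisPtr`
// and compare it with the cached one. The real stub performs the load
// unconditionally, so a null `thisPtr` faults; here the fault is only
// reported as a result value.
StubResult dispatchStubCheck(const Object* thisPtr, const MethodTable* cached) {
    if (thisPtr == nullptr) return StubResult::AccessViolation;
    return thisPtr->methodTable == cached ? StubResult::Hit : StubResult::Miss;
}

struct Frame { const char* name; bool isManaged; };

// Models the condition stated in the text: the AV is handled only when the
// first frame on the stack (the caller of the stub) is managed code.
bool canHandleStubFault(const std::vector<Frame>& callers) {
    return !callers.empty() && callers.front().isManaged;
}
```

In this model a normal call reaches the stub with a managed caller on top, so the fault can be handled, while a tail call through the helper leaves an unmanaged frame there and the same fault cannot be handled.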
Another issue is that the Windows arm stack unwinder subtracts 2 from every return address on the stack, to get an instruction pointer that is before the next instruction.
So if before the AV we had such stack:
[0] CLRStub[VSD_DispatchStub]@fffffffefffffffe:
[1] CoreCLR!Jit_TailCallHelperStub_returnAddress
after AV we have:
[5] coreclr!NakedThrowHelper
[6] coreclr!TailCallHelperStub <- it points to CoreCLR!Jit_TailCallHelperStub_returnAddress - 2 and shows an incorrect frame.
Maybe we should add a nop to the beginning of CoreCLR!Jit_TailCallHelperStub_returnAddress to handle this.
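The effect of the -2 adjustment can be illustrated with a toy symbol lookup. This is a sketch, not the actual Windows unwinder: the symbol table, the addresses 0x1000 and 0x1040, and the helpers `symbolFor` and `frameFor` are all invented for the example, and the Thumb bit is ignored.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Hypothetical symbol table: start address -> symbol name.
using SymbolTable = std::map<std::uint32_t, std::string>;

// Returns the symbol with the greatest start address that is not above `pc`.
std::string symbolFor(const SymbolTable& symbols, std::uint32_t pc) {
    auto it = symbols.upper_bound(pc);
    if (it == symbols.begin()) return "<unknown>";
    return std::prev(it)->second;
}

// Models the adjustment described above: the return address is moved back
// by 2 bytes before it is attributed to a function.
std::string frameFor(const SymbolTable& symbols, std::uint32_t returnAddress) {
    return symbolFor(symbols, returnAddress - 2);
}
```

With `TailCallHelperStub` at 0x1000 and `JIT_TailCallHelperStub_ReturnAddress` at 0x1040, a return address of exactly 0x1040 is attributed to `TailCallHelperStub`, which matches the incorrect frame [6]. If the return address were 2 bytes past the label, for example after a 2-byte nop, the lookup would land on `JIT_TailCallHelperStub_ReturnAddress` itself; that is one reading of the nop suggestion, not a confirmed fix.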
The issue repros with 'arm32_legacy_jit', and the comment about the expected null was added in 2010, so it is not a regression from 2.0.
I can't find a fix on Jit side that won't produce big asm diffs and will be risk free for 2.1.
@erozenfeld could you please check that I did not forget anything important?
> I can't find a fix on Jit side that won't produce big asm diffs and will be risk free for 2.1.

Have you considered changing this https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L8050 to `#if defined(_TARGET_X86_) || defined(_ARM_)`? It should only produce diffs around slow VSD tailcalls, which are very rare.
> Have you considered changing this https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L8050 to `#if defined(_TARGET_X86_) || defined(_ARM_)`? It should only produce diffs around slow VSD tailcalls, which are very rare.
on jit side we can:
I will collect diffs for the third option.
> Have you considered changing this https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L8050 to `#if defined(_TARGET_X86_) || defined(_ARM_)`? It should only produce diffs around slow VSD tailcalls, which are very rare.
That part is only for `#elif defined(_TARGET_XARCH_) && !defined(LEGACY_BACKEND)`; for arm this logic is at https://github.com/dotnet/coreclr/blob/6f0bb947138c6f75a1721fef7f6c54d4b01282dc/src/jit/morph.cpp#L7794-L7800
@sandreenko It looks like System.Runtime.Tests is still excluded in the arm\corefx_test_exclusions.txt file against this issue, which is now fixed. Can you please remove the exclusion? E.g.:
System.Runtime.Tests # https://github.com/dotnet/coreclr/issues/17585
Consistent failure in Windows arm32 corefx test System.Runtime.Tests with COMPlus_JitStress=2:
https://ci.dot.net/job/dotnet_coreclr/job/master/view/arm/job/jitstress/job/arm_cross_checked_windows_nt_corefx_jitstress2_tst/10/consoleText