[arm32] Assert failure: ExecutionManager::IsManagedCode(adrRedirectedIP)

BruceForstall commented 6 years ago

Consistent failure in Windows arm32 corefx test System.Runtime.Tests with COMPlus_JitStress=2:

https://ci.dot.net/job/dotnet_coreclr/job/master/view/arm/job/jitstress/job/arm_cross_checked_windows_nt_corefx_jitstress2_tst/10/consoleText

Running: C:\Users\robox\j\workspace\arm_cross_che---6fedd1e9\_\fx\bin\tests\System.Runtime.Tests\netcoreapp-Windows_NT-Release-arm\RunTests.cmd C:\Users\robox\j\workspace\arm_cross_che---6fedd1e9\_\fx\bin\testhost\netcoreapp-Windows_NT-Release-arm
Using C:\Users\robox\j\workspace\arm_cross_che---6fedd1e9\_\fx\bin\testhost\netcoreapp-Windows_NT-Release-arm as the test runtime folder.
Executing in C:\Users\robox\j\workspace\arm_cross_che---6fedd1e9\_\fx\bin\tests\System.Runtime.Tests\netcoreapp-Windows_NT-Release-arm\ 
----- start  0:15:29.22 ===============  To repro directly: ===================================================== 
pushd C:\Users\robox\j\workspace\arm_cross_che---6fedd1e9\_\fx\bin\tests\System.Runtime.Tests\netcoreapp-Windows_NT-Release-arm\
set COMPlus_JitStress=2
call C:\Users\robox\j\workspace\arm_cross_che---6fedd1e9\_\fx\bin\testhost\netcoreapp-Windows_NT-Release-arm\dotnet.exe xunit.console.netcore.exe System.Runtime.Tests.dll  -xml testResults.xml -notrait Benchmark=true -notrait category=nonnetcoreapptests -notrait category=nonwindowstests  -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing
popd
===========================================================================================================

C:\Users\robox\j\workspace\arm_cross_che---6fedd1e9\_\fx\bin\tests\System.Runtime.Tests\netcoreapp-Windows_NT-Release-arm>set COMPlus_JitStress=2 

C:\Users\robox\j\workspace\arm_cross_che---6fedd1e9\_\fx\bin\tests\System.Runtime.Tests\netcoreapp-Windows_NT-Release-arm>call C:\Users\robox\j\workspace\arm_cross_che---6fedd1e9\_\fx\bin\testhost\netcoreapp-Windows_NT-Release-arm\dotnet.exe xunit.console.netcore.exe System.Runtime.Tests.dll  -xml testResults.xml -notrait Benchmark=true -notrait category=nonnetcoreapptests -notrait category=nonwindowstests  -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
xUnit.net console test runner (32-bit .NET Core)
Copyright (C) 2014 Outercurve Foundation.

Discovering: System.Runtime.Tests
Discovered:  System.Runtime.Tests
Starting:    System.Runtime.Tests

Assert failure(PID 6980 [0x00001b44], Thread: 16036 [0x3ea4]): ExecutionManager::IsManagedCode(adrRedirectedIP)

<no module>! <no symbol> + 0x0 (0x00000000)
    File: d:\j\workspace\arm_cross_che---b006c545\src\vm\exceptionhandling.cpp Line: 5594
    Image: C:\Users\robox\j\workspace\arm_cross_che---6fedd1e9\_\fx\bin\testhost\netcoreapp-Windows_NT-Release-arm\dotnet.exe

----- end  0:16:09.48 ----- exit code 123456789 ----------------------------------------------------------
COREFX TEST FAILED

sandreenko commented 6 years ago

The issue is not stable, it happens during different tests every time.

BruceForstall commented 6 years ago

Also JitStress=1:

https://ci.dot.net/job/dotnet_coreclr/job/master/view/arm/job/jitstress/job/arm_cross_checked_windows_nt_corefx_jitstress1_tst/14/consoleText

sandreenko commented 6 years ago

~~As we found with @jkotas there are several requirements for this issue:~~ ~~1. complus_TailcallStress=1;~~ ~~2. caller that calls callee with TailCallHelperStub;~~ ~~3. callee throws a hardware exception;~~ ~~3. another thread that calls GC.Collect()~~~~

~~so if GC tries to stop the first thread when it unwinds the stack for the exception then the issue happens.~~

~~I have tried to do a small repro, but it have not hit the issue.~~

The previous description was wrong.

sandreenko commented 6 years ago

So now we understand what is happening in this issue, dotnet/coreclr#17920 adds IL test that doesn't require stress mode to hit this. C# example is here.

To hit this issue we need:

An interface call foo() that creates Dispatch Stub;
call foo() with this == null, to cause AV in the dispatch stub;
tail call foo(), to create an unmanaged frame on the top of the stack.

We create arm dispatch stub in src\vm\arm\stubs.cpp https://github.com/dotnet/coreclr/blob/6f0bb947138c6f75a1721fef7f6c54d4b01282dc/src/vm/arm/stubs.cpp#L1005

and this stub checks that the current method table is equal to the cached one, it expects this to be null and has such comment: https://github.com/dotnet/coreclr/blob/6f0bb947138c6f75a1721fef7f6c54d4b01282dc/src/vm/arm/stubs.cpp#L1027-L1032

But the VM's personality routine can't handle this AV if the first frame on the stack is not from managed code. It happens when we do tail call optimization and the frame on the stack is JIT_TailCallHelperStub_ReturnAddress.

Another issue is that Windows arm unwind stack subs -2 for all addresses on the stack to get instruction pointer that was before the next instruction.

So if before the AV we had such stack:

[0] CLRStub[VSD_DispatchStub]@fffffffefffffffe:
[1] CoreCLR!Jit_TailCallHelperStub_returnAddress

after AV we have:

[5] coreclr!NakedThrowHelper
[6] coreclr!TailCallHelperStub <- it point to CoreCLR!Jit_TailCallHelperStub_returnAddress - 2 and shows incorrent frame.

maybe we should add a nop to the beggining of CoreCLR!Jit_TailCallHelperStub_returnAddress to handle this.

The issue repros with 'arm32_legacy_jit' and the comment about expected null was added in 2010. So it is not a regression from 2.0.

I can't find a fix on Jit side that won't produce big asm diffs and will be risk free for 2.1.

@erozenfeld could you please check that I did not forget anything important?

jkotas commented 6 years ago

I can't find a fix on Jit side that won't produce big asm diffs and will be risk free for 2.1.

Have you considered changing this https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L8050 to #if defined(_TARGET_X86_) || defined(_ARM_) ? I should only produce diff around slow VSD tailcalls that are very rare.

sandreenko commented 6 years ago

Have you considered changing this https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L8050 to #if defined(_TARGETX86) || defined(ARM) ? I should only produce diff around slow VSD tailcalls that are very rare.

on jit side we can:

disable optimized VSD for tail calls;
do not do tailcall optimization if it is not an explicit IL prefix, if it is - disable VSD.
add an explicit null check before each VSD taillcall;

I will collect diffs for the third option.

sandreenko commented 6 years ago

Have you considered changing this https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L8050 to #if defined(_TARGETX86) || defined(ARM) ? I should only produce diff around slow VSD tailcalls that are very rare.

That part is only for elif defined(_TARGET_XARCH_) && !defined(LEGACY_BACKEND), for arm this logic is https://github.com/dotnet/coreclr/blob/6f0bb947138c6f75a1721fef7f6c54d4b01282dc/src/jit/morph.cpp#L7794-L7800

BruceForstall commented 6 years ago

@sandreenko It looks like System.Runtime.Tests is still excluded in the arm\corefx_test_exclusions.txt file, against this issue that is now fixed. Can you please remove the exclusion?

e.g.:

System.Runtime.Tests                 # https://github.com/dotnet/coreclr/issues/17585

dotnet / runtime

[arm32] Assert failure: ExecutionManager::IsManagedCode(adrRedirectedIP) #10162