Unit Test Fail: 143837 ARM/Linux (Cannot Handle NullReference Exception)

dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

https://docs.microsoft.com/dotnet/core/

MIT License

15.2k stars 4.72k forks source link

Unit Test Fail: 143837 ARM/Linux (Cannot Handle NullReference Exception) #5817

Closed myungjoo closed 4 years ago

myungjoo commented 8 years ago

The CoreCLR Unittest of JIT/jit64/gc/regress/vswhidbey/143837 fails at ARM/Linux:

bash-3.2# gdb --args ./corerun 143837.exe 
...
(gdb) run
...

Program received signal SIGSEGV, Segmentation fault.
0xb650c974 in JIT_WriteBarrier () at patchedcode.S:413
413 patchedcode.S: No such file or directory.
(gdb) bt
#0  0xb650c974 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#1  0xb1cf095c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I'll update as soon as some useful data comes out.

myungjoo commented 8 years ago

When corerun dies, it is at

  0xb650c970 <JIT_WriteBarrier>           dmb    sy
 >0xb650c974 <JIT_WriteBarrier+4>         str    r1, [r0, #0]
  0xb650c976 <JIT_WriteBarrier+6>         movw   r3, dotnet/coreclr#12
  0xb650c97a <JIT_WriteBarrier+10>        movt   r3, #45680      ; 0xb270

where

(gdb) info registers
r0             0x4      4
r1             0xb270da6c       2993740396
r2             0xb370235c       3010470748
r3             0xb650c971       3058747761
r4             0xb1cf0039       2983133241
r5             0xbefff310       3204444944
r6             0xbefff1f0       3204444656
r7             0xbeffecf8       3204443384
r8             0xb6d2c000       3067265024
r9             0x70470200       1883701760
r10            0xb6d2c000       3067265024
r11            0xbeffecc0       3204443328
r12            0xb1cf0901       2983135489
sp             0xbeffec68       0xbeffec68
lr             0xb1cf095d       -1311831715
pc             0xb650c974       0xb650c974 <JIT_WriteBarrier+4>
cpsr           0x20030  131120

Did JIT-code just give us a pointer of 0x04 to be referenced? ~~(my first rough guess is on a case where JIT codegen wrote movt/movw after assigning r0 when it should have done so before assigning r0.)~~

The JIT code creating the trouble:

   0xb1cf0900:  stmdb   sp!, {r4, r10, r11, lr}
   0xb1cf0904:  sub sp, dotnet/coreclr#80 ; 0x50
   0xb1cf0906:  add.w   r11, sp, dotnet/runtime#3896    ; 0x58
   0xb1cf090a:  sub.w   r2, r11, dotnet/runtime#3893    ; 0x54
   0xb1cf090e:  movs    r3, dotnet/coreclr#8
   0xb1cf0910:  movs    r0, #0
   0xb1cf0912:  movs    r1, #0
   0xb1cf0914:  stmia   r2!, {r0, r1}
   0xb1cf0916:  subs    r3, dotnet/coreclr#1
   0xb1cf0918:  bhi.n   0xb1cf0914
   0xb1cf091a:  str r0, [r2, #0]
   0xb1cf091c:  add r0, sp, dotnet/coreclr#96 ; 0x60
   0xb1cf091e:  str r0, [sp, dotnet/runtime#3888]   ; 0x4c
   0xb1cf0920:  movw    r3, #36684  ; 0x8f4c
   0xb1cf0924:  movt    r3, #46482  ; 0xb592
   0xb1cf0928:  ldr r3, [r3, #0]
   0xb1cf092a:  cmp r3, #0
   0xb1cf092c:  beq.n   0xb1cf0938
   0xb1cf092e:  movw    r3, #55165  ; 0xd77d
   0xb1cf0932:  movt    r3, #46660  ; 0xb644
   0xb1cf0936:  blx r3
   0xb1cf0938:  nop
   0xb1cf093a:  nop
   0xb1cf093c:  movw    r3, dotnet/coreclr#2936   ; 0xb78
   0xb1cf0940:  movt    r3, #45936  ; 0xb370
   0xb1cf0944:  ldr r3, [r3, #0]
   0xb1cf0946:  movw    r2, dotnet/coreclr#9052   ; 0x235c
   0xb1cf094a:  movt    r2, #45936  ; 0xb370
   0xb1cf094e:  ldr r1, [r2, #0]
   0xb1cf0950:  adds    r0, r3, dotnet/coreclr#4
   0xb1cf0952:  movw    r3, #51569  ; 0xc971
   0xb1cf0956:  movt    r3, #46672  ; 0xb650
   0xb1cf095a:  blx r3
=> 0xb1cf095c:  nop
(gdb) p/x *(0xb3700b78)
$12 = 0x0
(gdb)

where r3 heads to JIT_WriteBarrier.

myungjoo commented 8 years ago

~~Looks like JIT is using uninitialized pointer value (0xb3700b78 in the above instance).~~ (it was just failing GDB watchpoint. someone is setting the value as 0)

(note: x64/Linux/Debug succeeds)

myungjoo commented 8 years ago

Happens in the JIT code generated by the context of:

#0  CodeGen::genGCWriteBarrier (this=0x6b034, tgt=0x6d218, wbf=GCInfo::WBF_BarrierUnchecked) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegencommon.cpp:3854
dotnet/coreclr#1  0xb6965936 in CodeGen::WriteBarrier (this=0x6b034, tgt=0x6d218, assignVal=0x72e00, tgtAddrReg=8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegenlegacy.cpp:3453
dotnet/coreclr#2  0xb6971840 in CodeGen::genCodeForTreeSmpOpAsg (this=0x6b034, tree=0x6d290) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegenlegacy.cpp:12092
dotnet/coreclr#3  0xb696ed2e in CodeGen::genCodeForTreeSmpOp (this=0x6b034, tree=0x6d290, destReg=22527, bestReg=0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegenlegacy.cpp:9545
dotnet/coreclr#4  0xb695d428 in CodeGen::genCodeForTree (this=0x6b034, tree=0x6d290, destReg=22527, bestReg=0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegenlegacy.cpp:12547
dotnet/coreclr#5  0xb6977cb2 in CodeGen::genCodeForBBlist (this=0x6b034) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegenlegacy.cpp:13101
dotnet/coreclr#6  0xb6994262 in CodeGen::genGenerateCode (this=0x6b034, codePtr=0xbeffd480, nativeSizeOfCode=0xbeffd830) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegencommon.cpp:3026
dotnet/coreclr#7  0xb69a43c0 in Compiler::compCompile (this=0x697f8, methodCodePtr=0xbeffd480, methodCodeSize=0xbeffd830, compileFlags=0xbeffd484)
    at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:3911
dotnet/coreclr#8  0xb69a621a in Compiler::compCompileHelper (this=0x697f8, classPtr=0xb5928a34, compHnd=0xbeffd930, methodInfo=0xbeffd88c, methodCodePtr=0xbeffd480, methodCodeSize=0xbeffd830,
    compileFlags=0xbeffd484, instVerInfo=INSTVER_GENERIC_PASSED_VERIFICATION) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:4911
dotnet/coreclr#9  0xb69a51a4 in Compiler::compCompile(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*)::$_0::operator()(Compiler::compCompi
le(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*)::__JITParam*) const (this=0xbeffd19c, __JITpParam=0xbeffd1a0)
    at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:4306
dotnet/coreclr#10 0xb69a504c in Compiler::compCompile (this=0x697f8, methodHnd=0xb5929910, classPtr=0xb5928a34, compHnd=0xbeffd930, methodInfo=0xbeffd88c, methodCodePtr=0xbeffd480, methodCodeSize=0xbeffd830,
    compileFlags=0xbeffd484) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:4328
dotnet/coreclr#11 0xb69ad544 in jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::$_3::operator()(jitNativeCode(CORINF
O_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::__JITParam*) const::{lambda(jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO
_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::$_3::operator()(jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINF
O_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::__JITParam*) const::__JITParam*)#1}::operator()(jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METH
OD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::$_3::operator()(jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJI
T_FLAGS*, void*)::__JITParam*) const::__JITParam*) const (this=0xbeffd288, __JITpParam=0xbeffd28c) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:5522
dotnet/coreclr#12 0xb69a71f6 in jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::$_3::operator()(jitNativeCode(CORINF
O_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::__JITParam*) const (this=0xbeffd350, __JITpParam=0xbeffd358)
    at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:5551
dotnet/coreclr#13 0xb69a7052 in jitNativeCode (methodHnd=0xb5929910, classPtr=0xb5928a34, compHnd=0xbeffd930, methodInfo=0xbeffd88c, methodCodePtr=0xbeffd480, methodCodeSize=0xbeffd830, compileFlags=0xbeffd484,
    inlineInfoPtr=0x0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:5553
dotnet/coreclr#14 0xb6872af0 in CILJit::compileMethod (this=0xb6d4d834 <operator new(unsigned int, CILJitSingletonAllocator const&)::CILJitBuff>, compHnd=0xbeffd930, methodInfo=0xbeffd88c, flags=1048596,
    entryAddress=0xbeffd834, nativeSizeOfCode=0xbeffd830) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/ee_il_dll.cpp:286
dotnet/coreclr#15 0xb632695e in invokeCompileMethodHelper (jitMgr=0x39a38, comp=0xbeffd930, info=0xbeffd88c, flags=1048596, flags2=0, nativeEntry=0xbeffd834, nativeSizeOfCode=0xbeffd830)
    at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:11667
dotnet/coreclr#16 0xb6326a2a in invokeCompileMethod (jitMgr=0x39a38, comp=0xbeffd930, info=0xbeffd88c, flags=1048596, flags2=0, nativeEntry=0xbeffd834, nativeSizeOfCode=0xbeffd830)
    at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:11722
dotnet/coreclr#17 0xb6326c4e in CallCompileMethodWithSEHWrapper(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, unsigned int, unsigned int, unsigned char**, unsigned int*, MethodDesc*)::$_4::operator()(CallCompileMe
thodWithSEHWrapper(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, unsigned int, unsigned int, unsigned char**, unsigned int*, MethodDesc*)::Param*) const (this=0xbeffd62c, pParam=0xbeffd630)
    at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:11784
dotnet/coreclr#18 0xb6326b8c in CallCompileMethodWithSEHWrapper (jitMgr=0x39a38, comp=0xbeffd930, info=0xbeffd88c, flags=1048596, flags2=0, nativeEntry=0xbeffd834, nativeSizeOfCode=0xbeffd830, ftn=0xb5929910)
    at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:11827
...

(224th invokation of genGCWriteBarrier generates 0xb1cf0952-095a above)

myungjoo commented 8 years ago

Looks like JIT/ARM gen of "lea" in this instance is fishy.

myungjoo commented 8 years ago

The code in question, making r0 ( = r3 + 4) to access 0x04,

   0xb1cf093c:  movw    r3, dotnet/coreclr#2936   ; 0xb78
   0xb1cf0940:  movt    r3, #45936  ; 0xb370
   0xb1cf0944:  ldr r3, [r3, #0]

(resulting r3 = 0)

is JIT-created by getEmitter()->emitIns_R_C() called by inst_RV_TT() called by genCodeForTreeLeaf() called by if (kind & GTK_LEAF) in CodeGen::genCodeForTree() Generating code for tree [000010] destReg = 0xb6d27000 bestReg = 0x0 called by CodeGen::genMakeIndAddrMode() called by CodeGen::genMakeAddressable() in codegenlegacy.cpp.

The invalid indirect pointer 0xb3700b78(*0xb5700b78 = 0) is retrived at addr = (ssize_t)emitComp->info.compCompHnd->getFieldAddress(fldHnd, NULL); in emitIns_R_C().

myungjoo commented 8 years ago

The code above is the first JIT-created code that hits 0xb3700b78. (Added printf for all instruction emit functions for JIT debugging) although someone is calling getFieldAddress resulting in 0xb3700b78 a bit before using it, no one writes there.

It looks as if JIT forgot to setup static objects.

myungjoo commented 8 years ago

With a simpler test case that has the same effect on CoreCLR, it appears to be related with NullReference handling. The following simplified test case incurs the same JIT code that commands WriteBarrier to load 0x04.

bash-3.2#  ./corerun 143837.mod.exe 
Starting 1...
GOT the exception 1
Starting 2...
Segmentation fault (core dumped)
bash-3.2# cat 143837.mod.cs
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
//

using System;

public class Test
{
    public static int Main()
    {
        TestClass TC = null;

        Console.WriteLine("Starting 1...");
        try
        {
          Console.WriteLine(TC.strField);
        } catch (NullReferenceException e)
        {
          Console.WriteLine("GOT the exception 1");
        }
        Console.WriteLine("Starting 2...");
        try
        {
          TC.strField = "let's go";
          TC.abcd = TC.abc = 0;
        } catch (NullReferenceException e)
        {
          Console.WriteLine("GOT the exception 2");
        }
        return 100;
    }

};

public class TestClass
{
    public string strField;
    public TestClass(String strIn)
    {
        strField = strIn;
    }
    public virtual string strProperty
    {
        get
        {
            return strField;
        }
        set
        {
            strField = value;
        }
    }
}
bash-3.2#

myungjoo commented 8 years ago

Anyway, regardless of the strField's offset in the TestClass, r0 is trying to load 0x04 (r3 + 4).

myungjoo commented 8 years ago

Rebased with d0e1bcc4d881cb3757cf335a99a64b34d86d8e62 and the point of segmentation fault has changed. (probably, pointing at the same logical location, but, need more analysis..)

myungjoo commented 8 years ago

The currently segmentation-faulting JIT-generated code with the example code above is: (segmentation faults after Starting 2...)

   0xb1ad3832:  movs    r0, r0
   0xb1ad3834:  movs    r0, r0
   0xb1ad3836:  movs    r0, r0
   0xb1ad3838:  push    {r2, r6, lr}
   0xb1ad383a:  cbz r6, 0xb1ad3868
   0xb1ad383c:  stmdb   sp!, {r4, r10, r11, lr}
   0xb1ad3840:  sub sp, dotnet/runtime#3868 ; 0x28
   0xb1ad3842:  add.w   r11, sp, dotnet/coreclr#48    ; 0x30
   0xb1ad3846:  sub.w   r2, r11, dotnet/runtime#3870    ; 0x2c
   0xb1ad384a:  movs    r0, #0
   0xb1ad384c:  movs    r1, #0
   0xb1ad384e:  stmia   r2!, {r0, r1}
   0xb1ad3850:  stmia   r2!, {r0, r1}
   0xb1ad3852:  stmia   r2!, {r0, r1}
   0xb1ad3854:  str r0, [r2, #0]
   0xb1ad3856:  add r0, sp, dotnet/coreclr#56 ; 0x38
   0xb1ad3858:  str r0, [sp, dotnet/coreclr#36]   ; 0x24
   0xb1ad385a:  movw    r3, dotnet/coreclr#16220  ; 0x3f5c
   0xb1ad385e:  movt    r3, #46482  ; 0xb592
   0xb1ad3862:  ldr r3, [r3, #0]
   0xb1ad3864:  cmp r3, #0
   0xb1ad3866:  beq.n   0xb1ad3872
   0xb1ad3868:  movw    r3, #37605  ; 0x92e5
   0xb1ad386c:  movt    r3, #46660  ; 0xb644
   0xb1ad3870:  blx r3
   0xb1ad3872:  nop
   0xb1ad3874:  movs    r3, #0
   0xb1ad3876:  str r3, [sp, dotnet/coreclr#20]
   0xb1ad3878:  movw    r3, dotnet/coreclr#9024   ; 0x2340
   0xb1ad387c:  movt    r3, #45904  ; 0xb350
   0xb1ad3880:  ldr r0, [r3, #0]
   0xb1ad3882:  movw    r3, dotnet/coreclr#12761  ; 0x31d9
   0xb1ad3886:  movt    r3, #45485  ; 0xb1ad
   0xb1ad388a:  blx r3
   0xb1ad388c:  nop
   0xb1ad388e:  nop
   0xb1ad3890:  ldr r3, [sp, dotnet/coreclr#20]
=> 0xb1ad3892:  ldr r0, [r3, dotnet/coreclr#4]
   0xb1ad3894:  movw    r3, dotnet/coreclr#12761  ; 0x31d9
   0xb1ad3898:  movt    r3, #45485  ; 0xb1ad
   0xb1ad389c:  blx r3
   0xb1ad389e:  nop
   0xb1ad38b8:  blx r3
(gdb) bt  
#0  0xb1ad3892 in ?? ()
dotnet/coreclr#1  0xb1ad388c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info registers
r0             0x0  0
r1             0x0  0
r2             0xbeffec98   3204443288
r3             0x0  0
r4             0xb1ad3039   2980917305
r5             0xbefff300   3204444928
r6             0xbefff1e0   3204444640
r7             0xbeffece8   3204443368
r8             0xb6d2b000   3067260928
r9             0xef87ed00   4018662656
r10            0xb6d2b000   3067260928
r11            0xbeffecd8   3204443352
r12            0xb6d2b3b0   3067261872
sp             0xbeffeca8   0xbeffeca8
lr             0xb1ad388d   -1314047859
pc             0xb1ad3892   0xb1ad3892
cpsr           0x600e0030   1611530288
(gdb)

The JIT-generated code is storing 0 at sp+20 and try to load +4 of that stored sp+20 (=0), which inevitably causes segmentation fault. This does not happen with x64.

janvorli commented 8 years ago

@myungjoo I assume you are seeing the problem without gdb too, right? Since null reference exception is a hardware exception (SIGSEGV) that gets caught by gdb first and then if you let it continue, it will pass it to CoreCLR for handling. The JIT_WriteBarrier is one of a few places out of managed code where the SIGSEGV is converted to a managed exception (see IsIPInMarkedJitHelper function here: https://github.com/dotnet/coreclr/blob/master/src/vm/excep.cpp#L7279)

myungjoo commented 8 years ago

@janvorli Yes, I'm seeing segmentation fault without GDB as well. Anyway, if CoreCLR catches SIGSEGV of JITted code, I might be seeing something that shouldn't matter. I'll keep that fact in mind. Thanks.

janvorli commented 8 years ago

That's how null references are detected (and also division by zero, etc.). In your case, could you try to set a breakpoint at HandleHardwareException and see if it gets called? It should happen in your case (after you hit the SIGSEGV first in the GDB and do "c", it should go there).

myungjoo commented 8 years ago

It is because at SIGSEGV, JIT_WriteBarrier is not unwindable and VirtualUnwindToFirstManagedCallFrame called by DispatchedManagedEception fails to unwind up to the managed frame (one step deeper than JIT_WriteBarrier).

A fix being tested. It appears that the whole asmhelpers.S may need a review/fix for the possibility of exception/signal handling. (other assembly functions that I've assume that they are not calling anyone else should be reconsidered.)

myungjoo commented 8 years ago

OK. the fix (adding PROLOG_PUSH, PROLOG_STACK_STAVE to the write barriers) is allowing to handle the secondary exceptions of 143837. However, it exposes another "unhandled exception" in CoreCLR. The fix is going to be bigger than simply adding assembly macros.

Starting 1...
GOT the exception 1
Starting 2...
GOT the exception 2

Assert failure(PID 15489 [0x00003c81], Thread: 15489 [0x3c81]): Consistency check failed: FAILED: state.fFound
    File: /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp Line: 363
    Image: /sdcard/unittest/corerun

**** MessageBox invoked, title 'corerun - Assert Failure (PID 15489, Thread 15489/0x3c81)' ****
  Consistency check failed: FAILED: state.fFound

/source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp, Line: 363

Abort - Kill program
Retry - Debug
Ignore - Keep running

Image:
/sdcard/unittest/corerun

********

Aborted (core dumped)
bash-3.2# cat 143837.mod.cs 
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
//

using System;

public class Test
{
    public static int Main()
    {
        TestClass TC = null;

        Console.WriteLine("Starting 1...");
        try
        {
          Console.WriteLine(TC.strField);
        } catch (NullReferenceException e)
        {
          Console.WriteLine("GOT the exception 1");
        }
        Console.WriteLine("Starting 2...");
        try
        {
          TC.strField = "let's go";
          TC.abcd = TC.abc = 0;
        } catch (NullReferenceException e)
        {
          Console.WriteLine("GOT the exception 2");
        }
        return 100;
    }

};

public class TestClass
{
    public int abc;
    public int abcd;
    public string strField;
    public TestClass(String strIn)
    {
        strField = strIn;
    }
    public virtual string strProperty
    {
        get
        {
            return strField;
        }
        set
        {
            strField = value;
        }
    }
}
bash-3.2#

(the original 143837 shows the same behavior)

janvorli commented 8 years ago

@myungjoo I wonder why the write barrier is not unwindable in its original form. The write barrier is a very performance sensitive thing and adding instructions just to make it unwindable seems unfortunate. Does it mean that the ARM unwinder is unable to unwind a simple function with no stack usage?

myungjoo commented 8 years ago

@janvorli Yes, PAL_VirtualUnwind() behaves erratic at JIT_WriteBarrier (PAL_VirtualUwind's return value = 0 / ContextPC = 0, which denotes that it has hit the bottom of the stack.)

If WriteBarrier is very performance sensitive and is called too frequently by JITted code, then, we need to embed the code block into the JIT code (inlining or bx/b), not making JIT to call (blx/bl) it, especially when it may generate SIGSEGV that requires unwinding.

ps. Testing some rough ideas that may allow libunwind to unwind without actually saving anything to the stack..

janvorli commented 8 years ago

@myungjoo I think I can see why it is unable to unwind the function. It seems the problem is that the JIT_WriteBarrier is defined as follows:

    LEAF_ENTRY JIT_WriteBarrier, _TEXT
    .space (0x84)
    LEAF_END_MARKED JIT_WriteBarrier, _TEXT

So when the assembler compiles it, it doesn't see any code and so it cannot create proper unwind data. Maybe we should put in just fake "bx lr" before the .space (and update the 0x84 accordingly). It would get overwritten by the real barrier code at runtime, but that code should have the same unwind semantics as "bx lr"

myungjoo commented 8 years ago

I thought similar things and tried to add the following in the .space() area of patchedcode.S:

.pad #0

.setfp sp, sp

bx lr

and all of them or two of them

along with .space() realignment (bx lr in thumb = 0x2)

ALL fails with PAL_VirtualUnwind.

Next try: .unwind_raw

janvorli commented 8 years ago

Could it be a bug in the unwinder? It seems it should be able to unwind through simple bx lr, since it is as simple as setting the PC to LR.

myungjoo commented 8 years ago

GLIBC seems to do some tricks for Linux/ARM to let unwind work while not saving registers into the stack. I'll try to follow GLIBC's sysdeps/unix/sysv/linux/arm/sigrestorer.S.

Anyway, this doesn't seem to be a bug of libunwind because it tries to decode the exidx entry, which doesn't exist with the current code or is incorrect with my trials without actually saving things in the stack.

janvorli commented 8 years ago

If we fail to make it unwind the barrier in a reasonable way, we have one fallback option left. We can wrap the PAL_VirtualUnwind in a wrapper function that would first check if the IP is in the JIT_WriteBarrier range and if it is, unwind it just by setting the PC to the LR without calling into the PAL. We have a function to do such unwind that we use for the managed code - Thread::VirtualUnwindLeafCallFrame and that you can use. Or now that I think about it, the only place where we will need to unwind the JIT_WriteBarrier is the Thread::VirtualUnwindToFirstManagedCallFrame and it could only be the first frame. So we can add this piece of code ifdefed for ARM right before the loop:

if (IsIPInMarkedJitHelper(uControlPC))
{
    return VirtualUnwindLeafCallFrame(pContext);
}

janvorli commented 8 years ago

@myungjoo I am sorry, it should not be return, but rather this:

if (IsIPInMarkedJitHelper(uControlPC))
{
    uControlPC = VirtualUnwindLeafCallFrame(pContext);
}

myungjoo commented 8 years ago

I'll try to follow GLIBC's sysdeps/unix/sysv/linux/arm/sigrestorer.S.

Nah.. that doesn't work. It (using cfi) actually saves registers into the stack.

Unless we have JIT_WriteBarrier (and its neighbors) embedded into the JIT-code or let the JIT-code to use b/bx, or sacrifice the performance (use of push {r7, lr}), the workaround @janvorli suggested might be the answer.

if (IsIPInMarkedJitHelper(uControlPC))
{
    uControlPC = VirtualUnwindLeafCallFrame(pContext);
}

I'll try to make an exception in exception handling for JIT_WriteBarrier, JIT_CheckedWriteBarrier, JIT_ByRefWriteBarrier as @janvorli has suggested.

myungjoo commented 8 years ago

Anyway, that expression should be in the loop, not before the loop. JIT_WriteBarrier is the last native frame (the closest to the first managed), not the first native frame (the farthest from the first managed).

Or what about putting this in PAL_VirtualUnwind()'s heading to prevent calling libunwind-arm for JIT_WriteBarrier?

janvorli commented 8 years ago

I have forgotten that when handling hardware exceptions, we have to unwind all the way back from the DispatchManagedException through the signal trampoline before we get to the actual failing function. So you are right.

myungjoo commented 8 years ago

mzx@kohaku:/source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm$ git diff stackwalk.cpp
diff --git a/src/vm/stackwalk.cpp b/src/vm/stackwalk.cpp
index 3c2dfd7..2e3b876 100644
--- a/src/vm/stackwalk.cpp
+++ b/src/vm/stackwalk.cpp
@@ -791,6 +791,16 @@ UINT_PTR Thread::VirtualUnwindToFirstManagedCallFrame(T_CONTEXT* pContext)
         {
             break;
         }
+#ifdef _ARM_
+        if (IsIPInMarkedJitHelper(uControlPc))
+        {
+        // JIT_WriteBarrir does not support libunwind.
+            printf("%s:%d from PC %8.8x\n", __func__, __LINE__, uControlPc);
+            uControlPc = VirtualUnwindCallFrame(pContext);
+            printf("%s:%d to PC %8.8x\n", __func__, __LINE__, uControlPc);
+            return TRUE;
+        }
+#endif // _ARM_
 #endif // !FEATURE_PAL
     }

Starting 1...
GOT the exception 1
Starting 2...
VirtualUnwindToFirstManagedCallFrame:798 from PC b64e47b4
Segmentation fault (core dumped)
bash-3.2#

0xb64e47b4 is the JIT_WriteBarrier and VirtualUnwindCallFrame cannot unwind it.

GDB says:

VirtualUnwindToFirstManagedCallFrame:798 from PC b65097b4

Program received signal SIGSEGV, Segmentation fault.
0xb6326e1c in EECodeInfo::GetFunctionEntry (this=0xbeffdca0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:13711
13711   #define kRBP    5
(gdb) bt
#0  0xb6326e1c in EECodeInfo::GetFunctionEntry (this=0xbeffdca0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:13711
dotnet/coreclr#1  0xb637186a in Thread::VirtualUnwindCallFrame (pContext=0xbeffdda0, pContextPointers=0x0, pCodeInfo=0x0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/stackwalk.cpp:611
dotnet/coreclr#2  0xb6371a98 in Thread::VirtualUnwindToFirstManagedCallFrame (pContext=0xbeffdda0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/stackwalk.cpp:799
dotnet/coreclr#3  0xb64f6244 in DispatchManagedException (ex=...) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:4651
dotnet/coreclr#4  0xb64ee0e0 in HandleHardwareException (ex=0xbeffe3e0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:5118
dotnet/coreclr#5  0xb6a8e8b2 in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:243
dotnet/coreclr#6  0xb6a904ea in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#7  0xb6a8f90c in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#8  <signal handler called>
dotnet/coreclr#9  0xb65097b4 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#10 0xb1cf08d6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p/x m_pJM
$1 = 0x0

Looks like I need to make another unwinding function for JIT Helpers. Anyway, I'll continue this issue tomorrow. :)

janvorli commented 8 years ago

You have used wrong function :-). It was 'VirtualUnwindLeafCallFrame' in my example.

myungjoo commented 8 years ago

VirtualUnwindLeafCallFrame does not work because the context has no valid LR value:

VirtualUnwindToFirstManagedCallFrame:798 from PC b64273b4
VirtualUnwindToFirstManagedCallFrame:800 to PC b64273b4

(b64273b4 is in JIT_WriteBarrier and its 'parent' is shown as if itself because LR = PC, which is "intentional" in libunwind-arm; it does not rely on LR value of cursor, which is accessible with unw_get_reg())

The resulting Context after calling VirtualUnwindLeafCallFrame is:

$8 = {ContextFlags = 0x20000b, R0 = 0xbeffdda0, R1 = 0xa766f300, R2 = 0xa766f300, R3 = 0xa766f300, R4 = 0xbeffe938, R5 = 0xb, R6 = 0xa766f300, R7 = 0xbeffe37c, R8 = 0xb6d2c000, R9 = 0xa766f300, 
  R10 = 0xb6d2c000, R11 = 0xbeffecd8, R12 = 0x0, Sp = 0xbeffdf50, Lr = 0xb64efcd9, Pc = 0xb64efcd8, 
...

Even the first context given to VirtualUnwindToFirstManagedCallFrame has LR invalid (LR register is not going to be recovered after the first bl/blx. It stays in the stack.):

Breakpoint 4, Thread::VirtualUnwindToFirstManagedCallFrame (pContext=0xafe42ffc) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/stackwalk.cpp:763
763 {
(gdb) n
772     PCODE uControlPc = GetIP(pContext);
(gdb) p/x *pContext
$9 = {ContextFlags = 0x20000b, R0 = 0xbeffdda0, R1 = 0x29c3d700, R2 = 0x29c3d700, R3 = 0x29c3d700, R4 = 0xbeffe938, R5 = 0xb, R6 = 0x29c3d700, R7 = 0xbeffdf48, R8 = 0xb6d2c000, R9 = 0x29c3d700, 
  R10 = 0xb6d2c000, R11 = 0xbeffecd8, R12 = 0x0, Sp = 0xbeffdd08, Lr = 0xb64f7df9, Pc = 0xb64f7df9,
...
(gdb) bt
#0  Thread::VirtualUnwindToFirstManagedCallFrame (pContext=0xbeffdda0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/stackwalk.cpp:772
dotnet/coreclr#1  0xb64f7e3c in DispatchManagedException (ex=...) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:4651
dotnet/coreclr#2  0xb64efcd8 in HandleHardwareException (ex=0xbeffe3e0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:5118
dotnet/coreclr#3  0xb6a8fde2 in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:243
dotnet/coreclr#4  0xb6a91a1a in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#5  0xb6a90e3c in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#6  <signal handler called>
dotnet/coreclr#7  0xb1adb892 in ?? ()
dotnet/coreclr#8  0xb1adb88c in ?? ()

myungjoo commented 8 years ago

Normally, functions without calls (bl/blx) like JIT_WriteBarrier may keep lr valid without saving registers into the stack.

However, with an hardware exception (SIGSEGV in this case), the exception handler (frame 6 in the above case) saves and restores all (r0 - r15). So, without proper unwinding information saving at JIT_WriteBarrier, we need to read information (lr) saved by #6 <signal handler called> for JIT_WriteBarrir #7.

Libunwind-arm does not support such backtracking the backtracks. CoreCLR does not either. We seem to let the loop unwind-the-unwind when we meet JITHelper Functions. (need to study glibc for a while. assembly of <signal handler called> is in a very unfamiliar format)

myungjoo commented 8 years ago

Ah.. this seems to be yet another issue from libunwind-arm. I can't sure if libunwind-arm's unw_handle_signal_frame, which is supposed to unwind a signal frame, work or not, but it is not going there (unw_is_signal_frame is not working) with #6 <signal handler called>.

The op code to be compared if it is a signal handler from corerun + glibc is 0x07adf04f (thumb/32b mov.w r7 0xad), which is not recognized by libunwind-arm's Gis_signal_frame.c preprocessor definitions.

I'll verify if correcting libunwind-arm works or not.

I don't sure if this is going to let libunwind-arm unwind JIT_WriteBarrier because this is going to make its psuedo-callee correctly unwind-able.
(Now, we are going to have three libunwind patches pending for libunwind mainline.)
(I may need to reconstruct the CoreCLR unwinding-loop to reuse resulting unw_cursor, not window-context)

current:

1. CCLR-context -> unw_cursor
2. libunwind (unw_cursor)
3. unw_cursor->CCLR-context
4. goto 1 until X

to be

1. CCLR-context -> unw_cursor
2. libunwind (unw_cursor)
3. goto 2 until X
4. unw_cursor->CCLR-context

ps. there is a report that this is related with other failed test cases.

janvorli commented 8 years ago

@myungjoo if the problem is related to the problem crossing the signal trampoline, I think we could make a change that I was considering for some time. If we used the exception context in the DispatchManagedException for the case of hardware exceptions instead of starting at the current frame, then we would not need to cross the trampoline at all. Could you please try the following change to make it work that way?

Add bool parameter "isHardwareException" to the DispatchManagedException
Pass "true" as that parameter to the call to it in HandleHardwareException
Pass "false" as that parameter to the call to it in the UNINSTALL_MANAGED_EXCEPTION_DISPATCHER macro definition.
Add isHardwareException = false; inside the catch
Change the code in the "try" block in the DispatchManagedException to this:

            // Unwind the context to the first managed frame
            CONTEXT frameContext;
            if (isHardwareException)
            {
                frameContext = ex.ContextRecord;
                frameContext.ContextFlags |= CONTEXT_EXCEPTION_ACTIVE;
                if (IsIPInMarkedJitHelper(GetIP(&frameContext)))
                {
                    PAL_VirtualUnwind(&frameContext, NULL);
                }
            }
            else
            {
                RtlCaptureContext(&frameContext);
                UINT_PTR currentSP = GetSP(&frameContext);

                if (Thread::VirtualUnwindToFirstManagedCallFrame(&frameContext) == 0)
                {
                    // There are no managed frames on the stack, so we need to continue unwinding using C++ exception
                    // handling
                    break;
                }

                UINT_PTR firstManagedFrameSP = GetSP(&frameContext);

                // Check if there is any exception holder in the skipped frames. If there is one, we need to unwind them
                // using the C++ handling. This is a special case when the UNINSTALL_MANAGED_EXCEPTION_DISPATCHER was
                // not at the managed to native boundary.
                if (NativeExceptionHolderBase::FindNextHolder(nullptr, (void*)currentSP, (void*)firstManagedFrameSP) != nullptr)
                {
                    break;
                }
            }

If it works as I expect, we can have it that way for all Unix platforms and save few unwind calls. I would then clean it up and create a PR for it.

myungjoo commented 8 years ago

@jkotas I have modified libunwind to correctly identify hardware exception handler and modified PAL_VirtualUnwind to keep calling unw_step() reusing unw_cursor_t cursor if the cursor is at <signal handler called> or at JIT_WriteBarrier.

Then, it successfully passes JIT_WriteBarrier and goes to the JIT stack.

So, the problem in CoreCLR-ARM is that we are losing some context in cursor by converting back-and-forth between libunwind-unw_cursor_t and coreclr-CONTEXT.

How about including unw_cursor_t in CONTEXT struct if it is Linux && ARM and use the unw_cursor_t if it is "valid (not null)"? It removes the need for conversions inside a unwind loop as well.

The (the condition is not correct. it works only for the unit test 143837 and demonstration only) tested patch is:

diff --git a/src/pal/src/exception/seh-unwind.cpp b/src/pal/src/exception/seh-unwind.cpp
index f6dd1d9..89ecd44 100644
--- a/src/pal/src/exception/seh-unwind.cpp
+++ b/src/pal/src/exception/seh-unwind.cpp
@@ -289,7 +289,22 @@ BOOL PAL_VirtualUnwind(CONTEXT *context, KNONVOLATILE_CONTEXT_POINTERS *contextP
     // in the context flags.
     if (unw_is_signal_frame(&cursor) > 0)
     {
-        context->ContextFlags |= CONTEXT_EXCEPTION_ACTIVE;
+        printf("%s:%d SKIPPING signal frame (double unwind)  PC\n", __func__, __LINE__);
+        st = unw_step(&cursor); // unwind again to skip this.
+    UnwindContextToWinContext(&cursor, context);
+        printf("Skipped to %8.8x PC\n", (unsigned int) CONTEXTGetPC(context));
+        st = unw_step(&cursor); // unwind again to skip this.
+    UnwindContextToWinContext(&cursor, context);
+        printf("Skipped again to %8.8x PC\n", (unsigned int) CONTEXTGetPC(context));
+        if (st < 0)
+        {
+            return FALSE;
+        }
+        if (unw_is_signal_frame(&cursor) > 0) {
+            context->ContextFlags |= CONTEXT_EXCEPTION_ACTIVE;
+        } else {
+            context->ContextFlags &= ~CONTEXT_EXCEPTION_ACTIVE;
+        }
     }
     else
     {

Or.. with less modification on the logic (reusing unw_cursor_t only when converting seems not working), we may make: BOOL PAL_VirtualUnwind(CONTEXT *context, KNONVOLATILE_CONTEXT_POINTERS *contextPointers, int repeat = 0) and let PAL_VirtualUnwind call unw_step() repeat + 1 times. Then, at the loop of VirtualUnwindToFirstManagedCallFrame, we may workaround it with help from IsIPInMarkedJitHelper() (once it is detected to be MarkedJitHelper, backtrack the backtracking of the context and double-unwind.)

I'll test this method too.

myungjoo commented 8 years ago

OK. allowing PAL_VirtualUnwind to unwind repeatedly (PAL_VirtualUnwind(context, NULL, repeat_number);) solves the issue. I'll start commiting after more testing:

How stackwalk.cpp changes:

diff --git a/src/vm/stackwalk.cpp b/src/vm/stackwalk.cpp
index 3c2dfd7..55f2f89 100644
--- a/src/vm/stackwalk.cpp
+++ b/src/vm/stackwalk.cpp
@@ -770,6 +770,9 @@ UINT_PTR Thread::VirtualUnwindToFirstManagedCallFrame(T_CONTEXT* pContext)
     CONTRACTL_END;

     PCODE uControlPc = GetIP(pContext);
+#ifdef _ARM_
+    T_CONTEXT pContextCallee;
+#endif

     // unwind out of this function and out of our caller to
     // get our caller's PSP, or our caller's caller's SP.
@@ -778,7 +781,10 @@ UINT_PTR Thread::VirtualUnwindToFirstManagedCallFrame(T_CONTEXT* pContext)
 #ifndef FEATURE_PAL
         uControlPc = VirtualUnwindCallFrame(pContext);
 #else // !FEATURE_PAL
-        BOOL success = PAL_VirtualUnwind(pContext, NULL);
+#ifdef _ARM_
+        pContextCallee = *pContext;
+#endif
+        BOOL success = PAL_VirtualUnwind(pContext, NULL, 0);
         if (!success)
         {
             _ASSERTE(!"Thread::VirtualUnwindToFirstManagedCallFrame: PAL_VirtualUnwind failed");
@@ -791,8 +797,35 @@ UINT_PTR Thread::VirtualUnwindToFirstManagedCallFrame(T_CONTEXT* pContext)
         {
             break;
         }
+#ifdef _ARM_
+        if (IsIPInMarkedJitHelper(uControlPc))
+        {
+            // JIT_WriteBarrir does not support libunwind or VirtualUnwindLeafCallFrame
+           // It did not save unwind information while we got deeper by signal handler
+           // We need to unwind manually from its callee. damn...
+            *pContext = pContextCallee;
+            printf("%s:%d un-unwind PC %8.8x \n", __func__, __LINE__, uControlPc);
+           uControlPc = GetIP(pContext);
+           printf("to %8.8x \n", uControlPc);
+
+            //uControlPc = VirtualUnwindLeafCallFrame(pContext);
+
+           BOOL success = PAL_VirtualUnwind(pContext, NULL, 1);
+            if (!success)
+            {
+                _ASSERTE(!"Thread::VirtualUnwindToFirstManagedCallFrame: PAL_VirtualUnwind failed");
+                EEPOLICY_HANDLE_FATAL_ERROR(COR_E_EXECUTIONENGINE);
+            }
+
+           uControlPc = GetIP(pContext);
+
+           // Unwind based on the CFI information of GLIBC / sigresgorer
+            printf("%s:%d to PC %8.8x\n", __func__, __LINE__, uControlPc);
+        }
+#endif // _ARM_
 #endif // !FEATURE_PAL
     }
+    printf("%s:%d Returning PC %8.8x\n", __func__, __LINE__, uControlPc);

     return uControlPc;
 }
diff --git a/src/pal/src/exception/seh-unwind.cpp b/src/pal/src/exception/seh-unwind.cpp
index f6dd1d9..987f83e 100644
--- a/src/pal/src/exception/seh-unwind.cpp
+++ b/src/pal/src/exception/seh-unwind.cpp
@@ -223,7 +223,7 @@ static void GetContextPointers(unw_cursor_t *cursor, unw_context_t *unwContext,
 #endif
 }

-BOOL PAL_VirtualUnwind(CONTEXT *context, KNONVOLATILE_CONTEXT_POINTERS *contextPointers)
+BOOL PAL_VirtualUnwind(CONTEXT *context, KNONVOLATILE_CONTEXT_POINTERS *contextPointers, int repeat)
 {
     int st;
     unw_context_t unwContext;
@@ -278,18 +278,39 @@ BOOL PAL_VirtualUnwind(CONTEXT *context, KNONVOLATILE_CONTEXT_POINTERS *contextP
     curPc = CONTEXTGetPC(context);
 #endif

-    st = unw_step(&cursor);
-    if (st < 0)
-    {
-        return FALSE;
-    }
+    do {
+        st = unw_step(&cursor);
+        if (st < 0)
+        {
+            return FALSE;
+        }
+    } while ((repeat--) > 0);

Not that out of GDB, it still gets segmentation faults due to excessive printf calls.

myungjoo commented 8 years ago

Hmm.. even without all clutters (printfs), it does not work without GDB (sometimes it works and returns 100)

bash-3.2# ./corerun 143837.exe 
Segmentation fault (core dumped)
bash-3.2# ./corerun 143837.exe 
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
Pass
bash-3.2# echo $?
100
bash-3.2# ./corerun 143837.exe 
Segmentation fault (core dumped)
bash-3.2# ./corerun 143837.exe 
Segmentation fault (core dumped)
bash-3.2# 
bash-3.2# ./corerun 143837.exe 
Segmentation fault (core dumped)
bash-3.2# ./corerun 143837.exe 
Segmentation fault (core dumped)

In GDB:

(gdb) run
Starting program: /sdcard/unittest/corerun 143837.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
[New Thread 0xb6152450 (LWP 8212)]
[New Thread 0xb58f2450 (LWP 8213)]
[New Thread 0xb4eff450 (LWP 8214)]
[New Thread 0xb24fe450 (LWP 8215)]

Program received signal SIGSEGV, Segmentation fault.
0xb650b0d4 in JIT_WriteBarrier () at patchedcode.S:413
413 patchedcode.S: No such file or directory.
(gdb) 
(gdb) c
Continuing.
caught expected exception System.NullReferenceException

Program received signal SIGSEGV, Segmentation fault.
0xb1adbb0e in ?? ()
(gdb) c
Continuing.
caught expected exception System.NullReferenceException

Program received signal SIGSEGV, Segmentation fault.
0xb1adba76 in ?? ()
(gdb) c
Continuing.
caught expected exception System.NullReferenceException

Program received signal SIGSEGV, Segmentation fault.
0xb650b0d4 in JIT_WriteBarrier () at patchedcode.S:413
413 in patchedcode.S
(gdb) c
Continuing.
caught expected exception System.NullReferenceException

Program received signal SIGSEGV, Segmentation fault.
0xb1adeb96 in ?? ()
(gdb) c
Continuing.
caught expected exception System.NullReferenceException

Program received signal SIGSEGV, Segmentation fault.
0xb1adeafe in ?? ()
(gdb) c
Continuing.
caught expected exception System.NullReferenceException
Pass
[Thread 0xb24fe450 (LWP 8215) exited]
[Thread 0xb4eff450 (LWP 8214) exited]
[Thread 0xb58f2450 (LWP 8213) exited]
[Thread 0xb6152450 (LWP 8212) exited]
[Inferior 1 (process 8205) exited with code 0144]
(gdb) c
The program is not being run.
(gdb)

myungjoo commented 8 years ago

The current guess on "Working-with-GDB. Not-working-without-GDB" is that we might be using asynch-signal UNsafe functions in the context triggered by SIGSEGV from JIT_WriteBarrier.

janvorli commented 8 years ago

@myungjoo Could you please try the solution I have suggested? You can give it a quick try first, just by adding this to the beginning of the "try" block of the DispatchManagedException after the CONTEXT frameContext:

                frameContext = ex.ContextRecord;
                if (IsIPInMarkedJitHelper(GetIP(&frameContext)))
                {
                    frameContext.ContextFlags |= CONTEXT_EXCEPTION_ACTIVE;
                    PAL_VirtualUnwind(&frameContext, NULL);
                    <<<< Set a breakpoint here >>>>
                }

When it hits the breakpoint, see if the frameContext contains a correctly unwound context.

janvorli commented 8 years ago

I also have a question - in case of the hardware exception - is the LR register value in the ex.ContextRecord valid?

myungjoo commented 8 years ago

@janvorli Your suggestion of https://github.com/dotnet/coreclr/issues/4859#issuecomment-219689649 (https://github.com/myungjoo/coreclr/commit/f9bd93d1c126c880712c24b66211850c270f49dc) is working as mine (https://github.com/myungjoo/coreclr/commit/3591fd67fd78a258cc79bf1f0fb27a7125df1497), even the unwanted effect is the same.

It mostly fails without GDB and it (always) passes with GDB.

bash-3.2# ./corerun 143837.exe 
Segmentation fault (core dumped)
bash-3.2# ./corerun 143837.mod.exe 
Starting 1...
GOT the exception 1
Starting 2...
Segmentation fault (core dumped)
bash-3.2# ./corerun 143837.exe 
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
Pass
bash-3.2# ./corerun 143837.exe 
Segmentation fault (core dumped)
bash-3.2#

And it shows the correct Lr:

(gdb) break src/vm/exceptionhandling.cpp:4648
Breakpoint 1 at 0xb64f7e88: file /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp, line 4648.
(gdb) c
Continuing.

Breakpoint 1, DispatchManagedException (ex=..., isHardwareException=true) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:4648
4648                frameContext = ex.ContextRecord;
(gdb) n
4649            frameContext.ContextFlags |= CONTEXT_EXCEPTION_ACTIVE;
(gdb) p/x frameContext 
$1 = {ContextFlags = 0x200007, R0 = 0x4, R1 = 0xb250d86c, R2 = 0xbeffec98, R3 = 0xb650b311, R4 = 0xb1adb039, R5 = 0xbefff300, R6 = 0xbefff1e0, R7 = 0xbeffece8, R8 = 0xb6d2c000, R9 = 0xefca5100, 
  R10 = 0xb6d2c000, R11 = 0xbeffecd8, R12 = 0xb6d2c3b0, Sp = 0xbeffeca8, Lr = 0xb1adb8d7, Pc = 0xb650b314, Cpsr = 0xb0030, Fpscr = 0x80000010, Padding = 0xbeffea78, {Q = {{Low = 0x20007200650076, 
        High = 0x63006e00450038}, {Low = 0x6e00690064006f, High = 0x650047002e0067}, {Low = 0x74007900420074, High = 0x20005d00730065}, {Low = 0x65007000780045, High = 0x64006500740063}, {Low = 0x0, 
        High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x4032efc92935df92, High = 0x4059000000000000}, {Low = 0x3ff4cccccccccccd, High = 0x0}, {
        Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}}, D = {0x20007200650076, 0x63006e00450038, 
      0x6e00690064006f, 0x650047002e0067, 0x74007900420074, 0x20005d00730065, 0x65007000780045, 0x64006500740063, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x4032efc92935df92, 0x4059000000000000, 
      0x3ff4cccccccccccd, 0x0 <repeats 13 times>}, S = {0x650076, 0x200072, 0x450038, 0x63006e, 0x64006f, 0x6e0069, 0x2e0067, 0x650047, 0x420074, 0x740079, 0x730065, 0x20005d, 0x780045, 0x650070, 
      0x740063, 0x640065, 0x0 <repeats 16 times>}}, Bvr = {0xbeffeab0, 0xbefff42c, 0x1, 0xbeffea78, 0x0, 0xffe8b4, 0x0, 0x1047360}, Bcr = {0x47360, 0xbeffeaac, 0xbeffeb74, 0x0, 0xb64bd389, 
    0xb250d988, 0xbeffeae0, 0xffeac0}, Wvr = {0xbeffeab0}, Wcr = {0x0}, Padding2 = {0x47360, 0xb250d988}}
(gdb) bt
#0  DispatchManagedException (ex=..., isHardwareException=true) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:4649
dotnet/coreclr#1  0xb64efdf6 in HandleHardwareException (ex=0xbeffe3e0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:5114
dotnet/coreclr#2  0xb6a8fd3a in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:243
dotnet/coreclr#3  0xb6a91972 in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#4  0xb6a90d94 in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#5  <signal handler called>
dotnet/coreclr#6  0xb650b314 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#7  0xb1adb8d6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

ps. I'd need to test without libunwind-arm signal-handler patch later as well.

myungjoo commented 8 years ago

Syscalls from the SIGSEGV handler of coreclr-arm:

gettid ?
getrlimit ?
open OK (except for pipe)
rt_sigprocmask ? (sigprocmask OK)
fstat64 ? (fstat OK)
mmap2 ?
read OK (except for slow devices)
close OK
munmap ?
sched_getaffinity ?
msync ?

myungjoo commented 8 years ago

Potentially dangerous behaviors in sigsegv_handler():

calling malloc. (three instances found in a single execution of sigsegv_handler() all via GetStackLimit(), which seems good to be cached per thread; you don't change the stacklimit of a thread, do you?)

(gdb) c
Continuing.
GOT the exception 1
Starting 2...

Program received signal SIGSEGV, Segmentation fault.
0xb650afd4 in JIT_WriteBarrier () at patchedcode.S:413
413 patchedcode.S: No such file or directory.
(gdb) break printf
Breakpoint 5 at 0xb6d8ba12: file printf.c, line 28.
(gdb) break malloc
Breakpoint 6 at 0xb6da7e88: malloc. (2 locations)
(gdb) c
Continuing.

Breakpoint 6, __GI___libc_malloc (bytes=360) at malloc.c:2875
2875    {
(gdb) bt
#0  __GI___libc_malloc (bytes=360) at malloc.c:2875
dotnet/coreclr#1  0xb6d9b2e8 in __fopen_internal (filename=0xb6faa128 "/proc/self/maps", mode=0xb6faa124 "rce", is32=1) at iofopen.c:73
dotnet/coreclr#2  0xb6d9b34e in _IO_new_fopen (filename=<optimized out>, mode=<optimized out>) at iofopen.c:103
dotnet/coreclr#3  0xb6fa2a2e in pthread_getattr_np (thread_id=3070206544, attr=0xbeffe358) at pthread_getattr_np.c:77
dotnet/coreclr#4  0xb6b36614 in CorUnix::CPalThread::GetStackLimit () at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/thread/thread.cpp:2674
dotnet/coreclr#5  0xb6a8f9a6 in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:229
dotnet/coreclr#6  0xb6a91632 in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#7  0xb6a90a54 in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#8  <signal handler called>
dotnet/coreclr#9  0xb650afd4 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#10 0xb1cf08d6 in ?? ()

...

Breakpoint 6, __GI___libc_malloc (bytes=120) at malloc.c:2875
2875    {
(gdb) bt
#0  __GI___libc_malloc (bytes=120) at malloc.c:2875
dotnet/coreclr#1  0xb6d9bb94 in _IO_getdelim (lineptr=lineptr@entry=0xbeffe290, n=n@entry=0xbeffe294, delimiter=delimiter@entry=10, fp=fp@entry=0xc9dd8) at iogetdelim.c:66
dotnet/coreclr#2  0xb6fa2a8a in pthread_getattr_np (thread_id=3070206544, attr=0xbeffe358) at pthread_getattr_np.c:112
dotnet/coreclr#3  0xb6b36614 in CorUnix::CPalThread::GetStackLimit () at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/thread/thread.cpp:2674
dotnet/coreclr#4  0xb6a8f9a6 in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:229
dotnet/coreclr#5  0xb6a91632 in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#6  0xb6a90a54 in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#7  <signal handler called>
dotnet/coreclr#8  0xb650afd4 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#9  0xb1cf08d6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 

Breakpoint 6, __GI___libc_malloc (bytes=bytes@entry=32) at malloc.c:2875
2875    {
(gdb) bt
#0  __GI___libc_malloc (bytes=bytes@entry=32) at malloc.c:2875
dotnet/coreclr#1  0xb6da859e in __GI___libc_realloc (oldmem=0x0, bytes=32) at malloc.c:2971
dotnet/coreclr#2  0xb6fa29b6 in pthread_getattr_np (thread_id=3070206544, attr=0xbeffe358) at pthread_getattr_np.c:163
dotnet/coreclr#3  0xb6b36614 in CorUnix::CPalThread::GetStackLimit () at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/thread/thread.cpp:2674
dotnet/coreclr#4  0xb6a8f9a6 in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:229
dotnet/coreclr#5  0xb6a91632 in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#6  0xb6a90a54 in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#7  <signal handler called>
dotnet/coreclr#8  0xb650afd4 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#9  0xb1cf08d6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

janvorli commented 8 years ago

@myungjoo there is no problem in calling the unsafe syscalls, malloc and other function from the syscall handler as long as we can guarantee that the faulting code was running in the jitted code or in the few helpers. I have recently reviewed and fixed the hardware exception handling path to get rid of all such calls until we know we were running the safe code. But I could have missed something.

myungjoo commented 8 years ago

@janvori hmm.. then, I should search for other possibilities as well.

Anyway, in common_signal_handler(), in case of error (not running on top of GDB), it returns from SEHProcessException() and goes to PROCNotifyProcessShutdown() of sigsegv_handler().

It is going to consume more time because I cannot use GDB for the bug that appears only when GDB is not used. :(

Without running on top of GDB, the if (g_safeExceptionCheckFunction(pointers->ContextRecord, pointers->ExceptionRecord)) condition of SEHProcessException() sometimes becomes false, which makes it unable to handle the SIGSEGV.

parjong commented 8 years ago

It seems that JIT/Regression/CLR-x86-JIT/V1-M09.5-PDC/b16335/b16335/b16335 testcase failure is also related with this issue:

(gdb) r
Starting program: /mnt/dotnet-latest/Linux.arm.Debug/corerun -c dotnet-latest/Linux.arm.Debug/ ut/JIT/Regression/CLR-x86-JIT/V1-M09.5-PDC/b16335/b16335/b16335.exe
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

Program received signal SIGSEGV, Segmentation fault.
0x7676ed64 in JIT_WriteBarrier () at patchedcode.S:413
413     patchedcode.S: No such file or directory.
(gdb) bt
#0  0x7676ed64 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#1  0x71dda4e6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) bt
#0  0x7676ed64 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#1  0x71dda4e6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x763f2a62 in ?? () from /lib/arm-linux-gnueabihf/libunwind-arm.so.8
(gdb) bt
#0  0x763f2a62 in ?? () from /lib/arm-linux-gnueabihf/libunwind-arm.so.8
dotnet/coreclr#1  0x763f350c in _Uarm_step () from /lib/arm-linux-gnueabihf/libunwind-arm.so.8
dotnet/coreclr#2  0x76b184fe in PAL_VirtualUnwind (context=0x7effd9c0, contextPointers=0x0) at /home/parjong/projects/toolchains/coreclr/src/pal/src/exception/seh-unwind.cpp:281
dotnet/coreclr#3  0x765d37bc in Thread::VirtualUnwindToFirstManagedCallFrame (pContext=0x7effd9c0) at /home/parjong/projects/toolchains/coreclr/src/vm/stackwalk.cpp:781
dotnet/coreclr#4  0x7675b980 in DispatchManagedException (ex=...) at /home/parjong/projects/toolchains/coreclr/src/vm/exceptionhandling.cpp:4649
dotnet/coreclr#5  0x767538e4 in HandleHardwareException (ex=0x7effdfe8) at /home/parjong/projects/toolchains/coreclr/src/vm/exceptionhandling.cpp:5103
dotnet/coreclr#6  0x76b181ee in SEHProcessException (pointers=0x7effe278) at /home/parjong/projects/toolchains/coreclr/src/pal/src/exception/seh.cpp:243
dotnet/coreclr#7  0x76b19ec6 in common_signal_handler (code=11, siginfo=0x7effe570, sigcontext=0x7effe5f0, numParams=2) at /home/parjong/projects/toolchains/coreclr/src/pal/src/exception/signal.cpp:617
dotnet/coreclr#8  0x76b19284 in sigsegv_handler (code=11, siginfo=0x7effe570, context=0x7effe5f0) at /home/parjong/projects/toolchains/coreclr/src/pal/src/exception/signal.cpp:285
dotnet/coreclr#9  0x71a1a9d2 in sigsegv_handler (code=11, siginfo=0x7effe570, context=0x7effe5f0) at /home/parjong/projects/toolchains/coreclr/src/pal/src/exception/signal.cpp:290
dotnet/coreclr#10 <signal handler called>
dotnet/coreclr#11 0x7676ed64 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#12 0x71dda4e6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

myungjoo commented 8 years ago

@parjong If SIGSEGV is coming from libunwind, it is not the bug of this one although it might only mean that you don't have the libunwind fixes included. (will try to push libunwind to merge the patches soon again along with additional libunwind fixes)

myungjoo commented 8 years ago

@janvorli IsSafeToHandleHardwareException is behaving strangely.

With GDB, it always thinks that the corresponding exception is coming from JitHelper (IsIPInMarkedJitHelper() is true) Without GDB, it often (more than 50%) thinks that it is `false'.

I'll verify if the input PC is really JitHelper (write barrier) or not when it is not on GDB (with GDB, it is always).

myungjoo commented 8 years ago

OK. Looks like I found what is causing it. With GDB, JIT_WriteBarrier has dmb sy before str r1, [r0, #0] offsetting +4 and without GDB, str r1, [r0, #0] is the first instruction and the offset is 0, which confuses IsIPInMarkedJitHelper. I'll write and show a fix soon.

myungjoo commented 8 years ago

The constants used by IsIPInMarkedJitHelper are pointing to THUMB instruction, which has the least bit set to denote that they are THUMB. However, their real locations do not have their least bit set.

For example, if the real location is at 100, the location constant is 101, which makes the condition statement of CHECK_RANGE errornous. However, with GDB, JIT thinks it has additional threads and emits dmb, which makes the errornous condition work ( +4 > +1).

To be safe, for all ARM/THUMB PC and values compared with PC should reset its least bit. (they are aligned by 16bit anyway)

myungjoo commented 8 years ago

@parjong note that, after this coreclr fix, libunwind-arm without any fixes suggested by me works for coreclr anyway.

The fixes for libunwind-arm are valid, but after the fix for coreclr, we do not visit (or aren't affected by) such corner cases of libunwind-arm.