Closed myungjoo closed 4 years ago
When corerun dies, it is at
0xb650c970 <JIT_WriteBarrier> dmb sy
>0xb650c974 <JIT_WriteBarrier+4> str r1, [r0, #0]
0xb650c976 <JIT_WriteBarrier+6> movw r3, dotnet/coreclr#12
0xb650c97a <JIT_WriteBarrier+10> movt r3, #45680 ; 0xb270
where
(gdb) info registers
r0 0x4 4
r1 0xb270da6c 2993740396
r2 0xb370235c 3010470748
r3 0xb650c971 3058747761
r4 0xb1cf0039 2983133241
r5 0xbefff310 3204444944
r6 0xbefff1f0 3204444656
r7 0xbeffecf8 3204443384
r8 0xb6d2c000 3067265024
r9 0x70470200 1883701760
r10 0xb6d2c000 3067265024
r11 0xbeffecc0 3204443328
r12 0xb1cf0901 2983135489
sp 0xbeffec68 0xbeffec68
lr 0xb1cf095d -1311831715
pc 0xb650c974 0xb650c974 <JIT_WriteBarrier+4>
cpsr 0x20030 131120
Did JIT-code just give us a pointer of 0x04
to be referenced?
(my first rough guess is on a case where JIT codegen wrote movt/movw after assigning r0
when it should have done so before assigning r0
.)
The JIT code creating the trouble:
0xb1cf0900: stmdb sp!, {r4, r10, r11, lr}
0xb1cf0904: sub sp, dotnet/coreclr#80 ; 0x50
0xb1cf0906: add.w r11, sp, dotnet/runtime#3896 ; 0x58
0xb1cf090a: sub.w r2, r11, dotnet/runtime#3893 ; 0x54
0xb1cf090e: movs r3, dotnet/coreclr#8
0xb1cf0910: movs r0, #0
0xb1cf0912: movs r1, #0
0xb1cf0914: stmia r2!, {r0, r1}
0xb1cf0916: subs r3, dotnet/coreclr#1
0xb1cf0918: bhi.n 0xb1cf0914
0xb1cf091a: str r0, [r2, #0]
0xb1cf091c: add r0, sp, dotnet/coreclr#96 ; 0x60
0xb1cf091e: str r0, [sp, dotnet/runtime#3888] ; 0x4c
0xb1cf0920: movw r3, #36684 ; 0x8f4c
0xb1cf0924: movt r3, #46482 ; 0xb592
0xb1cf0928: ldr r3, [r3, #0]
0xb1cf092a: cmp r3, #0
0xb1cf092c: beq.n 0xb1cf0938
0xb1cf092e: movw r3, #55165 ; 0xd77d
0xb1cf0932: movt r3, #46660 ; 0xb644
0xb1cf0936: blx r3
0xb1cf0938: nop
0xb1cf093a: nop
0xb1cf093c: movw r3, dotnet/coreclr#2936 ; 0xb78
0xb1cf0940: movt r3, #45936 ; 0xb370
0xb1cf0944: ldr r3, [r3, #0]
0xb1cf0946: movw r2, dotnet/coreclr#9052 ; 0x235c
0xb1cf094a: movt r2, #45936 ; 0xb370
0xb1cf094e: ldr r1, [r2, #0]
0xb1cf0950: adds r0, r3, dotnet/coreclr#4
0xb1cf0952: movw r3, #51569 ; 0xc971
0xb1cf0956: movt r3, #46672 ; 0xb650
0xb1cf095a: blx r3
=> 0xb1cf095c: nop
(gdb) p/x *(0xb3700b78)
$12 = 0x0
(gdb)
where r3 heads to JIT_WriteBarrier
.
Looks like JIT is using uninitialized pointer value ( (it was just failing GDB watchpoint. someone is setting the value as 0) 0xb3700b78
in the above instance).
(note: x64/Linux/Debug succeeds)
Happens in the JIT code generated by the context of:
#0 CodeGen::genGCWriteBarrier (this=0x6b034, tgt=0x6d218, wbf=GCInfo::WBF_BarrierUnchecked) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegencommon.cpp:3854
dotnet/coreclr#1 0xb6965936 in CodeGen::WriteBarrier (this=0x6b034, tgt=0x6d218, assignVal=0x72e00, tgtAddrReg=8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegenlegacy.cpp:3453
dotnet/coreclr#2 0xb6971840 in CodeGen::genCodeForTreeSmpOpAsg (this=0x6b034, tree=0x6d290) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegenlegacy.cpp:12092
dotnet/coreclr#3 0xb696ed2e in CodeGen::genCodeForTreeSmpOp (this=0x6b034, tree=0x6d290, destReg=22527, bestReg=0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegenlegacy.cpp:9545
dotnet/coreclr#4 0xb695d428 in CodeGen::genCodeForTree (this=0x6b034, tree=0x6d290, destReg=22527, bestReg=0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegenlegacy.cpp:12547
dotnet/coreclr#5 0xb6977cb2 in CodeGen::genCodeForBBlist (this=0x6b034) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegenlegacy.cpp:13101
dotnet/coreclr#6 0xb6994262 in CodeGen::genGenerateCode (this=0x6b034, codePtr=0xbeffd480, nativeSizeOfCode=0xbeffd830) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/codegencommon.cpp:3026
dotnet/coreclr#7 0xb69a43c0 in Compiler::compCompile (this=0x697f8, methodCodePtr=0xbeffd480, methodCodeSize=0xbeffd830, compileFlags=0xbeffd484)
at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:3911
dotnet/coreclr#8 0xb69a621a in Compiler::compCompileHelper (this=0x697f8, classPtr=0xb5928a34, compHnd=0xbeffd930, methodInfo=0xbeffd88c, methodCodePtr=0xbeffd480, methodCodeSize=0xbeffd830,
compileFlags=0xbeffd484, instVerInfo=INSTVER_GENERIC_PASSED_VERIFICATION) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:4911
dotnet/coreclr#9 0xb69a51a4 in Compiler::compCompile(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*)::$_0::operator()(Compiler::compCompi
le(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*)::__JITParam*) const (this=0xbeffd19c, __JITpParam=0xbeffd1a0)
at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:4306
dotnet/coreclr#10 0xb69a504c in Compiler::compCompile (this=0x697f8, methodHnd=0xb5929910, classPtr=0xb5928a34, compHnd=0xbeffd930, methodInfo=0xbeffd88c, methodCodePtr=0xbeffd480, methodCodeSize=0xbeffd830,
compileFlags=0xbeffd484) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:4328
dotnet/coreclr#11 0xb69ad544 in jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::$_3::operator()(jitNativeCode(CORINF
O_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::__JITParam*) const::{lambda(jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO
_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::$_3::operator()(jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINF
O_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::__JITParam*) const::__JITParam*)#1}::operator()(jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METH
OD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::$_3::operator()(jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJI
T_FLAGS*, void*)::__JITParam*) const::__JITParam*) const (this=0xbeffd288, __JITpParam=0xbeffd28c) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:5522
dotnet/coreclr#12 0xb69a71f6 in jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::$_3::operator()(jitNativeCode(CORINF
O_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, CORJIT_FLAGS*, void*)::__JITParam*) const (this=0xbeffd350, __JITpParam=0xbeffd358)
at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:5551
dotnet/coreclr#13 0xb69a7052 in jitNativeCode (methodHnd=0xb5929910, classPtr=0xb5928a34, compHnd=0xbeffd930, methodInfo=0xbeffd88c, methodCodePtr=0xbeffd480, methodCodeSize=0xbeffd830, compileFlags=0xbeffd484,
inlineInfoPtr=0x0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/compiler.cpp:5553
dotnet/coreclr#14 0xb6872af0 in CILJit::compileMethod (this=0xb6d4d834 <operator new(unsigned int, CILJitSingletonAllocator const&)::CILJitBuff>, compHnd=0xbeffd930, methodInfo=0xbeffd88c, flags=1048596,
entryAddress=0xbeffd834, nativeSizeOfCode=0xbeffd830) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/jit/ee_il_dll.cpp:286
dotnet/coreclr#15 0xb632695e in invokeCompileMethodHelper (jitMgr=0x39a38, comp=0xbeffd930, info=0xbeffd88c, flags=1048596, flags2=0, nativeEntry=0xbeffd834, nativeSizeOfCode=0xbeffd830)
at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:11667
dotnet/coreclr#16 0xb6326a2a in invokeCompileMethod (jitMgr=0x39a38, comp=0xbeffd930, info=0xbeffd88c, flags=1048596, flags2=0, nativeEntry=0xbeffd834, nativeSizeOfCode=0xbeffd830)
at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:11722
dotnet/coreclr#17 0xb6326c4e in CallCompileMethodWithSEHWrapper(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, unsigned int, unsigned int, unsigned char**, unsigned int*, MethodDesc*)::$_4::operator()(CallCompileMe
thodWithSEHWrapper(EEJitManager*, CEEInfo*, CORINFO_METHOD_INFO*, unsigned int, unsigned int, unsigned char**, unsigned int*, MethodDesc*)::Param*) const (this=0xbeffd62c, pParam=0xbeffd630)
at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:11784
dotnet/coreclr#18 0xb6326b8c in CallCompileMethodWithSEHWrapper (jitMgr=0x39a38, comp=0xbeffd930, info=0xbeffd88c, flags=1048596, flags2=0, nativeEntry=0xbeffd834, nativeSizeOfCode=0xbeffd830, ftn=0xb5929910)
at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:11827
...
(224th invokation of genGCWriteBarrier generates 0xb1cf0952-095a above)
Looks like JIT/ARM gen of "lea" in this instance is fishy.
The code in question, making r0 ( = r3 + 4) to access 0x04,
0xb1cf093c: movw r3, dotnet/coreclr#2936 ; 0xb78
0xb1cf0940: movt r3, #45936 ; 0xb370
0xb1cf0944: ldr r3, [r3, #0]
(resulting r3 = 0)
is JIT-created
by getEmitter()->emitIns_R_C()
called by inst_RV_TT()
called by genCodeForTreeLeaf()
called by if (kind & GTK_LEAF)
in CodeGen::genCodeForTree()
Generating code for tree [000010] destReg = 0xb6d27000 bestReg = 0x0
called by CodeGen::genMakeIndAddrMode()
called by CodeGen::genMakeAddressable()
in codegenlegacy.cpp
.
The invalid indirect pointer 0xb3700b78
(*0xb5700b78 = 0
) is retrived at addr = (ssize_t)emitComp->info.compCompHnd->getFieldAddress(fldHnd, NULL);
in emitIns_R_C()
.
The code above is the first JIT-created code that hits 0xb3700b78
. (Added printf for all instruction emit functions for JIT debugging) although someone is calling getFieldAddress resulting in 0xb3700b78
a bit before using it, no one writes there.
It looks as if JIT forgot to setup static objects.
With a simpler test case that has the same effect on CoreCLR, it appears to be related with NullReference handling. The following simplified test case incurs the same JIT code that commands WriteBarrier to load 0x04
.
bash-3.2# ./corerun 143837.mod.exe
Starting 1...
GOT the exception 1
Starting 2...
Segmentation fault (core dumped)
bash-3.2# cat 143837.mod.cs
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
//
using System;
public class Test
{
public static int Main()
{
TestClass TC = null;
Console.WriteLine("Starting 1...");
try
{
Console.WriteLine(TC.strField);
} catch (NullReferenceException e)
{
Console.WriteLine("GOT the exception 1");
}
Console.WriteLine("Starting 2...");
try
{
TC.strField = "let's go";
TC.abcd = TC.abc = 0;
} catch (NullReferenceException e)
{
Console.WriteLine("GOT the exception 2");
}
return 100;
}
};
public class TestClass
{
public string strField;
public TestClass(String strIn)
{
strField = strIn;
}
public virtual string strProperty
{
get
{
return strField;
}
set
{
strField = value;
}
}
}
bash-3.2#
Anyway, regardless of the strField's offset in the TestClass, r0
is trying to load 0x04
(r3 + 4
).
Rebased with d0e1bcc4d881cb3757cf335a99a64b34d86d8e62 and the point of segmentation fault has changed. (probably, pointing at the same logical location, but, need more analysis..)
The currently segmentation-faulting JIT-generated code with the example code above is:
(segmentation faults after Starting 2...
)
0xb1ad3832: movs r0, r0
0xb1ad3834: movs r0, r0
0xb1ad3836: movs r0, r0
0xb1ad3838: push {r2, r6, lr}
0xb1ad383a: cbz r6, 0xb1ad3868
0xb1ad383c: stmdb sp!, {r4, r10, r11, lr}
0xb1ad3840: sub sp, dotnet/runtime#3868 ; 0x28
0xb1ad3842: add.w r11, sp, dotnet/coreclr#48 ; 0x30
0xb1ad3846: sub.w r2, r11, dotnet/runtime#3870 ; 0x2c
0xb1ad384a: movs r0, #0
0xb1ad384c: movs r1, #0
0xb1ad384e: stmia r2!, {r0, r1}
0xb1ad3850: stmia r2!, {r0, r1}
0xb1ad3852: stmia r2!, {r0, r1}
0xb1ad3854: str r0, [r2, #0]
0xb1ad3856: add r0, sp, dotnet/coreclr#56 ; 0x38
0xb1ad3858: str r0, [sp, dotnet/coreclr#36] ; 0x24
0xb1ad385a: movw r3, dotnet/coreclr#16220 ; 0x3f5c
0xb1ad385e: movt r3, #46482 ; 0xb592
0xb1ad3862: ldr r3, [r3, #0]
0xb1ad3864: cmp r3, #0
0xb1ad3866: beq.n 0xb1ad3872
0xb1ad3868: movw r3, #37605 ; 0x92e5
0xb1ad386c: movt r3, #46660 ; 0xb644
0xb1ad3870: blx r3
0xb1ad3872: nop
0xb1ad3874: movs r3, #0
0xb1ad3876: str r3, [sp, dotnet/coreclr#20]
0xb1ad3878: movw r3, dotnet/coreclr#9024 ; 0x2340
0xb1ad387c: movt r3, #45904 ; 0xb350
0xb1ad3880: ldr r0, [r3, #0]
0xb1ad3882: movw r3, dotnet/coreclr#12761 ; 0x31d9
0xb1ad3886: movt r3, #45485 ; 0xb1ad
0xb1ad388a: blx r3
0xb1ad388c: nop
0xb1ad388e: nop
0xb1ad3890: ldr r3, [sp, dotnet/coreclr#20]
=> 0xb1ad3892: ldr r0, [r3, dotnet/coreclr#4]
0xb1ad3894: movw r3, dotnet/coreclr#12761 ; 0x31d9
0xb1ad3898: movt r3, #45485 ; 0xb1ad
0xb1ad389c: blx r3
0xb1ad389e: nop
0xb1ad38b8: blx r3
(gdb) bt
#0 0xb1ad3892 in ?? ()
dotnet/coreclr#1 0xb1ad388c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info registers
r0 0x0 0
r1 0x0 0
r2 0xbeffec98 3204443288
r3 0x0 0
r4 0xb1ad3039 2980917305
r5 0xbefff300 3204444928
r6 0xbefff1e0 3204444640
r7 0xbeffece8 3204443368
r8 0xb6d2b000 3067260928
r9 0xef87ed00 4018662656
r10 0xb6d2b000 3067260928
r11 0xbeffecd8 3204443352
r12 0xb6d2b3b0 3067261872
sp 0xbeffeca8 0xbeffeca8
lr 0xb1ad388d -1314047859
pc 0xb1ad3892 0xb1ad3892
cpsr 0x600e0030 1611530288
(gdb)
The JIT-generated code is storing 0 at sp+20
and try to load +4
of that stored sp+20 (=0)
, which inevitably causes segmentation fault. This does not happen with x64.
@myungjoo I assume you are seeing the problem without gdb too, right? Since null reference exception is a hardware exception (SIGSEGV) that gets caught by gdb first and then if you let it continue, it will pass it to CoreCLR for handling. The JIT_WriteBarrier is one of a few places out of managed code where the SIGSEGV is converted to a managed exception (see IsIPInMarkedJitHelper function here: https://github.com/dotnet/coreclr/blob/master/src/vm/excep.cpp#L7279)
@janvorli Yes, I'm seeing segmentation fault without GDB as well. Anyway, if CoreCLR catches SIGSEGV of JITted code, I might be seeing something that shouldn't matter. I'll keep that fact in mind. Thanks.
That's how null references are detected (and also division by zero, etc.). In your case, could you try to set a breakpoint at HandleHardwareException and see if it gets called? It should happen in your case (after you hit the SIGSEGV first in the GDB and do "c", it should go there).
It is because at SIGSEGV, JIT_WriteBarrier
is not unwindable and VirtualUnwindToFirstManagedCallFrame
called by DispatchedManagedEception
fails to unwind up to the managed frame (one step deeper than JIT_WriteBarrier
).
A fix being tested. It appears that the whole asmhelpers.S may need a review/fix for the possibility of exception/signal handling. (other assembly functions that I've assume that they are not calling anyone else should be reconsidered.)
OK. the fix (adding PROLOG_PUSH
, PROLOG_STACK_STAVE
to the write barriers) is allowing to handle the secondary exceptions of 143837
. However, it exposes another "unhandled exception" in CoreCLR. The fix is going to be bigger than simply adding assembly macros.
Starting 1...
GOT the exception 1
Starting 2...
GOT the exception 2
Assert failure(PID 15489 [0x00003c81], Thread: 15489 [0x3c81]): Consistency check failed: FAILED: state.fFound
File: /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp Line: 363
Image: /sdcard/unittest/corerun
**** MessageBox invoked, title 'corerun - Assert Failure (PID 15489, Thread 15489/0x3c81)' ****
Consistency check failed: FAILED: state.fFound
/source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp, Line: 363
Abort - Kill program
Retry - Debug
Ignore - Keep running
Image:
/sdcard/unittest/corerun
********
Aborted (core dumped)
bash-3.2# cat 143837.mod.cs
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
//
using System;
public class Test
{
public static int Main()
{
TestClass TC = null;
Console.WriteLine("Starting 1...");
try
{
Console.WriteLine(TC.strField);
} catch (NullReferenceException e)
{
Console.WriteLine("GOT the exception 1");
}
Console.WriteLine("Starting 2...");
try
{
TC.strField = "let's go";
TC.abcd = TC.abc = 0;
} catch (NullReferenceException e)
{
Console.WriteLine("GOT the exception 2");
}
return 100;
}
};
public class TestClass
{
public int abc;
public int abcd;
public string strField;
public TestClass(String strIn)
{
strField = strIn;
}
public virtual string strProperty
{
get
{
return strField;
}
set
{
strField = value;
}
}
}
bash-3.2#
(the original 143837 shows the same behavior)
@myungjoo I wonder why the write barrier is not unwindable in its original form. The write barrier is a very performance sensitive thing and adding instructions just to make it unwindable seems unfortunate. Does it mean that the ARM unwinder is unable to unwind a simple function with no stack usage?
@janvorli Yes, PAL_VirtualUnwind() behaves erratic at JIT_WriteBarrier (PAL_VirtualUwind's return value = 0 / ContextPC = 0, which denotes that it has hit the bottom of the stack.)
If WriteBarrier is very performance sensitive and is called too frequently by JITted code, then, we need to embed the code block into the JIT code (inlining or bx/b), not making JIT to call (blx/bl) it, especially when it may generate SIGSEGV that requires unwinding.
ps. Testing some rough ideas that may allow libunwind to unwind without actually saving anything to the stack..
@myungjoo I think I can see why it is unable to unwind the function. It seems the problem is that the JIT_WriteBarrier is defined as follows:
LEAF_ENTRY JIT_WriteBarrier, _TEXT
.space (0x84)
LEAF_END_MARKED JIT_WriteBarrier, _TEXT
So when the assembler compiles it, it doesn't see any code and so it cannot create proper unwind data. Maybe we should put in just fake "bx lr" before the .space (and update the 0x84 accordingly). It would get overwritten by the real barrier code at runtime, but that code should have the same unwind semantics as "bx lr"
I thought similar things and tried to add the following in the .space()
area of patchedcode.S
:
1.
.pad #0
2.
.setfp sp, sp
3.
bx lr
along with .space()
realignment (bx lr in thumb = 0x2)
ALL fails with PAL_VirtualUnwind
.
Next try: .unwind_raw
Could it be a bug in the unwinder? It seems it should be able to unwind through simple bx lr, since it is as simple as setting the PC to LR.
GLIBC seems to do some tricks for Linux/ARM to let unwind work while not saving registers into the stack. I'll try to follow GLIBC's sysdeps/unix/sysv/linux/arm/sigrestorer.S.
Anyway, this doesn't seem to be a bug of libunwind because it tries to decode the exidx entry, which doesn't exist with the current code or is incorrect with my trials without actually saving things in the stack.
If we fail to make it unwind the barrier in a reasonable way, we have one fallback option left. We can wrap the PAL_VirtualUnwind in a wrapper function that would first check if the IP is in the JIT_WriteBarrier range and if it is, unwind it just by setting the PC to the LR without calling into the PAL. We have a function to do such unwind that we use for the managed code - Thread::VirtualUnwindLeafCallFrame and that you can use. Or now that I think about it, the only place where we will need to unwind the JIT_WriteBarrier is the Thread::VirtualUnwindToFirstManagedCallFrame and it could only be the first frame. So we can add this piece of code ifdefed for ARM right before the loop:
if (IsIPInMarkedJitHelper(uControlPC))
{
return VirtualUnwindLeafCallFrame(pContext);
}
@myungjoo I am sorry, it should not be return, but rather this:
if (IsIPInMarkedJitHelper(uControlPC))
{
uControlPC = VirtualUnwindLeafCallFrame(pContext);
}
I'll try to follow GLIBC's sysdeps/unix/sysv/linux/arm/sigrestorer.S.
Nah.. that doesn't work. It (using cfi) actually saves registers into the stack.
Unless we have JIT_WriteBarrier (and its neighbors) embedded into the JIT-code or let the JIT-code to use b/bx, or sacrifice the performance (use of push {r7, lr}
), the workaround @janvorli suggested might be the answer.
if (IsIPInMarkedJitHelper(uControlPC)) { uControlPC = VirtualUnwindLeafCallFrame(pContext); }
I'll try to make an exception in exception handling for JIT_WriteBarrier, JIT_CheckedWriteBarrier, JIT_ByRefWriteBarrier as @janvorli has suggested.
Anyway, that expression should be in the loop, not before the loop. JIT_WriteBarrier
is the last native frame (the closest to the first managed), not the first native frame (the farthest from the first managed).
Or what about putting this in PAL_VirtualUnwind()'s heading to prevent calling libunwind-arm for JIT_WriteBarrier?
I have forgotten that when handling hardware exceptions, we have to unwind all the way back from the DispatchManagedException through the signal trampoline before we get to the actual failing function. So you are right.
mzx@kohaku:/source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm$ git diff stackwalk.cpp
diff --git a/src/vm/stackwalk.cpp b/src/vm/stackwalk.cpp
index 3c2dfd7..2e3b876 100644
--- a/src/vm/stackwalk.cpp
+++ b/src/vm/stackwalk.cpp
@@ -791,6 +791,16 @@ UINT_PTR Thread::VirtualUnwindToFirstManagedCallFrame(T_CONTEXT* pContext)
{
break;
}
+#ifdef _ARM_
+ if (IsIPInMarkedJitHelper(uControlPc))
+ {
+ // JIT_WriteBarrir does not support libunwind.
+ printf("%s:%d from PC %8.8x\n", __func__, __LINE__, uControlPc);
+ uControlPc = VirtualUnwindCallFrame(pContext);
+ printf("%s:%d to PC %8.8x\n", __func__, __LINE__, uControlPc);
+ return TRUE;
+ }
+#endif // _ARM_
#endif // !FEATURE_PAL
}
Starting 1...
GOT the exception 1
Starting 2...
VirtualUnwindToFirstManagedCallFrame:798 from PC b64e47b4
Segmentation fault (core dumped)
bash-3.2#
0xb64e47b4
is the JIT_WriteBarrier
and VirtualUnwindCallFrame cannot unwind it.
GDB says:
VirtualUnwindToFirstManagedCallFrame:798 from PC b65097b4
Program received signal SIGSEGV, Segmentation fault.
0xb6326e1c in EECodeInfo::GetFunctionEntry (this=0xbeffdca0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:13711
13711 #define kRBP 5
(gdb) bt
#0 0xb6326e1c in EECodeInfo::GetFunctionEntry (this=0xbeffdca0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/jitinterface.cpp:13711
dotnet/coreclr#1 0xb637186a in Thread::VirtualUnwindCallFrame (pContext=0xbeffdda0, pContextPointers=0x0, pCodeInfo=0x0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/stackwalk.cpp:611
dotnet/coreclr#2 0xb6371a98 in Thread::VirtualUnwindToFirstManagedCallFrame (pContext=0xbeffdda0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/stackwalk.cpp:799
dotnet/coreclr#3 0xb64f6244 in DispatchManagedException (ex=...) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:4651
dotnet/coreclr#4 0xb64ee0e0 in HandleHardwareException (ex=0xbeffe3e0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:5118
dotnet/coreclr#5 0xb6a8e8b2 in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:243
dotnet/coreclr#6 0xb6a904ea in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#7 0xb6a8f90c in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#8 <signal handler called>
dotnet/coreclr#9 0xb65097b4 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#10 0xb1cf08d6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p/x m_pJM
$1 = 0x0
Looks like I need to make another unwinding function for JIT Helpers. Anyway, I'll continue this issue tomorrow. :)
You have used wrong function :-). It was 'VirtualUnwindLeafCallFrame' in my example.
VirtualUnwindLeafCallFrame
does not work because the context has no valid LR value:
VirtualUnwindToFirstManagedCallFrame:798 from PC b64273b4
VirtualUnwindToFirstManagedCallFrame:800 to PC b64273b4
(b64273b4 is in JIT_WriteBarrier and its 'parent' is shown as if itself because LR = PC, which is "intentional" in libunwind-arm; it does not rely on LR value of cursor, which is accessible with unw_get_reg()
)
The resulting Context after calling VirtualUnwindLeafCallFrame is:
$8 = {ContextFlags = 0x20000b, R0 = 0xbeffdda0, R1 = 0xa766f300, R2 = 0xa766f300, R3 = 0xa766f300, R4 = 0xbeffe938, R5 = 0xb, R6 = 0xa766f300, R7 = 0xbeffe37c, R8 = 0xb6d2c000, R9 = 0xa766f300,
R10 = 0xb6d2c000, R11 = 0xbeffecd8, R12 = 0x0, Sp = 0xbeffdf50, Lr = 0xb64efcd9, Pc = 0xb64efcd8,
...
Even the first context given to VirtualUnwindToFirstManagedCallFrame has LR invalid (LR register is not going to be recovered after the first bl/blx
. It stays in the stack.):
Breakpoint 4, Thread::VirtualUnwindToFirstManagedCallFrame (pContext=0xafe42ffc) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/stackwalk.cpp:763
763 {
(gdb) n
772 PCODE uControlPc = GetIP(pContext);
(gdb) p/x *pContext
$9 = {ContextFlags = 0x20000b, R0 = 0xbeffdda0, R1 = 0x29c3d700, R2 = 0x29c3d700, R3 = 0x29c3d700, R4 = 0xbeffe938, R5 = 0xb, R6 = 0x29c3d700, R7 = 0xbeffdf48, R8 = 0xb6d2c000, R9 = 0x29c3d700,
R10 = 0xb6d2c000, R11 = 0xbeffecd8, R12 = 0x0, Sp = 0xbeffdd08, Lr = 0xb64f7df9, Pc = 0xb64f7df9,
...
(gdb) bt
#0 Thread::VirtualUnwindToFirstManagedCallFrame (pContext=0xbeffdda0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/stackwalk.cpp:772
dotnet/coreclr#1 0xb64f7e3c in DispatchManagedException (ex=...) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:4651
dotnet/coreclr#2 0xb64efcd8 in HandleHardwareException (ex=0xbeffe3e0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:5118
dotnet/coreclr#3 0xb6a8fde2 in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:243
dotnet/coreclr#4 0xb6a91a1a in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#5 0xb6a90e3c in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#6 <signal handler called>
dotnet/coreclr#7 0xb1adb892 in ?? ()
dotnet/coreclr#8 0xb1adb88c in ?? ()
Normally, functions without calls (bl/blx
) like JIT_WriteBarrier may keep lr
valid without saving registers into the stack.
However, with an hardware exception (SIGSEGV in this case), the exception handler (frame 6 in the above case) saves and restores all (r0 - r15). So, without proper unwinding information saving at JIT_WriteBarrier, we need to read information (lr
) saved by #6 <signal handler called>
for JIT_WriteBarrir #7
.
Libunwind-arm does not support such backtracking the backtracks. CoreCLR does not either.
We seem to let the loop unwind-the-unwind when we meet JITHelper Functions.
(need to study glibc for a while. assembly of <signal handler called>
is in a very unfamiliar format)
Ah.. this seems to be yet another issue from libunwind-arm. I can't sure if libunwind-arm's unw_handle_signal_frame
, which is supposed to unwind a signal frame, work or not, but it is not going there (unw_is_signal_frame
is not working) with #6 <signal handler called>
.
The op code to be compared if it is a signal handler from corerun + glibc is 0x07adf04f
(thumb/32b mov.w r7 0xad), which is not recognized by libunwind-arm's Gis_signal_frame.c preprocessor definitions.
I'll verify if correcting libunwind-arm works or not.
current:
1. CCLR-context -> unw_cursor
2. libunwind (unw_cursor)
3. unw_cursor->CCLR-context
4. goto 1 until X
to be
1. CCLR-context -> unw_cursor
2. libunwind (unw_cursor)
3. goto 2 until X
4. unw_cursor->CCLR-context
ps. there is a report that this is related with other failed test cases.
@myungjoo if the problem is related to the problem crossing the signal trampoline, I think we could make a change that I was considering for some time. If we used the exception context in the DispatchManagedException for the case of hardware exceptions instead of starting at the current frame, then we would not need to cross the trampoline at all. Could you please try the following change to make it work that way?
isHardwareException = false;
inside the catch
// Unwind the context to the first managed frame
CONTEXT frameContext;
if (isHardwareException)
{
frameContext = ex.ContextRecord;
frameContext.ContextFlags |= CONTEXT_EXCEPTION_ACTIVE;
if (IsIPInMarkedJitHelper(GetIP(&frameContext)))
{
PAL_VirtualUnwind(&frameContext, NULL);
}
}
else
{
RtlCaptureContext(&frameContext);
UINT_PTR currentSP = GetSP(&frameContext);
if (Thread::VirtualUnwindToFirstManagedCallFrame(&frameContext) == 0)
{
// There are no managed frames on the stack, so we need to continue unwinding using C++ exception
// handling
break;
}
UINT_PTR firstManagedFrameSP = GetSP(&frameContext);
// Check if there is any exception holder in the skipped frames. If there is one, we need to unwind them
// using the C++ handling. This is a special case when the UNINSTALL_MANAGED_EXCEPTION_DISPATCHER was
// not at the managed to native boundary.
if (NativeExceptionHolderBase::FindNextHolder(nullptr, (void*)currentSP, (void*)firstManagedFrameSP) != nullptr)
{
break;
}
}
If it works as I expect, we can have it that way for all Unix platforms and save few unwind calls. I would then clean it up and create a PR for it.
@jkotas I have modified libunwind to correctly identify hardware exception handler and modified PAL_VirtualUnwind to keep calling unw_step()
reusing unw_cursor_t cursor
if the cursor
is at <signal handler called>
or at JIT_WriteBarrier
.
Then, it successfully passes JIT_WriteBarrier and goes to the JIT stack.
So, the problem in CoreCLR-ARM is that we are losing some context in cursor by converting back-and-forth between libunwind-unw_cursor_t and coreclr-CONTEXT.
How about including unw_cursor_t
in CONTEXT struct if it is Linux && ARM and use the unw_cursor_t if it is "valid (not null)"? It removes the need for conversions inside a unwind loop as well.
The (the condition is not correct. it works only for the unit test 143837 and demonstration only) tested patch is:
diff --git a/src/pal/src/exception/seh-unwind.cpp b/src/pal/src/exception/seh-unwind.cpp
index f6dd1d9..89ecd44 100644
--- a/src/pal/src/exception/seh-unwind.cpp
+++ b/src/pal/src/exception/seh-unwind.cpp
@@ -289,7 +289,22 @@ BOOL PAL_VirtualUnwind(CONTEXT *context, KNONVOLATILE_CONTEXT_POINTERS *contextP
// in the context flags.
if (unw_is_signal_frame(&cursor) > 0)
{
- context->ContextFlags |= CONTEXT_EXCEPTION_ACTIVE;
+ printf("%s:%d SKIPPING signal frame (double unwind) PC\n", __func__, __LINE__);
+ st = unw_step(&cursor); // unwind again to skip this.
+ UnwindContextToWinContext(&cursor, context);
+ printf("Skipped to %8.8x PC\n", (unsigned int) CONTEXTGetPC(context));
+ st = unw_step(&cursor); // unwind again to skip this.
+ UnwindContextToWinContext(&cursor, context);
+ printf("Skipped again to %8.8x PC\n", (unsigned int) CONTEXTGetPC(context));
+ if (st < 0)
+ {
+ return FALSE;
+ }
+ if (unw_is_signal_frame(&cursor) > 0) {
+ context->ContextFlags |= CONTEXT_EXCEPTION_ACTIVE;
+ } else {
+ context->ContextFlags &= ~CONTEXT_EXCEPTION_ACTIVE;
+ }
}
else
{
Or.. with less modification on the logic (reusing unw_cursor_t
only when converting seems not working), we may make:
BOOL PAL_VirtualUnwind(CONTEXT *context, KNONVOLATILE_CONTEXT_POINTERS *contextPointers, int repeat = 0)
and let PAL_VirtualUnwind
call unw_step()
repeat + 1
times.
Then, at the loop of VirtualUnwindToFirstManagedCallFrame
, we may workaround it with help from IsIPInMarkedJitHelper()
(once it is detected to be MarkedJitHelper, backtrack the backtracking of the context and double-unwind.)
I'll test this method too.
OK. allowing PAL_VirtualUnwind
to unwind repeatedly (PAL_VirtualUnwind(context, NULL, repeat_number);
) solves the issue. I'll start commiting after more testing:
How stackwalk.cpp
changes:
diff --git a/src/vm/stackwalk.cpp b/src/vm/stackwalk.cpp
index 3c2dfd7..55f2f89 100644
--- a/src/vm/stackwalk.cpp
+++ b/src/vm/stackwalk.cpp
@@ -770,6 +770,9 @@ UINT_PTR Thread::VirtualUnwindToFirstManagedCallFrame(T_CONTEXT* pContext)
CONTRACTL_END;
PCODE uControlPc = GetIP(pContext);
+#ifdef _ARM_
+ T_CONTEXT pContextCallee;
+#endif
// unwind out of this function and out of our caller to
// get our caller's PSP, or our caller's caller's SP.
@@ -778,7 +781,10 @@ UINT_PTR Thread::VirtualUnwindToFirstManagedCallFrame(T_CONTEXT* pContext)
#ifndef FEATURE_PAL
uControlPc = VirtualUnwindCallFrame(pContext);
#else // !FEATURE_PAL
- BOOL success = PAL_VirtualUnwind(pContext, NULL);
+#ifdef _ARM_
+ pContextCallee = *pContext;
+#endif
+ BOOL success = PAL_VirtualUnwind(pContext, NULL, 0);
if (!success)
{
_ASSERTE(!"Thread::VirtualUnwindToFirstManagedCallFrame: PAL_VirtualUnwind failed");
@@ -791,8 +797,35 @@ UINT_PTR Thread::VirtualUnwindToFirstManagedCallFrame(T_CONTEXT* pContext)
{
break;
}
+#ifdef _ARM_
+ if (IsIPInMarkedJitHelper(uControlPc))
+ {
+ // JIT_WriteBarrir does not support libunwind or VirtualUnwindLeafCallFrame
+ // It did not save unwind information while we got deeper by signal handler
+ // We need to unwind manually from its callee. damn...
+ *pContext = pContextCallee;
+ printf("%s:%d un-unwind PC %8.8x \n", __func__, __LINE__, uControlPc);
+ uControlPc = GetIP(pContext);
+ printf("to %8.8x \n", uControlPc);
+
+ //uControlPc = VirtualUnwindLeafCallFrame(pContext);
+
+ BOOL success = PAL_VirtualUnwind(pContext, NULL, 1);
+ if (!success)
+ {
+ _ASSERTE(!"Thread::VirtualUnwindToFirstManagedCallFrame: PAL_VirtualUnwind failed");
+ EEPOLICY_HANDLE_FATAL_ERROR(COR_E_EXECUTIONENGINE);
+ }
+
+ uControlPc = GetIP(pContext);
+
+ // Unwind based on the CFI information of GLIBC / sigresgorer
+ printf("%s:%d to PC %8.8x\n", __func__, __LINE__, uControlPc);
+ }
+#endif // _ARM_
#endif // !FEATURE_PAL
}
+ printf("%s:%d Returning PC %8.8x\n", __func__, __LINE__, uControlPc);
return uControlPc;
}
diff --git a/src/pal/src/exception/seh-unwind.cpp b/src/pal/src/exception/seh-unwind.cpp
index f6dd1d9..987f83e 100644
--- a/src/pal/src/exception/seh-unwind.cpp
+++ b/src/pal/src/exception/seh-unwind.cpp
@@ -223,7 +223,7 @@ static void GetContextPointers(unw_cursor_t *cursor, unw_context_t *unwContext,
#endif
}
-BOOL PAL_VirtualUnwind(CONTEXT *context, KNONVOLATILE_CONTEXT_POINTERS *contextPointers)
+BOOL PAL_VirtualUnwind(CONTEXT *context, KNONVOLATILE_CONTEXT_POINTERS *contextPointers, int repeat)
{
int st;
unw_context_t unwContext;
@@ -278,18 +278,39 @@ BOOL PAL_VirtualUnwind(CONTEXT *context, KNONVOLATILE_CONTEXT_POINTERS *contextP
curPc = CONTEXTGetPC(context);
#endif
- st = unw_step(&cursor);
- if (st < 0)
- {
- return FALSE;
- }
+ do {
+ st = unw_step(&cursor);
+ if (st < 0)
+ {
+ return FALSE;
+ }
+ } while ((repeat--) > 0);
Not that out of GDB, it still gets segmentation faults due to excessive printf calls.
Hmm.. even without all clutters (printfs), it does not work without GDB (sometimes it works and returns 100)
bash-3.2# ./corerun 143837.exe
Segmentation fault (core dumped)
bash-3.2# ./corerun 143837.exe
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
Pass
bash-3.2# echo $?
100
bash-3.2# ./corerun 143837.exe
Segmentation fault (core dumped)
bash-3.2# ./corerun 143837.exe
Segmentation fault (core dumped)
bash-3.2#
bash-3.2# ./corerun 143837.exe
Segmentation fault (core dumped)
bash-3.2# ./corerun 143837.exe
Segmentation fault (core dumped)
In GDB:
(gdb) run
Starting program: /sdcard/unittest/corerun 143837.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
[New Thread 0xb6152450 (LWP 8212)]
[New Thread 0xb58f2450 (LWP 8213)]
[New Thread 0xb4eff450 (LWP 8214)]
[New Thread 0xb24fe450 (LWP 8215)]
Program received signal SIGSEGV, Segmentation fault.
0xb650b0d4 in JIT_WriteBarrier () at patchedcode.S:413
413 patchedcode.S: No such file or directory.
(gdb)
(gdb) c
Continuing.
caught expected exception System.NullReferenceException
Program received signal SIGSEGV, Segmentation fault.
0xb1adbb0e in ?? ()
(gdb) c
Continuing.
caught expected exception System.NullReferenceException
Program received signal SIGSEGV, Segmentation fault.
0xb1adba76 in ?? ()
(gdb) c
Continuing.
caught expected exception System.NullReferenceException
Program received signal SIGSEGV, Segmentation fault.
0xb650b0d4 in JIT_WriteBarrier () at patchedcode.S:413
413 in patchedcode.S
(gdb) c
Continuing.
caught expected exception System.NullReferenceException
Program received signal SIGSEGV, Segmentation fault.
0xb1adeb96 in ?? ()
(gdb) c
Continuing.
caught expected exception System.NullReferenceException
Program received signal SIGSEGV, Segmentation fault.
0xb1adeafe in ?? ()
(gdb) c
Continuing.
caught expected exception System.NullReferenceException
Pass
[Thread 0xb24fe450 (LWP 8215) exited]
[Thread 0xb4eff450 (LWP 8214) exited]
[Thread 0xb58f2450 (LWP 8213) exited]
[Thread 0xb6152450 (LWP 8212) exited]
[Inferior 1 (process 8205) exited with code 0144]
(gdb) c
The program is not being run.
(gdb)
The current guess on "Working-with-GDB. Not-working-without-GDB" is that we might be using asynch-signal UNsafe functions in the context triggered by SIGSEGV from JIT_WriteBarrier.
@myungjoo Could you please try the solution I have suggested? You can give it a quick try first, just by adding this to the beginning of the "try" block of the DispatchManagedException after the CONTEXT frameContext:
frameContext = ex.ContextRecord;
if (IsIPInMarkedJitHelper(GetIP(&frameContext)))
{
frameContext.ContextFlags |= CONTEXT_EXCEPTION_ACTIVE;
PAL_VirtualUnwind(&frameContext, NULL);
<<<< Set a breakpoint here >>>>
}
When it hits the breakpoint, see if the frameContext contains a correctly unwound context.
I also have a question - in case of the hardware exception - is the LR register value in the ex.ContextRecord valid?
@janvorli Your suggestion of https://github.com/dotnet/coreclr/issues/4859#issuecomment-219689649 (https://github.com/myungjoo/coreclr/commit/f9bd93d1c126c880712c24b66211850c270f49dc) is working as mine (https://github.com/myungjoo/coreclr/commit/3591fd67fd78a258cc79bf1f0fb27a7125df1497), even the unwanted effect is the same.
It mostly fails without GDB and it (always) passes with GDB.
bash-3.2# ./corerun 143837.exe
Segmentation fault (core dumped)
bash-3.2# ./corerun 143837.mod.exe
Starting 1...
GOT the exception 1
Starting 2...
Segmentation fault (core dumped)
bash-3.2# ./corerun 143837.exe
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
caught expected exception System.NullReferenceException
Pass
bash-3.2# ./corerun 143837.exe
Segmentation fault (core dumped)
bash-3.2#
And it shows the correct Lr:
(gdb) break src/vm/exceptionhandling.cpp:4648
Breakpoint 1 at 0xb64f7e88: file /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp, line 4648.
(gdb) c
Continuing.
Breakpoint 1, DispatchManagedException (ex=..., isHardwareException=true) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:4648
4648 frameContext = ex.ContextRecord;
(gdb) n
4649 frameContext.ContextFlags |= CONTEXT_EXCEPTION_ACTIVE;
(gdb) p/x frameContext
$1 = {ContextFlags = 0x200007, R0 = 0x4, R1 = 0xb250d86c, R2 = 0xbeffec98, R3 = 0xb650b311, R4 = 0xb1adb039, R5 = 0xbefff300, R6 = 0xbefff1e0, R7 = 0xbeffece8, R8 = 0xb6d2c000, R9 = 0xefca5100,
R10 = 0xb6d2c000, R11 = 0xbeffecd8, R12 = 0xb6d2c3b0, Sp = 0xbeffeca8, Lr = 0xb1adb8d7, Pc = 0xb650b314, Cpsr = 0xb0030, Fpscr = 0x80000010, Padding = 0xbeffea78, {Q = {{Low = 0x20007200650076,
High = 0x63006e00450038}, {Low = 0x6e00690064006f, High = 0x650047002e0067}, {Low = 0x74007900420074, High = 0x20005d00730065}, {Low = 0x65007000780045, High = 0x64006500740063}, {Low = 0x0,
High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x4032efc92935df92, High = 0x4059000000000000}, {Low = 0x3ff4cccccccccccd, High = 0x0}, {
Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}, {Low = 0x0, High = 0x0}}, D = {0x20007200650076, 0x63006e00450038,
0x6e00690064006f, 0x650047002e0067, 0x74007900420074, 0x20005d00730065, 0x65007000780045, 0x64006500740063, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x4032efc92935df92, 0x4059000000000000,
0x3ff4cccccccccccd, 0x0 <repeats 13 times>}, S = {0x650076, 0x200072, 0x450038, 0x63006e, 0x64006f, 0x6e0069, 0x2e0067, 0x650047, 0x420074, 0x740079, 0x730065, 0x20005d, 0x780045, 0x650070,
0x740063, 0x640065, 0x0 <repeats 16 times>}}, Bvr = {0xbeffeab0, 0xbefff42c, 0x1, 0xbeffea78, 0x0, 0xffe8b4, 0x0, 0x1047360}, Bcr = {0x47360, 0xbeffeaac, 0xbeffeb74, 0x0, 0xb64bd389,
0xb250d988, 0xbeffeae0, 0xffeac0}, Wvr = {0xbeffeab0}, Wcr = {0x0}, Padding2 = {0x47360, 0xb250d988}}
(gdb) bt
#0 DispatchManagedException (ex=..., isHardwareException=true) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:4649
dotnet/coreclr#1 0xb64efdf6 in HandleHardwareException (ex=0xbeffe3e0) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/vm/exceptionhandling.cpp:5114
dotnet/coreclr#2 0xb6a8fd3a in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:243
dotnet/coreclr#3 0xb6a91972 in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#4 0xb6a90d94 in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#5 <signal handler called>
dotnet/coreclr#6 0xb650b314 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#7 0xb1adb8d6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
ps. I'd need to test without libunwind-arm signal-handler patch later as well.
Syscalls from the SIGSEGV handler of coreclr-arm:
Potentially dangerous behaviors in sigsegv_handler():
(gdb) c
Continuing.
GOT the exception 1
Starting 2...
Program received signal SIGSEGV, Segmentation fault.
0xb650afd4 in JIT_WriteBarrier () at patchedcode.S:413
413 patchedcode.S: No such file or directory.
(gdb) break printf
Breakpoint 5 at 0xb6d8ba12: file printf.c, line 28.
(gdb) break malloc
Breakpoint 6 at 0xb6da7e88: malloc. (2 locations)
(gdb) c
Continuing.
Breakpoint 6, __GI___libc_malloc (bytes=360) at malloc.c:2875
2875 {
(gdb) bt
#0 __GI___libc_malloc (bytes=360) at malloc.c:2875
dotnet/coreclr#1 0xb6d9b2e8 in __fopen_internal (filename=0xb6faa128 "/proc/self/maps", mode=0xb6faa124 "rce", is32=1) at iofopen.c:73
dotnet/coreclr#2 0xb6d9b34e in _IO_new_fopen (filename=<optimized out>, mode=<optimized out>) at iofopen.c:103
dotnet/coreclr#3 0xb6fa2a2e in pthread_getattr_np (thread_id=3070206544, attr=0xbeffe358) at pthread_getattr_np.c:77
dotnet/coreclr#4 0xb6b36614 in CorUnix::CPalThread::GetStackLimit () at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/thread/thread.cpp:2674
dotnet/coreclr#5 0xb6a8f9a6 in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:229
dotnet/coreclr#6 0xb6a91632 in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#7 0xb6a90a54 in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#8 <signal handler called>
dotnet/coreclr#9 0xb650afd4 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#10 0xb1cf08d6 in ?? ()
...
Breakpoint 6, __GI___libc_malloc (bytes=120) at malloc.c:2875
2875 {
(gdb) bt
#0 __GI___libc_malloc (bytes=120) at malloc.c:2875
dotnet/coreclr#1 0xb6d9bb94 in _IO_getdelim (lineptr=lineptr@entry=0xbeffe290, n=n@entry=0xbeffe294, delimiter=delimiter@entry=10, fp=fp@entry=0xc9dd8) at iogetdelim.c:66
dotnet/coreclr#2 0xb6fa2a8a in pthread_getattr_np (thread_id=3070206544, attr=0xbeffe358) at pthread_getattr_np.c:112
dotnet/coreclr#3 0xb6b36614 in CorUnix::CPalThread::GetStackLimit () at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/thread/thread.cpp:2674
dotnet/coreclr#4 0xb6a8f9a6 in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:229
dotnet/coreclr#5 0xb6a91632 in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#6 0xb6a90a54 in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#7 <signal handler called>
dotnet/coreclr#8 0xb650afd4 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#9 0xb1cf08d6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
Breakpoint 6, __GI___libc_malloc (bytes=bytes@entry=32) at malloc.c:2875
2875 {
(gdb) bt
#0 __GI___libc_malloc (bytes=bytes@entry=32) at malloc.c:2875
dotnet/coreclr#1 0xb6da859e in __GI___libc_realloc (oldmem=0x0, bytes=32) at malloc.c:2971
dotnet/coreclr#2 0xb6fa29b6 in pthread_getattr_np (thread_id=3070206544, attr=0xbeffe358) at pthread_getattr_np.c:163
dotnet/coreclr#3 0xb6b36614 in CorUnix::CPalThread::GetStackLimit () at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/thread/thread.cpp:2674
dotnet/coreclr#4 0xb6a8f9a6 in SEHProcessException (pointers=0xbeffe670) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/seh.cpp:229
dotnet/coreclr#5 0xb6a91632 in common_signal_handler (code=11, siginfo=0xbeffe938, sigcontext=0xbeffe9b8, numParams=2) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:597
dotnet/coreclr#6 0xb6a90a54 in sigsegv_handler (code=11, siginfo=0xbeffe938, context=0xbeffe9b8) at /source/tizen_3.0/dotnet/coreclr_upstreaming/src/pal/src/exception/signal.cpp:266
dotnet/coreclr#7 <signal handler called>
dotnet/coreclr#8 0xb650afd4 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#9 0xb1cf08d6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
@myungjoo there is no problem in calling the unsafe syscalls, malloc and other function from the syscall handler as long as we can guarantee that the faulting code was running in the jitted code or in the few helpers. I have recently reviewed and fixed the hardware exception handling path to get rid of all such calls until we know we were running the safe code. But I could have missed something.
@janvori hmm.. then, I should search for other possibilities as well.
Anyway, in common_signal_handler()
, in case of error (not running on top of GDB), it returns from SEHProcessException()
and goes to PROCNotifyProcessShutdown()
of sigsegv_handler()
.
It is going to consume more time because I cannot use GDB for the bug that appears only when GDB is not used. :(
Without running on top of GDB, the if (g_safeExceptionCheckFunction(pointers->ContextRecord, pointers->ExceptionRecord))
condition of SEHProcessException()
sometimes becomes false
, which makes it unable to handle the SIGSEGV.
It seems that JIT/Regression/CLR-x86-JIT/V1-M09.5-PDC/b16335/b16335/b16335
testcase failure is also related with this issue:
(gdb) r
Starting program: /mnt/dotnet-latest/Linux.arm.Debug/corerun -c dotnet-latest/Linux.arm.Debug/ ut/JIT/Regression/CLR-x86-JIT/V1-M09.5-PDC/b16335/b16335/b16335.exe
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Program received signal SIGSEGV, Segmentation fault.
0x7676ed64 in JIT_WriteBarrier () at patchedcode.S:413
413 patchedcode.S: No such file or directory.
(gdb) bt
#0 0x7676ed64 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#1 0x71dda4e6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) bt
#0 0x7676ed64 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#1 0x71dda4e6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x763f2a62 in ?? () from /lib/arm-linux-gnueabihf/libunwind-arm.so.8
(gdb) bt
#0 0x763f2a62 in ?? () from /lib/arm-linux-gnueabihf/libunwind-arm.so.8
dotnet/coreclr#1 0x763f350c in _Uarm_step () from /lib/arm-linux-gnueabihf/libunwind-arm.so.8
dotnet/coreclr#2 0x76b184fe in PAL_VirtualUnwind (context=0x7effd9c0, contextPointers=0x0) at /home/parjong/projects/toolchains/coreclr/src/pal/src/exception/seh-unwind.cpp:281
dotnet/coreclr#3 0x765d37bc in Thread::VirtualUnwindToFirstManagedCallFrame (pContext=0x7effd9c0) at /home/parjong/projects/toolchains/coreclr/src/vm/stackwalk.cpp:781
dotnet/coreclr#4 0x7675b980 in DispatchManagedException (ex=...) at /home/parjong/projects/toolchains/coreclr/src/vm/exceptionhandling.cpp:4649
dotnet/coreclr#5 0x767538e4 in HandleHardwareException (ex=0x7effdfe8) at /home/parjong/projects/toolchains/coreclr/src/vm/exceptionhandling.cpp:5103
dotnet/coreclr#6 0x76b181ee in SEHProcessException (pointers=0x7effe278) at /home/parjong/projects/toolchains/coreclr/src/pal/src/exception/seh.cpp:243
dotnet/coreclr#7 0x76b19ec6 in common_signal_handler (code=11, siginfo=0x7effe570, sigcontext=0x7effe5f0, numParams=2) at /home/parjong/projects/toolchains/coreclr/src/pal/src/exception/signal.cpp:617
dotnet/coreclr#8 0x76b19284 in sigsegv_handler (code=11, siginfo=0x7effe570, context=0x7effe5f0) at /home/parjong/projects/toolchains/coreclr/src/pal/src/exception/signal.cpp:285
dotnet/coreclr#9 0x71a1a9d2 in sigsegv_handler (code=11, siginfo=0x7effe570, context=0x7effe5f0) at /home/parjong/projects/toolchains/coreclr/src/pal/src/exception/signal.cpp:290
dotnet/coreclr#10 <signal handler called>
dotnet/coreclr#11 0x7676ed64 in JIT_WriteBarrier () at patchedcode.S:413
dotnet/coreclr#12 0x71dda4e6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
@parjong If SIGSEGV is coming from libunwind, it is not the bug of this one although it might only mean that you don't have the libunwind fixes included. (will try to push libunwind to merge the patches soon again along with additional libunwind fixes)
@janvorli IsSafeToHandleHardwareException is behaving strangely.
With GDB, it always thinks that the corresponding exception is coming from JitHelper (IsIPInMarkedJitHelper()
is true
)
Without GDB, it often (more than 50%) thinks that it is `false'.
I'll verify if the input PC is really JitHelper (write barrier) or not when it is not on GDB (with GDB, it is always).
OK. Looks like I found what is causing it. With GDB, JIT_WriteBarrier has dmb sy
before str r1, [r0, #0]
offsetting +4 and without GDB, str r1, [r0, #0]
is the first instruction and the offset is 0, which confuses IsIPInMarkedJitHelper
. I'll write and show a fix soon.
The constants used by IsIPInMarkedJitHelper
are pointing to THUMB instruction, which has the least bit set to denote that they are THUMB. However, their real locations do not have their least bit set.
For example, if the real location is at 100, the location constant is 101, which makes the condition statement of CHECK_RANGE
errornous. However, with GDB, JIT thinks it has additional threads and emits dmb
, which makes the errornous condition work ( +4 > +1).
To be safe, for all ARM/THUMB PC and values compared with PC should reset its least bit. (they are aligned by 16bit anyway)
@parjong note that, after this coreclr fix, libunwind-arm without any fixes suggested by me works for coreclr anyway.
The fixes for libunwind-arm are valid, but after the fix for coreclr, we do not visit (or aren't affected by) such corner cases of libunwind-arm.
The CoreCLR Unittest of
JIT/jit64/gc/regress/vswhidbey/143837
fails at ARM/Linux:I'll update as soon as some useful data comes out.