dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[ARM] Intermittent segfaults in JIT/Methodical/cctor/misc/threads1_cs_r #8391

Closed mskvortsov closed 4 years ago

mskvortsov commented 7 years ago

One of the possible stack traces that gdb shows after loading a core dump:

$ gdb clr-debug/corerun core
Reading symbols from clr-debug/corerun...done.
[New LWP 22222]
[New LWP 22214]
[New LWP 22216]
[New LWP 22218]
[New LWP 22220]
[New LWP 22215]
[New LWP 22219]
[New LWP 22223]
[New LWP 22221]
[New LWP 22217]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `clr-debug/corerun tests-release/JIT/Methodical/cctor/misc/threads1_cs_r/threads'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000 in ?? ()
[Current thread is 1 (Thread 0xadbff450 (LWP 22222))]
(gdb) bt
#0  0x00000000 in ?? ()
#1  0xb650a34c in DomainLocalModule::GetPrecomputedNonGCStaticsBasePointer (this=0xb1c2a594) at /home/mskvortsov/git/coreclr/src/vm/appdomain.hpp:234
#2  0xb648c48c in CallDescrWorker (pCallDescrData=0xadbfe4c0) at /home/mskvortsov/git/coreclr/src/vm/callhelpers.cpp:135
#3  0xb648c32e in CallDescrWorkerWithHandler (pCallDescrData=0xadbfe4c0, fCriticalCall=0) at /home/mskvortsov/git/coreclr/src/vm/callhelpers.cpp:78
#4  0xb648d374 in MethodDescCallSite::CallTargetWorker (this=0xadbfe62c, pArguments=0xadbfe6a0, pReturnValue=0x0, cbReturnValue=0) at /home/mskvortsov/git/coreclr/src/vm/callhelpers.cpp:645
#5  0xb637b4ee in MethodDescCallSite::Call (this=0xadbfe62c, pArguments=0xadbfe6a0) at /home/mskvortsov/git/coreclr/src/vm/callhelpers.h:433
#6  0xb649a438 in ThreadNative::KickOffThread_Worker (ptr=0xadbfeab8) at /home/mskvortsov/git/coreclr/src/vm/comsynchronizable.cpp:257
#7  0xb644b07c in ManagedThreadBase_DispatchInner (pCallState=0xadbfe93c) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9187
#8  0xb644e8a8 in ManagedThreadBase_DispatchMiddle (pCallState=0xadbfe93c) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9238
#9  0xb644e760 in ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::{lambda(Param*)#1}::operator()(Param*) const (this=0xadbfe884, pParam=0xadbfe8e8) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9476
#10 0xb644e620 in ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const (this=0xadbfe8d0, pArgs=0xadbfe8d8) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9478
#11 0xb644adb0 in ManagedThreadBase_DispatchOuter (pCallState=0xadbfe93c) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9515
#12 0xb644aee6 in ManagedThreadBase_FullTransitionWithAD (pAppDomain=..., pTarget=0xb649a231 <ThreadNative::KickOffThread_Worker(void*)>, args=0xadbfeab8, filterType=ManagedThread) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9536
#13 0xb644ae6e in ManagedThreadBase::KickOff (pAppDomain=..., pTarget=0xb649a231 <ThreadNative::KickOffThread_Worker(void*)>, args=0xadbfeab8) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9571
#14 0xb649a998 in ThreadNative::KickOffThread (pass=0x1063f0) at /home/mskvortsov/git/coreclr/src/vm/comsynchronizable.cpp:376
#15 0xb6441e1c in Thread::intermediateThreadProc (arg=0xd4ec0) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:2584
#16 0xb69a2f02 in CorUnix::CPalThread::ThreadEntry (pvParam=0x10c048) at /home/mskvortsov/git/coreclr/src/pal/src/thread/thread.cpp:1749
#17 0xb6f295b4 in start_thread (arg=0x0) at pthread_create.c:335
#18 0xb6d1caac in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
mskvortsov commented 7 years ago

@dotnet/arm32-contrib Can someone reproduce this? I use a script like this:

$ cat repro.sh
#!/bin/sh
ulimit -c unlimited
#export COMPlus_AltJitAssertOnNYI=0
#export COMPlus_AltJit=*
try=0
while true
do
  try=$((try+1))
  clr-debug/corerun tests-release/JIT/Methodical/cctor/misc/threads1_cs_r/threads1_cs_r.exe >/dev/null 2>&1
  if [ $? = 100 ]
  then
    echo -n .
  else
    echo
    echo Failed on a try \#$try
    exit
  fi
done
$ ./repro.sh 
......................................
Failed on a try #39
$ gdb clr-debug/corerun core
mskvortsov commented 7 years ago

And here is another kind of a stack trace I get:

$ ./repro.sh 
............................
Failed on a try #29
$ gdb clr-debug/corerun core
Reading symbols from clr-debug/corerun...done.
[New LWP 26190]
[New LWP 26183]
[New LWP 26187]
[New LWP 26185]
[New LWP 26184]
[New LWP 26186]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `clr-debug/corerun tests-release/JIT/Methodical/cctor/misc/threads1_cs_r/threads'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  access_mem (as=0xb61bd130 <local_addr_space>, addr=4257793536, val=0xae3f918c, write=0, arg=0xae3fd224) at arm/Ginit.c:86
86  arm/Ginit.c: No such file or directory.
[Current thread is 1 (Thread 0xae3ff450 (LWP 26190))]
(gdb) bt
#0  access_mem (as=0xb61bd130 <local_addr_space>, addr=4257793536, val=0xae3f918c, write=0, arg=0xae3fd224) at arm/Ginit.c:86
#1  0xb61a05b4 in dwarf_get (c=0xae3f9224, c=0xae3f9224, val=0xae3f918c, loc=...) at ../include/tdep-arm/libunwind_i.h:203
#2  _Uarm_step (cursor=0xae3f9224) at arm/Gstep.c:233
#3  0xb687e2fa in PAL_VirtualUnwind (context=0xae3fd640, contextPointers=0x0) at /home/mskvortsov/git/coreclr/src/pal/src/exception/seh-unwind.cpp:309
#4  0xb654df5e in UnwindManagedExceptionPass1 (ex=..., frameContext=0xae3fd640) at /home/mskvortsov/git/coreclr/src/vm/exceptionhandling.cpp:4663
#5  0xb654e3be in DispatchManagedException (ex=..., isHardwareException=true) at /home/mskvortsov/git/coreclr/src/vm/exceptionhandling.cpp:4752
#6  0xb6546074 in HandleHardwareException (ex=0xae3fdb60) at /home/mskvortsov/git/coreclr/src/vm/exceptionhandling.cpp:5243
#7  0xb687dfce in SEHProcessException (exception=0xae3fdb60) at /home/mskvortsov/git/coreclr/src/pal/src/exception/seh.cpp:283
#8  0xb6880036 in common_signal_handler (code=11, siginfo=0xae503e40, sigcontext=0xae503ec0, numParams=2) at /home/mskvortsov/git/coreclr/src/pal/src/exception/signal.cpp:819
#9  0xb687fe2c in signal_handler_worker (code=11, siginfo=0xae503e40, context=0xae503ec0, returnPoint=0xae503a50) at /home/mskvortsov/git/coreclr/src/pal/src/exception/signal.cpp:414
#10 0xb6930b8e in CallSignalHandlerWrapper0 () at /home/mskvortsov/git/coreclr/src/pal/src/arch/arm/callsignalhandlerwrapper.S:31
janvorli commented 7 years ago

@mskvortsov regarding the 2nd dump, are you sure the failure point is in the thread whose stack you've dumped? In my experience, when you load a dump with SIGSEGV, the faulting thread is often different from thread 1, which gdb shows by default. I usually have to run bt for all threads and then pinpoint the one that caused the problem by looking at the current instruction's operands and checking whether the related memory address is valid.
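
A minimal sketch of that workflow, assuming gdb is on PATH and that the executable and core file paths are the ones used earlier in this thread. The helper names `gdb_all_threads_argv` and `dump_all_threads` are hypothetical and only illustrate the idea; `--batch`, `-ex`, `info threads` and `thread apply all bt` are standard gdb options and commands.

```python
import subprocess


def gdb_all_threads_argv(executable, core):
    """Build a gdb command line that lists all threads of a core dump
    and prints a backtrace for each of them, without an interactive prompt."""
    return [
        "gdb", "--batch",
        "-ex", "info threads",         # which frame each thread is stopped in
        "-ex", "thread apply all bt",  # backtrace of every thread
        executable, core,
    ]


def dump_all_threads(executable, core):
    """Run gdb non-interactively and return whatever it printed to stdout."""
    result = subprocess.run(
        gdb_all_threads_argv(executable, core),
        capture_output=True, text=True, check=False,
    )
    return result.stdout


# Example (hypothetical paths, taken from the commands above):
#   print(dump_all_threads("clr-debug/corerun", "core"))
```

The output still has to be read by hand: the faulting thread is the one whose current instruction dereferences an address that gdb cannot read.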

mskvortsov commented 7 years ago

@janvorli I didn't know the first thread gdb shows may differ from the failing one, thanks for pointing this out!

I have checked and can confirm the 2nd dump is indeed from the failing thread:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  access_mem (as=0xb61bd130 <local_addr_space>, addr=4257793536, val=0xae3f918c, write=0, arg=0xae3fd224) at arm/Ginit.c:86
86  arm/Ginit.c: No such file or directory.
[Current thread is 1 (Thread 0xae3ff450 (LWP 26190))]
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0xae3ff450 (LWP 26190) access_mem (as=0xb61bd130 <local_addr_space>, addr=4257793536, val=0xae3f918c, write=0, arg=0xae3fd224) at arm/Ginit.c:86
  2    Thread 0xb6f12000 (LWP 26183) __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  3    Thread 0xb43ff450 (LWP 26187) __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  4    Thread 0xb57bf450 (LWP 26185) __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  5    Thread 0xb613c450 (LWP 26184) 0xb6c9da50 in poll () at ../sysdeps/unix/syscall-template.S:84
  6    Thread 0xb4dff450 (LWP 26186) 0xb6eb7e4e in open () at ../sysdeps/unix/syscall-template.S:84
(gdb) disas
Dump of assembler code for function access_mem:
   0xb619faf0 <+0>: cbnz    r3, 0xb619fafa <access_mem+10>
=> 0xb619faf2 <+2>: ldr r3, [r1, #0]
   0xb619faf4 <+4>: movs    r0, #0
   0xb619faf6 <+6>: str r3, [r2, #0]
   0xb619faf8 <+8>: bx  lr
   0xb619fafa <+10>:    ldr r3, [r2, #0]
   0xb619fafc <+12>:    movs    r0, #0
   0xb619fafe <+14>:    str r3, [r1, #0]
   0xb619fb00 <+16>:    bx  lr
End of assembler dump.
(gdb) p/x $r1 
$1 = 0xfdc8c600
(gdb) p/x *$r1
Cannot access memory at address 0xfdc8c600
(gdb)
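
As a quick cross-check of the transcript above: frame #0 prints `addr` in decimal, while `p/x $r1` prints the register in hex. The sketch below only converts between the two. The comparison against `0xC0000000` assumes the common 3 GiB/1 GiB user/kernel split on 32-bit ARM Linux, which this thread does not confirm for the board in question.

```python
# Values copied from the gdb transcript above.
addr_from_frame = 4257793536   # access_mem(..., addr=4257793536, ...)
r1_value = 0xfdc8c600          # (gdb) p/x $r1

print(hex(addr_from_frame))            # 0xfdc8c600
print(addr_from_frame == r1_value)     # True: the faulting load reads from `addr`

# ASSUMPTION: 3 GiB/1 GiB split, so user space ends below 0xC0000000.
ASSUMED_KERNEL_BASE = 0xC0000000
print(addr_from_frame >= ASSUMED_KERNEL_BASE)  # True under that assumption
```

So the `ldr r3, [r1, #0]` at `access_mem+2` is reading the very address libunwind was asked to fetch, and under the stated assumption that address is outside user space, which would be consistent with gdb's "Cannot access memory" message.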
hqueue commented 7 years ago

@mskvortsov Unfortunately, I couldn't reproduce this using your repro.sh on an RPi3 with last week's CoreCLR (commit a2ecf158bf2), even after what was probably more than 100 tries.

$ ./repro.sh
..................................................................................................................................................................................................................................................................................................................

My RPi3 runs Ubuntu MATE 16.04 with glibc 2.23 and libunwind8 (1.1-4.1), which is where access_mem is defined.

$ dpkg -l |grep libunwind8
ii  libunwind8                            1.1-4.1                                    armhf        library to determine the call-chain of a program - runtime
$ dpkg -l |grep libc6
ii  libc6:armhf                           2.23-0ubuntu9                              armhf        GNU C Library: Shared libraries
ii  libc6-dbg:armhf                       2.23-0ubuntu9                              armhf        GNU C Library: detached debugging symbols
ii  libc6-dev:armhf                       2.23-0ubuntu9                              armhf        GNU C Library: Development Libraries and Header Files
alpencolt commented 7 years ago

@janvorli I've reproduced this issue, and it looks like the crash occurred on Thread 1:

(gdb) i threads
  Id   Target Id         Frame 
* 1    Thread 0xaf7fd450 (LWP 17145) 0xb636aa68 in MethodTable::GetFlag (this=0x1000400, flag=MethodTable::enum_flag_HasComponentSize) at /home/alexander/src/coreclr/src/vm/methodtable.h:3979
  2    Thread 0xaedff450 (LWP 17146) __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  3    Thread 0xadbff450 (LWP 17148) __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  4    Thread 0xad1ff450 (LWP 17149) __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  5    Thread 0xb619a450 (LWP 17141) 0xb6cf7a50 in poll () at ../sysdeps/unix/syscall-template.S:84
  6    Thread 0xb4dff450 (LWP 17143) 0xb6f11e4e in open () at ../sysdeps/unix/syscall-template.S:84
  7    Thread 0xb45ff450 (LWP 17144) __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  8    Thread 0xb57ff450 (LWP 17142) __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  9    Thread 0xae3ff450 (LWP 17147) __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  10   Thread 0xb6f6c000 (LWP 17140) __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46

Almost all other threads are waiting on:

#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0xb6f0edba in __pthread_cond_wait (cond=0xfd5e8, mutex=0xfd5d0) at pthread_cond_wait.c:186
#2  0xb6972b4e in CorUnix::CPalSynchronizationManager::ThreadNativeWait (ptnwdNativeWaitData=0xfd5d0, dwTimeout=4294967295, ptwrWakeupReason=0xadbfcd58, pdwSignaledObject=0xadbfcd54)
    at /home/alexander/src/coreclr/src/pal/src/synchmgr/synchmanager.cpp:475
#3  0xb6972036 in CorUnix::CPalSynchronizationManager::BlockThread (this=0x32330, pthrCurrent=0xfd438, dwTimeout=4294967295, fAlertable=false, fIsSleep=false, ptwrWakeupReason=0xadbfd074, pdwSignaledObject=0xadbfd09c)
    at /home/alexander/src/coreclr/src/pal/src/synchmgr/synchmanager.cpp:298
#4  0xb697f920 in CorUnix::InternalWaitForMultipleObjectsEx (pThread=0xfd438, nCount=1, lpHandles=0xadbfd1a0, bWaitAll=0, dwMilliseconds=4294967295, bAlertable=0) at /home/alexander/src/coreclr/src/pal/src/synchmgr/wait.cpp:561
#5  0xb6980128 in WaitForSingleObjectEx (hHandle=0xa8, dwMilliseconds=4294967295, bAlertable=0) at /home/alexander/src/coreclr/src/pal/src/synchmgr/wait.cpp:96
#6  0xb65750c2 in CLREventWaitHelper2 (handle=0xa8, dwMilliseconds=4294967295, alertable=0) at /home/alexander/src/coreclr/src/vm/synch.cpp:385
#7  0xb6574fb4 in CLREventWaitHelper(void*, unsigned int, int)::$_1::operator()(CLREventWaitHelper(void*, unsigned int, int)::Param*) const (this=0xadbfd254, pParam=0xadbfd25c) at /home/alexander/src/coreclr/src/vm/synch.cpp:411
#8  0xb6574a32 in CLREventWaitHelper (handle=0xa8, dwMilliseconds=4294967295, alertable=0) at /home/alexander/src/coreclr/src/vm/synch.cpp:413
#9  0xb657499a in CLREventBase::WaitEx (this=0x7c588, dwMilliseconds=4294967295, mode=WaitMode_None, syncState=0x0) at /home/alexander/src/coreclr/src/vm/synch.cpp:483
#10 0xb6574844 in CLREventBase::Wait (this=0x7c588, dwMilliseconds=4294967295, alertable=0, syncState=0x0) at /home/alexander/src/coreclr/src/vm/synch.cpp:426
#11 0xb6739772 in GCEvent::Impl::Wait (this=0x7c588, timeout=4294967295, alertable=false) at /home/alexander/src/coreclr/src/vm/gcenv.os.cpp:769
#12 0xb6738f7c in GCEvent::Wait (this=0xb6c5cf78 <WKS::gc_heap::gc_done_event>, timeout=4294967295, alertable=false) at /home/alexander/src/coreclr/src/vm/gcenv.os.cpp:847
#13 0xb6600116 in WKS::gc_heap::wait_for_gc_done (timeOut=-1) at /home/alexander/src/coreclr/src/gc/gc.cpp:10216
#14 0xb660dcf4 in WKS::gc_heap::try_allocate_more_space (acontext=0xf63e8, size=16, gen_number=0) at /home/alexander/src/coreclr/src/gc/gc.cpp:13094
#15 0xb660dec4 in WKS::gc_heap::allocate_more_space (acontext=0xf63e8, size=16, alloc_generation_number=0) at /home/alexander/src/coreclr/src/gc/gc.cpp:13465
#16 0xb66363d8 in WKS::gc_heap::allocate (jsize=16, acontext=0xf63e8) at /home/alexander/src/coreclr/src/gc/gc.cpp:13496
#17 0xb662e3a4 in WKS::GCHeap::Alloc (this=0x45af8, context=0xf63e8, size=16, flags=2) at /home/alexander/src/coreclr/src/gc/gc.cpp:34381
#18 0xb64d0272 in Alloc (size=16, bFinalize=0, bContainsPointers=16777216) at /home/alexander/src/coreclr/src/vm/gchelpers.cpp:241
#19 0xb64cec56 in AllocateArrayEx (arrayType=..., pArgs=0xadbfd834, dwNumArgs=1, bAllocateInLargeHeap=0, bDontSetAppDomain=0) at /home/alexander/src/coreclr/src/vm/gchelpers.cpp:583
#20 0xb64cf876 in AllocateObjectArray (cElements=1, elementType=..., bAllocateInLargeHeap=0) at /home/alexander/src/coreclr/src/vm/gchelpers.cpp:979
#21 0xb6647d8c in ThreadStaticHandleBucket::ThreadStaticHandleBucket (this=0xad2006f8, pNext=0x0, Size=1, pDomain=0x5add0) at /home/alexander/src/coreclr/src/vm/appdomain.cpp:615
#22 0xb6647ef8 in ThreadStaticHandleTable::AllocateHandles (this=0xad2006d8, nRequested=1) at /home/alexander/src/coreclr/src/vm/appdomain.cpp:701
#23 0xb6443ba4 in ThreadLocalBlock::AllocateStaticFieldObjRefPtrs (this=0xad200628, nRequested=1, ppLazyAllocate=0xad2006b0) at /home/alexander/src/coreclr/src/vm/threadstatics.cpp:335
#24 0xb6443aca in ThreadLocalBlock::AllocateThreadStaticHandles (this=0xad200628, pModule=0xb595fb28, pThreadLocalModule=0xad2006a8) at /home/alexander/src/coreclr/src/vm/threadstatics.cpp:303
#25 0xb644420a in ThreadStatics::AllocateAndInitTLM (index=..., pThreadLocalBlock=0xad200628, pModule=0xb595fb28) at /home/alexander/src/coreclr/src/vm/threadstatics.cpp:587
#26 0xb64443ae in ThreadStatics::GetTLM (index=..., pModule=0xb595fb28) at /home/alexander/src/coreclr/src/vm/threadstatics.cpp:617
#27 0xb6444416 in ThreadStatics::GetTLM (pMT=0xb5960e94) at /home/alexander/src/coreclr/src/vm/threadstatics.cpp:626
#28 0xb64e6c80 in JIT_GetGCThreadStaticBase_Helper (pMT=0xb5960e94) at /home/alexander/src/coreclr/src/vm/jithelpers.cpp:1829
#29 0xb64e714c in JIT_GetSharedGCThreadStaticBase (moduleDomainID=3046506992, dwClassDomainID=5) at /home/alexander/src/coreclr/src/vm/jithelpers.cpp:1929
#30 0xaf83d41a in ?? ()

It looks like the crash occurred during garbage collection. I got the SIGSEGV here:

GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from src/corerun...done.
[New LWP 4531]
[New LWP 4534]
[New LWP 4533]
[New LWP 4532]
[New LWP 4528]
[New LWP 4526]
[New LWP 4530]
[New LWP 4529]
[New LWP 4527]
[New LWP 4535]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `src/corerun ../mskvortsov/tests-release/JIT/Methodical/cctor/misc/threads1_cs_r'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xb63bea68 in MethodTable::GetFlag (this=0x1000400, flag=MethodTable::enum_flag_HasComponentSize) at /home/alexander/src/coreclr/src/vm/methodtable.h:3979
3979    /home/alexander/src/coreclr/src/vm/methodtable.h: No such file or directory.
[Current thread is 1 (Thread 0xaf84d450 (LWP 4531))]
(gdb) bt
#0  0xb63bea68 in MethodTable::GetFlag (this=0x1000400, flag=MethodTable::enum_flag_HasComponentSize) at /home/alexander/src/coreclr/src/vm/methodtable.h:3979
#1  0xb63cb8b0 in MethodTable::HasComponentSize (this=0x1000400) at /home/alexander/src/coreclr/src/vm/methodtable.h:1819
#2  0xb6442d88 in MethodTable::GetComponentSize (this=0x1000400) at /home/alexander/src/coreclr/src/vm/methodtable.h:1841
#3  0xb643d756 in MethodTable::SanityCheck (this=0x1000400) at /home/alexander/src/coreclr/src/vm/methodtable.cpp:7580
#4  0xb64404ce in MethodTable::Validate (this=0x1000400) at /home/alexander/src/coreclr/src/vm/methodtable.cpp:9627
#5  0xb6449102 in Object::ValidateInner (this=0xb1553654, bDeep=1, bVerifyNextHeader=1, bVerifySyncBlock=1) at /home/alexander/src/coreclr/src/vm/object.cpp:1733
#6  0xb644864c in Object::Validate (this=0xb1553654, bDeep=1, bVerifyNextHeader=1, bVerifySyncBlock=1) at /home/alexander/src/coreclr/src/vm/object.cpp:1709
#7  0xb660766e in GcInfoDecoder::ReportStackSlotToGC (this=0xaf84a1a4, spOffset=-32, spBase=GC_FRAMEREG_REL, gcFlags=0, pRD=0xaf84aaa0, flags=0, pCallBack=0xb6521829 <GcEnumObject(void*, OBJECTREF*, unsigned int)>, hCallBack=0xaf84b0c8)
    at /home/alexander/src/coreclr/src/vm/gcinfodecoder.cpp:1821
#8  0xb6607e20 in GcInfoDecoder::ReportSlotToGC (this=0xaf84a1a4, slotDecoder=..., slotIndex=10, pRD=0xaf84aaa0, reportScratchSlots=false, inputFlags=0, pCallBack=0xb6521829 <GcEnumObject(void*, OBJECTREF*, unsigned int)>, 
    hCallBack=0xaf84b0c8) at /home/alexander/src/coreclr/src/inc/gcinfodecoder.h:665
#9  0xb66062fa in GcInfoDecoder::EnumerateLiveSlots (this=0xaf84a1a4, pRD=0xaf84aaa0, reportScratchSlots=false, inputFlags=0, pCallBack=0xb6521829 <GcEnumObject(void*, OBJECTREF*, unsigned int)>, hCallBack=0xaf84b0c8)
    at /home/alexander/src/coreclr/src/vm/gcinfodecoder.cpp:934
#10 0xb63cd790 in EECodeManager::EnumGcRefs (this=0x60528, pRD=0xaf84aaa0, pCodeInfo=0xaf84a958, flags=0, pCallBack=0xb6521829 <GcEnumObject(void*, OBJECTREF*, unsigned int)>, hCallBack=0xaf84b0c8, relOffsetOverride=4294967295)
    at /home/alexander/src/coreclr/src/vm/eetwain.cpp:5062
#11 0xb6521d80 in GcStackCrawlCallBack (pCF=0xaf84a738, pData=0xaf84b0c8) at /home/alexander/src/coreclr/src/vm/gcenv.ee.common.cpp:280
#12 0xb646dcc0 in Thread::MakeStackwalkerCallback (this=0xf51b8, pCF=0xaf84a738, pCallback=0xb65219a5 <GcStackCrawlCallBack(CrawlFrame*, void*)>, pData=0xaf84b0c8, uFramesProcessed=2)
    at /home/alexander/src/coreclr/src/vm/stackwalk.cpp:868
#13 0xb646de62 in Thread::StackWalkFramesEx (this=0xf51b8, pRD=0xaf84aaa0, pCallback=0xb65219a5 <GcStackCrawlCallBack(CrawlFrame*, void*)>, pData=0xaf84b0c8, flags=34048, pStartFrame=0x0)
    at /home/alexander/src/coreclr/src/vm/stackwalk.cpp:949
#14 0xb646e6e6 in Thread::StackWalkFrames (this=0xf51b8, pCallback=0xb65219a5 <GcStackCrawlCallBack(CrawlFrame*, void*)>, pData=0xaf84b0c8, flags=34048, pStartFrame=0x0) at /home/alexander/src/coreclr/src/vm/stackwalk.cpp:1032
#15 0xb678a884 in ScanStackRoots (pThread=0xf51b8, fn=0xb6671fb1 <WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int)>, sc=0xaf84b228) at /home/alexander/src/coreclr/src/vm/gcenv.ee.cpp:149
#16 0xb678a604 in GCToEEInterface::GcScanRoots (fn=0xb6671fb1 <WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int)>, condemned=0, max_gen=2, sc=0xaf84b228) at /home/alexander/src/coreclr/src/vm/gcenv.ee.cpp:178
#17 0xb67c1d26 in GCScan::GcScanRoots (fn=0xb6671fb1 <WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int)>, condemned=0, max_gen=2, sc=0xaf84b228) at /home/alexander/src/coreclr/src/gc/gcscan.cpp:155
#18 0xb66666f6 in WKS::gc_heap::mark_phase (condemned_gen_number=0, mark_only_p=0) at /home/alexander/src/coreclr/src/gc/gc.cpp:19553
#19 0xb6664550 in WKS::gc_heap::gc1 () at /home/alexander/src/coreclr/src/gc/gc.cpp:15367
#20 0xb666ca9a in WKS::gc_heap::garbage_collect (n=0) at /home/alexander/src/coreclr/src/gc/gc.cpp:16915
#21 0xb6660a1c in WKS::GCHeap::GarbageCollectGeneration (this=0x45b40, gen=0, reason=reason_alloc_soh) at /home/alexander/src/coreclr/src/gc/gc.cpp:35039
#22 0xb6661dac in WKS::gc_heap::try_allocate_more_space (acontext=0xd5d70, size=16, gen_number=0) at /home/alexander/src/coreclr/src/gc/gc.cpp:13169
#23 0xb6661ec4 in WKS::gc_heap::allocate_more_space (acontext=0xd5d70, size=16, alloc_generation_number=0) at /home/alexander/src/coreclr/src/gc/gc.cpp:13465
#24 0xb668a3d8 in WKS::gc_heap::allocate (jsize=16, acontext=0xd5d70) at /home/alexander/src/coreclr/src/gc/gc.cpp:13496
#25 0xb66823a4 in WKS::GCHeap::Alloc (this=0x45b40, context=0xd5d70, size=16, flags=2) at /home/alexander/src/coreclr/src/gc/gc.cpp:34381
#26 0xb6524272 in Alloc (size=16, bFinalize=0, bContainsPointers=16777216) at /home/alexander/src/coreclr/src/vm/gchelpers.cpp:241
#27 0xb6522c56 in AllocateArrayEx (arrayType=..., pArgs=0xaf84b96c, dwNumArgs=1, bAllocateInLargeHeap=0, bDontSetAppDomain=0) at /home/alexander/src/coreclr/src/vm/gchelpers.cpp:583
#28 0xb6523876 in AllocateObjectArray (cElements=1, elementType=..., bAllocateInLargeHeap=0) at /home/alexander/src/coreclr/src/vm/gchelpers.cpp:979
#29 0xb669bd8c in ThreadStaticHandleBucket::ThreadStaticHandleBucket (this=0xaef006b0, pNext=0x0, Size=1, pDomain=0x5ae00) at /home/alexander/src/coreclr/src/vm/appdomain.cpp:615
#30 0xb669bef8 in ThreadStaticHandleTable::AllocateHandles (this=0xaef00fa0, nRequested=1) at /home/alexander/src/coreclr/src/vm/appdomain.cpp:701
#31 0xb6497ba4 in ThreadLocalBlock::AllocateStaticFieldObjRefPtrs (this=0xaef082a0, nRequested=1, ppLazyAllocate=0xaef005d8) at /home/alexander/src/coreclr/src/vm/threadstatics.cpp:335
#32 0xb6497aca in ThreadLocalBlock::AllocateThreadStaticHandles (this=0xaef082a0, pModule=0xb59b3b28, pThreadLocalModule=0xaef005d0) at /home/alexander/src/coreclr/src/vm/threadstatics.cpp:303
#33 0xb649820a in ThreadStatics::AllocateAndInitTLM (index=..., pThreadLocalBlock=0xaef082a0, pModule=0xb59b3b28) at /home/alexander/src/coreclr/src/vm/threadstatics.cpp:587
#34 0xb64983ae in ThreadStatics::GetTLM (index=..., pModule=0xb59b3b28) at /home/alexander/src/coreclr/src/vm/threadstatics.cpp:617
#35 0xb6498416 in ThreadStatics::GetTLM (pMT=0xb59b4e94) at /home/alexander/src/coreclr/src/vm/threadstatics.cpp:626
#36 0xb653ac80 in JIT_GetGCThreadStaticBase_Helper (pMT=0xb59b4e94) at /home/alexander/src/coreclr/src/vm/jithelpers.cpp:1829
#37 0xb653b14c in JIT_GetSharedGCThreadStaticBase (moduleDomainID=3046851056, dwClassDomainID=5) at /home/alexander/src/coreclr/src/vm/jithelpers.cpp:1929
#38 0xaf88d41a in ?? ()
alpencolt commented 7 years ago

We use an Exynos 5422, which has 4 A7 and 4 A15 cores. The funniest thing is that when I use taskset to run dotnet on only one type of core (e.g. only on A7 or only on A15), the bug cannot be reproduced. But if I use a mixed set (like taskset -c 2-5 corerun ...), the crash occurs.
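
A rough sketch of how the three affinity configurations could be scripted. The CPU numbering is an assumption (CPUs 0-3 as the A7 cluster and 4-7 as the A15 cluster, which would make `2-5` a mixed set as in the command above) and should be checked on the actual board. The helper `taskset_argv` and the names in `CPU_SETS` are hypothetical; `taskset -c <cpu-list> <command>` is the standard util-linux syntax.

```python
# ASSUMPTION: cpu0-3 are Cortex-A7 and cpu4-7 are Cortex-A15 on this board.
CPU_SETS = {
    "a7_only": "0-3",
    "a15_only": "4-7",
    "mixed": "2-5",   # the mixed set mentioned above
}


def taskset_argv(cpu_list, command):
    """Prefix a command with taskset so it only runs on the given CPUs."""
    return ["taskset", "-c", cpu_list] + list(command)


# Test command copied from repro.sh earlier in the thread.
test_cmd = [
    "clr-debug/corerun",
    "tests-release/JIT/Methodical/cctor/misc/threads1_cs_r/threads1_cs_r.exe",
]

for name, cpus in CPU_SETS.items():
    print(name, " ".join(taskset_argv(cpus, test_cmd)))
```

Each printed command line can be dropped into the loop of repro.sh in place of the plain corerun invocation to compare failure rates per configuration.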

alpencolt commented 7 years ago

The Mono team hit a problem caused by differing cache line sizes in the big.LITTLE architecture: http://www.mono-project.com/news/2016/09/12/arm64-icache/

I've checked the source code, and __clear_cache is used only in FlushInstructionCache(); I'm not sure this applies to our case. The bug cannot be reproduced on a Samsung Z3 (by the way, it has only 2 cores with identical architectures).
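
To make the kind of failure described in the Mono post concrete, here is a toy model, not CoreCLR or libgcc code. It assumes a flush routine that issues one invalidate per `stride` bytes, where `stride` is a line size cached from one core type, while the core actually executing has lines of `line_size` bytes. The function names and the 64/128-byte sizes are illustrative assumptions.

```python
def flushed_lines(start, end, stride, line_size):
    """Cache-line indices (in units of line_size) hit when one invalidate
    is issued every `stride` bytes across the range [start, end)."""
    touched = set()
    addr = start - (start % stride)   # align down to the assumed line size
    while addr < end:
        touched.add(addr // line_size)
        addr += stride
    return touched


def missed_lines(start, end, stride, line_size):
    """Lines overlapping [start, end) that never receive an invalidate."""
    needed = set(range(start // line_size, (end - 1) // line_size + 1))
    return sorted(needed - flushed_lines(start, end, stride, line_size))


# Stride learned on a core with 128-byte lines, executed on a core with 64-byte lines:
print(missed_lines(0, 512, stride=128, line_size=64))   # [1, 3, 5, 7]
# Matching sizes, or a stride smaller than the real line size, miss nothing:
print(missed_lines(0, 512, stride=64, line_size=64))    # []
print(missed_lines(0, 512, stride=64, line_size=128))   # []
```

In this model, every other line keeps stale instructions when the stride is too large, which would only show up when a thread migrates between core types. Whether that mechanism is behind the crashes here is not established in this thread.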

lemmaa commented 7 years ago

@alpencolt, could you find time to apply the patch https://github.com/mono/mono/pull/3549/files and share the results here? It is not urgent, but it seems worth trying if you agree.

alpencolt commented 7 years ago

@lemmaa I will try it after closing my current issues. Do you have any thoughts on where flush_icache() might come into play?

alpencolt commented 7 years ago

I got the following exception on the latest master:

FailFast: 

   at System.Diagnostics.Debug.Assert(Boolean condition, String message, String detailMessage)
   at System.Threading.ExecutionContext.Restore(Thread currentThread, ExecutionContext executionContext)

   at System.Environment.FailFast(System.String, System.Exception)
   at System.Diagnostics.Debug.ShowAssertDialog(System.String, System.String, System.String)
   at System.Diagnostics.Debug.Assert(Boolean, System.String, System.String)
   at System.Threading.ExecutionContext.Restore(System.Threading.Thread, System.Threading.ExecutionContext)
Aborted (core dumped)

This assertion failed:

internal static void Restore(Thread currentThread, ExecutionContext executionContext)
{
    Debug.Assert(currentThread == Thread.CurrentThread);
    ...
}
alpencolt commented 7 years ago

The Debug version failed with the same exception as dotnet/runtime#7825. Backtrace:

#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1  0xb6c25648 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#2  0xb6c2634a in __GI_abort () at abort.c:89
#3  0xb6921484 in PROCAbort () at /home/alexander/src/coreclr/src/pal/src/thread/process.cpp:3046
#4  0xb691e188 in PROCEndProcess (hProcess=0xffffff01, uExitCode=123456789, bTerminateUnconditionally=1) at /home/alexander/src/coreclr/src/pal/src/thread/process.cpp:1394
#5  0xb691e2c6 in TerminateProcess (hProcess=0xffffff01, uExitCode=123456789) at /home/alexander/src/coreclr/src/pal/src/thread/process.cpp:1310
#6  0xb62d963e in TerminateOnAssert () at /home/alexander/src/coreclr/src/utilcode/debug.cpp:183
#7  0xb62d9e76 in _DbgBreakCheck (szFile=0xb69bc6a8 "/home/alexander/src/coreclr/src/vm/object.cpp", iLine=1738, szExpr=0xb69bca15 "!CREATE_CHECK_STRING(pMT)", fConstrained=0) at /home/alexander/src/coreclr/src/utilcode/debug.cpp:436
#8  0xb62da254 in _DbgBreakCheckNoThrow (szFile=0xb69bc6a8 "/home/alexander/src/coreclr/src/vm/object.cpp", iLine=1738, szExpr=0xb69bca15 "!CREATE_CHECK_STRING(pMT)", fConstrained=0)
    at /home/alexander/src/coreclr/src/utilcode/debug.cpp:548
#9  0xb62da5ae in DbgAssertDialog (szFile=0xb69bc6a8 "/home/alexander/src/coreclr/src/vm/object.cpp", iLine=1738, szExpr=0xb69bca15 "!CREATE_CHECK_STRING(pMT)") at /home/alexander/src/coreclr/src/utilcode/debug.cpp:735
#10 0xb6389d66 in Object::ValidateInner (this=0xadafddf8, bDeep=1, bVerifyNextHeader=1, bVerifySyncBlock=1) at /home/alexander/src/coreclr/src/vm/object.cpp:1738
#11 0xb6389288 in Object::Validate (this=0xadafddf8, bDeep=1, bVerifyNextHeader=1, bVerifySyncBlock=1) at /home/alexander/src/coreclr/src/vm/object.cpp:1709
#12 0xb654e148 in GcInfoDecoder::ReportStackSlotToGC (this=0xadaf90bc, spOffset=-16, spBase=GC_FRAMEREG_REL, gcFlags=0, pRD=0xadaf99b8, flags=2, pCallBack=0xb6466d3d <GcEnumObject(void*, OBJECTREF*, unsigned int)>, hCallBack=0xadaf9fe0)
    at /home/alexander/src/coreclr/src/vm/gcinfodecoder.cpp:1826
#13 0xb654e8fc in GcInfoDecoder::ReportSlotToGC (this=0xadaf90bc, slotDecoder=..., slotIndex=13, pRD=0xadaf99b8, reportScratchSlots=false, inputFlags=2, pCallBack=0xb6466d3d <GcEnumObject(void*, OBJECTREF*, unsigned int)>, 
    hCallBack=0xadaf9fe0) at /home/alexander/src/coreclr/src/inc/gcinfodecoder.h:665
#14 0xb654cdea in GcInfoDecoder::EnumerateLiveSlots (this=0xadaf90bc, pRD=0xadaf99b8, reportScratchSlots=false, inputFlags=2, pCallBack=0xb6466d3d <GcEnumObject(void*, OBJECTREF*, unsigned int)>, hCallBack=0xadaf9fe0)
    at /home/alexander/src/coreclr/src/vm/gcinfodecoder.cpp:934
#15 0xb630db54 in EECodeManager::EnumGcRefs (this=0x61ff0, pRD=0xadaf99b8, pCodeInfo=0xadaf9870, flags=2, pCallBack=0xb6466d3d <GcEnumObject(void*, OBJECTREF*, unsigned int)>, hCallBack=0xadaf9fe0, relOffsetOverride=4294967295)
    at /home/alexander/src/coreclr/src/vm/eetwain.cpp:5062
#16 0xb6467294 in GcStackCrawlCallBack (pCF=0xadaf9650, pData=0xadaf9fe0) at /home/alexander/src/coreclr/src/vm/gcenv.ee.common.cpp:280
#17 0xb63b01c8 in Thread::MakeStackwalkerCallback (this=0xe7dc8, pCF=0xadaf9650, pCallback=0xb6466eb9 <GcStackCrawlCallBack(CrawlFrame*, void*)>, pData=0xadaf9fe0, uFramesProcessed=52)
    at /home/alexander/src/coreclr/src/vm/stackwalk.cpp:864
#18 0xb63b036a in Thread::StackWalkFramesEx (this=0xe7dc8, pRD=0xadaf99b8, pCallback=0xb6466eb9 <GcStackCrawlCallBack(CrawlFrame*, void*)>, pData=0xadaf9fe0, flags=34048, pStartFrame=0x0)
    at /home/alexander/src/coreclr/src/vm/stackwalk.cpp:945
#19 0xb63b0bee in Thread::StackWalkFrames (this=0xe7dc8, pCallback=0xb6466eb9 <GcStackCrawlCallBack(CrawlFrame*, void*)>, pData=0xadaf9fe0, flags=34048, pStartFrame=0x0) at /home/alexander/src/coreclr/src/vm/stackwalk.cpp:1028
#20 0xb66cc9d4 in ScanStackRoots (pThread=0xe7dc8, fn=0xb65b8bb9 <WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int)>, sc=0xadafa140) at /home/alexander/src/coreclr/src/vm/gcenv.ee.cpp:149
#21 0xb66cc754 in GCToEEInterface::GcScanRoots (fn=0xb65b8bb9 <WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int)>, condemned=0, max_gen=2, sc=0xadafa140) at /home/alexander/src/coreclr/src/vm/gcenv.ee.cpp:178
#22 0xb670d01e in GCScan::GcScanRoots (fn=0xb65b8bb9 <WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int)>, condemned=0, max_gen=2, sc=0xadafa140) at /home/alexander/src/coreclr/src/gc/gcscan.cpp:155
#23 0xb65ad2fe in WKS::gc_heap::mark_phase (condemned_gen_number=0, mark_only_p=0) at /home/alexander/src/coreclr/src/gc/gc.cpp:19585
#24 0xb65ab154 in WKS::gc_heap::gc1 () at /home/alexander/src/coreclr/src/gc/gc.cpp:15396
#25 0xb65b36a2 in WKS::gc_heap::garbage_collect (n=0) at /home/alexander/src/coreclr/src/gc/gc.cpp:16947
#26 0xb65a7620 in WKS::GCHeap::GarbageCollectGeneration (this=0x470b8, gen=0, reason=reason_alloc_soh) at /home/alexander/src/coreclr/src/gc/gc.cpp:35074
#27 0xb65a89b0 in WKS::gc_heap::try_allocate_more_space (acontext=0xe7e08, size=20, gen_number=0) at /home/alexander/src/coreclr/src/gc/gc.cpp:13198
#28 0xb65a8ac8 in WKS::gc_heap::allocate_more_space (acontext=0xe7e08, size=20, alloc_generation_number=0) at /home/alexander/src/coreclr/src/gc/gc.cpp:13494
#29 0xb65d1000 in WKS::gc_heap::allocate (jsize=20, acontext=0xe7e08) at /home/alexander/src/coreclr/src/gc/gc.cpp:13525
#30 0xb65c8fc4 in WKS::GCHeap::Alloc (this=0x470b8, context=0xe7e08, size=20, flags=0) at /home/alexander/src/coreclr/src/gc/gc.cpp:34416
#31 0xb6469816 in Alloc (size=20, bFinalize=0, bContainsPointers=0) at /home/alexander/src/coreclr/src/vm/gchelpers.cpp:241
#32 0xb64689ac in FastAllocatePrimitiveArray (pMT=0xb3c371e0, cElements=8, bAllocateInLargeHeap=0) at /home/alexander/src/coreclr/src/vm/gchelpers.cpp:824
dotnet/runtime#3864 0xb64836d0 in JIT_NewArr1 (arrayMT=0xb3c371e0, size=8) at /home/alexander/src/coreclr/src/vm/jithelpers.cpp:3228
dotnet/runtime#3865 0xab5b4d1e in ?? ()

After passing a few times, the test failed with this stack. As you can see, a garbage collection is triggered by the JIT_NewArr1() call. The crash happens because GetGCSafeMethodTable() returns 0. I also found that, before the crash, the stack walker iterates over a large number of frames, one of which has the Frame::FRAME_ATTR_EXCEPTION attribute.

alpencolt commented 7 years ago

The crash on the debug build is a separate error, dotnet/coreclr#14238. The root cause is the same: an assertion in System.Threading.ExecutionContext.Restore().

echesakov commented 6 years ago

@mskvortsov @alpencolt I am trying to reproduce this issue on my Ubuntu arm machine and the test always passes. Is that still reproducible on your machine? If so, can you please give me the kernel version you are running on?

RussKeldorph commented 6 years ago

@kbaladurin @okodron Are you guys able to repro this?

kbaladurin commented 6 years ago

Yes, the crash is still reproducible for me (on https://github.com/dotnet/coreclr/commit/91ce6edff7b897f1356ba43af7b26565639fb6fc):

$ taskset -c 2-5 ./repro.sh 
.......
Failed on a try #8
$ gdb -c core Linux.arm.Release/corerun
...
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1  0xb6c39648 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#2  0xb6c3a34a in __GI_abort () at abort.c:89
#3  0xb69bc61c in PROCAbort () at /media/kbaladurin/data/dotnet/forked/coreclr/src/pal/src/thread/process.cpp:3068
#4  0xb69bb832 in PROCEndProcess (hProcess=<optimized out>, uExitCode=2, bTerminateUnconditionally=-1372588976) at /media/kbaladurin/data/dotnet/forked/coreclr/src/pal/src/thread/process.cpp:1394
#5  0xb6767de2 in SafeExitProcess (exitCode=3221225477, fAbort=1, sca=SCA_ExitProcessWhenShutdownComplete) at /media/kbaladurin/data/dotnet/forked/coreclr/src/vm/eepolicy.cpp:519
#6  0xb6768fec in EEPolicy::HandleFatalError (exitCode=3221225477, address=<optimized out>, pszMessage=<optimized out>, pExceptionInfo=<optimized out>, errorSource=<optimized out>, 
    argExceptionString=<optimized out>) at /media/kbaladurin/data/dotnet/forked/coreclr/src/vm/eepolicy.cpp:1545
#7  0xb67d763e in ProcessCLRException (pExceptionRecord=0xad9007d0, MemoryStackFp=<optimized out>, pContextRecord=<optimized out>, pDispatcherContext=<optimized out>)
    at /media/kbaladurin/data/dotnet/forked/coreclr/src/vm/exceptionhandling.cpp:1029
#8  0xb67da5e0 in UnwindManagedExceptionPass1 (ex=..., frameContext=0xae2fe358) at /media/kbaladurin/data/dotnet/forked/coreclr/src/vm/exceptionhandling.cpp:4630
#9  0xb67da81a in DispatchManagedException (ex=..., isHardwareException=<optimized out>) at /media/kbaladurin/data/dotnet/forked/coreclr/src/vm/exceptionhandling.cpp:4752
#10 0xb67d66bc in HandleHardwareException (ex=0xae2fe728) at /media/kbaladurin/data/dotnet/forked/coreclr/src/vm/exceptionhandling.cpp:5275
#11 0xb6995ba4 in SEHProcessException (exception=0xae2fe728) at /media/kbaladurin/data/dotnet/forked/coreclr/src/pal/src/exception/seh.cpp:286
#12 0xb6996d70 in common_signal_handler (code=11, siginfo=<optimized out>, sigcontext=0xad903ec0, numParams=<optimized out>)
    at /media/kbaladurin/data/dotnet/forked/coreclr/src/pal/src/exception/signal.cpp:897
#13 0xb6996c7c in signal_handler_worker (code=0, siginfo=0xad903e40, context=0x6, returnPoint=0xad903c40) at /media/kbaladurin/data/dotnet/forked/coreclr/src/pal/src/exception/signal.cpp:436
#14 <signal handler called>
#15 0xafe2a704 in ?? ()
#16 0xafe2a6d8 in ?? ()

The kernel version is 4.9.58-71:

$ uname -a
Linux odroid 4.9.58-71 #1 SMP PREEMPT Wed Oct 25 21:02:48 UTC 2017 armv7l armv7l armv7l GNU/Linux

We use a processor with big.LITTLE architecture:

$ cat /proc/cpuinfo 
processor   : 0
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 84.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part    : 0xc07
CPU revision    : 3

processor   : 1
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 84.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part    : 0xc07
CPU revision    : 3

processor   : 2
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 84.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part    : 0xc07
CPU revision    : 3

processor   : 3
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 84.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part    : 0xc07
CPU revision    : 3

processor   : 4
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 120.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part    : 0xc0f
CPU revision    : 3

processor   : 5
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 120.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part    : 0xc0f
CPU revision    : 3

processor   : 6
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 120.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part    : 0xc0f
CPU revision    : 3

processor   : 7
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 120.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part    : 0xc0f
CPU revision    : 3

Hardware    : ODROID-XU3
Revision    : 0100
Serial      : 0000000000000000

RussKeldorph commented 6 years ago

@kbaladurin @okodron Any chance you guys could investigate this? We don't currently have any way to repro it.

alpencolt commented 6 years ago

@echesakovMSFT @RussKeldorph we are still investigating this issue, but we have a lot of other work with higher priority, so I'm not sure that it will be closed soon.

The crash reproduces only on devices with big.LITTLE architecture. There is a chance that something is wrong in our environment, so we will try to run it on new boards.
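
A quick way to check whether a board mixes core types is to group the `processor` entries of /proc/cpuinfo by their `CPU part` value, as in the ODROID-XU3 listing earlier in this thread. The following is a minimal sketch, not part of the original report: the function names and the shortened `SAMPLE` text are hypothetical, and the part-number-to-core mapping (0xc07 = Cortex-A7, 0xc0f = Cortex-A15) applies to ARM Ltd. parts (implementer 0x41).

```python
from collections import defaultdict

# Part numbers for ARM Ltd. cores (CPU implementer 0x41) seen in the listing above.
KNOWN_PARTS = {"0xc07": "Cortex-A7", "0xc0f": "Cortex-A15"}

# Shortened, hypothetical excerpt in the same format as /proc/cpuinfo on ARM.
SAMPLE = """\
processor   : 0
CPU implementer : 0x41
CPU part    : 0xc07

processor   : 4
CPU implementer : 0x41
CPU part    : 0xc0f
"""


def group_cores_by_part(cpuinfo_text):
    """Map each 'CPU part' value to the processor indices that report it."""
    groups = defaultdict(list)
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines carry no key/value pair
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "CPU part" and current is not None:
            groups[value].append(current)
    return dict(groups)


def is_heterogeneous(cpuinfo_text):
    """True when more than one distinct CPU part is present."""
    return len(group_cores_by_part(cpuinfo_text)) > 1


for part, cores in group_cores_by_part(SAMPLE).items():
    print(part, KNOWN_PARTS.get(part, "unknown"), cores)
```

On a real board, pass the contents of /proc/cpuinfo instead of `SAMPLE`. For the listing above this would put processors 0-3 in the 0xc07 group and 4-7 in the 0xc0f group.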

RussKeldorph commented 6 years ago

Ok, I think we'll most likely have to kick this out of 2.1 for now. I don't see how we can make progress on it soon enough.

RussKeldorph commented 6 years ago

I'm inclined to think this is more likely related to PAL or maybe GC than codegen (do we query the processor cache size anywhere?), but feel free to clear the area label if you disagree.

janvorli commented 5 years ago

This is caused by the problem described in https://www.kayaksoft.com/blog/2016/05/11/random-sigill-on-arm-board-odroid-ux4-with-gdbgdbserver/. The article also describes a kernel patch that mitigates it. I finally got around to patching my Odroid XU4 kernel (the XU4 is based on the Exynos 5422) so that it unconditionally returns a cache line size of 32 for both the little and big cores, and I can confirm that this fixes the problem. With the patch applied, I could build the managed parts of the coreclr repo and all the thousands of coreclr managed tests just fine. Without it, I couldn't even build System.Private.CoreLib.dll.

It seems there is no way to fix this programmatically, because on ARM the cache can be flushed only by the kernel, and the issue is in the kernel code.

Btw, the ARM documentation has some details on different cache sizes in heterogeneous systems: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438e/BABHAEIF.html

It seems that Linux kernel has a fix for the same kind of issue for ARM64 (https://osdn.net/projects/android-x86/scm/git/kernel/commits/116c81f427ff6c5380850963e3fb8798cc821d2b), but not for ARM.
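
The failure mode can be illustrated with a small simulation. This is a sketch under stated assumptions, not kernel code: it assumes the flush loop steps through an address range using the line size reported by whichever core it started on, and that the big cores report 64 bytes while the little cores use 32 (the 32 comes from the patch described above; the 64 is an assumption). The function names are hypothetical.

```python
def lines_touched(start, end, stride):
    """Addresses a flush loop issues maintenance operations for when stepping by `stride`."""
    addr = start & ~(stride - 1)  # align down to the assumed line size
    touched = []
    while addr < end:
        touched.append(addr)
        addr += stride
    return touched


def missed_lines(start, end, assumed_line, actual_line):
    """Cache lines of size `actual_line` in [start, end) that no operation lands in."""
    hit = {a & ~(actual_line - 1) for a in lines_touched(start, end, assumed_line)}
    first = start & ~(actual_line - 1)
    return [a for a in range(first, end, actual_line) if a not in hit]


# Stride taken from a 64-byte core, but the loop ends up running on a 32-byte core:
# every second 32-byte line is skipped and may keep stale instructions.
print([hex(a) for a in missed_lines(0x1000, 0x1100, 64, 32)])

# Always stepping by the smaller size (what the patch forces) misses nothing on either core.
print(missed_lines(0x1000, 0x1100, 32, 32), missed_lines(0x1000, 0x1100, 32, 64))
```

Stepping by the smaller line size costs some redundant operations on the 64-byte cores but never skips a line, which is consistent with the observation that returning 32 unconditionally makes the problem go away.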

alpencolt commented 5 years ago

With the introduction of Tiered JIT, this issue occurs much more frequently and makes CoreCLR almost unusable on ARM32 CPUs with big.LITTLE architecture (thankfully, they are not very popular). Unfortunately, we cannot solve this issue from user space (as Mono did for ARM64), so we're going to prepare a fix for the Linux kernel like the one made for ARM64 (https://github.com/torvalds/linux/commit/116c81f427ff6c5380850963e3fb8798cc821d2b).

The patch from the article helps:

diff --git a/arch/arm/mm/proc-macros.S b/arch/arm/mm/proc-macros.S
index 81d0efb05..045925b04 100644
--- a/arch/arm/mm/proc-macros.S
+++ b/arch/arm/mm/proc-macros.S
@@ -91,16 +91,7 @@
  * on ARMv7.
  */
        .macro  icache_line_size, reg, tmp
-#ifdef CONFIG_CPU_V7M
-       movw    \tmp, #:lower16:BASEADDR_V7M_SCB + V7M_SCB_CTR
-       movt    \tmp, #:upper16:BASEADDR_V7M_SCB + V7M_SCB_CTR
-       ldr     \tmp, [\tmp]
-#else
-       mrc     p15, 0, \tmp, c0, c0, 1         @ read ctr
-#endif
-       and     \tmp, \tmp, #0xf                @ cache line size encoding
-       mov     \reg, #4                        @ bytes per word
-       mov     \reg, \reg, lsl \tmp            @ actual cache line size
+       mov     \reg, #32
        .endm

 /*

But I'm not sure that it will be accepted upstream =)
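
For readers less familiar with the assembly, the lines removed from the `icache_line_size` macro compute the line size from the low four bits of the Cache Type Register (CTR), while the patched macro returns a constant. A minimal sketch of the same arithmetic follows; the Python function name and the sample CTR values are illustrative assumptions, not values read from the boards in this thread.

```python
PATCHED_LINE_SIZE = 32  # the constant the patched macro returns


def icache_line_size(ctr):
    """Mirror the removed macro body: 4 bytes per word, shifted left by CTR bits [3:0]."""
    encoding = ctr & 0xF   # "cache line size encoding"
    return 4 << encoding   # "actual cache line size" in bytes


# Illustrative CTR values: only the low four bits matter to this macro.
for ctr in (0x84448003, 0x8444C004):
    print(hex(ctr), icache_line_size(ctr), "bytes; patched kernel uses", PATCHED_LINE_SIZE)
```

An encoding of 3 yields 32 bytes and an encoding of 4 yields 64 bytes, so two cores that report different encodings produce different loop strides unless the value is pinned as in the patch.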

janvorli commented 5 years ago

@alpencolt can we close the issue? There doesn't seem to be anything we can do about the issue as it is a kernel problem.

alpencolt commented 5 years ago

@janvorli yes. I will add a link to the patch once it is upstream.