dotnet / runtime


Segmentation fault on arm32 (raspberry-pi3) #8829

Closed karelz closed 4 years ago

karelz commented 7 years ago

From @SteveL-MSFT on August 29, 2017 22:25

After building powershell with runtime linux-arm, it runs until it hits a second ManualResetEvent::WaitOne() call and results in SegFault. Stack trace from gdb:

Thread 23 "powershell" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x694e1450 (LWP 11108)]
0x76692ecc in VirtualCallStubManager::predictStubKind(unsigned int) () from /home/pi/powershell/libcoreclr.so
(gdb) backtrace
#0  0x76692ecc in VirtualCallStubManager::predictStubKind(unsigned int) () from /home/pi/powershell/libcoreclr.so
#1  0x766981d6 in VirtualCallStubManager::getStubKind(unsigned int) () from /home/pi/powershell/libcoreclr.so
#2  0x766951b4 in VirtualCallStubManager::FindStubManager(unsigned int, VirtualCallStubManager::StubKind*) ()
   from /home/pi/powershell/libcoreclr.so
#3  0x7669698e in VSD_ResolveWorker () from /home/pi/powershell/libcoreclr.so
#4  0x7673cb30 in ResolveWorkerAsmStub () from /home/pi/powershell/libcoreclr.so
#5  0x687ca346 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 

Copied from original issue: dotnet/corefx#23660

karelz commented 7 years ago

From @danmosemsft on August 29, 2017 22:32

@janvorli

SteveL-MSFT commented 7 years ago

Thanks for opening this in the right repo :)

danmoseley commented 7 years ago

@janvorli should this go to Tizen?

whatevergeek commented 7 years ago

Hmmm... I doubt it... the main objective here is to have PowerShell (.NET Core 2.0) running on Raspberry Pi 3... but if there's Tizen stuff that can help resolve this, perhaps we can point the Tizen people to this issue...

janvorli commented 7 years ago

@danmosemsft I will take a look myself. @SteveL-MSFT could you please provide me with steps or a pointer to steps on how to build powershell targeting ARM Linux and to repro the issue?

SteveL-MSFT commented 7 years ago

@janvorli you can clone https://github.com/stevel-msft/powershell/tree/raspberry-pi onto an Ubuntu 16.04 box, install PSCore6, start powershell, run ipmo ./build.psm1, run start-psbootstrap -buildlinuxarm, then start-psbuild -runtime linux-arm. Or tomorrow I can give you ssh access to my pi on corpnet (it's a holiday in the US today).

janvorli commented 7 years ago

@SteveL-MSFT does PSCore6 exist for 16.04 only? I have a 14.04 box and was able to install powershell, but apt-get cannot find a package called PSCore6. I thought you might have meant powershell by that, but the ipmo command doesn't exist there either.

SteveL-MSFT commented 7 years ago

Got @janvorli working

whatevergeek commented 7 years ago

@SteveL-MSFT @janvorli wow! glad to know you got it working... was so looking forward to this... got a snapshot of the repo (is it in another branch?) that i can use? I'd like to run powershell on my pi also.

SteveL-MSFT commented 7 years ago

@whatevergeek just to be clear 'working' means I got him a repro of the crash locally so he can debug, not that we got PowerShell working on arm32 yet

janvorli commented 7 years ago

I've debugged the issue and it is a codegen issue. The ResolveWorkerAsmStub expects to get the indirection cell address, combined with two flag bits, in register R4, but it gets the address of an argument shuffling thunk instead. The managed frame (frame #5 in the stack trace in the issue description above) is a frame of the following function:

DomainNeutralILStubClass.IL_STUB_SecureDelegate_Invoke(System.__Canon, System.__Canon, System.__Canon, System.__Canon, System.__Canon)
=> 0xa87e9a24:  push    {r2, r3, r4, lr}
   0xa87e9a26:  ldr.w   lr, [sp, #16]
   0xa87e9a2a:  str.w   lr, [sp]
   0xa87e9a2e:  ldr.w   lr, [sp, #20]
   0xa87e9a32:  str.w   lr, [sp, #4]
   0xa87e9a36:  ldr     r0, [r0, #20]
   0xa87e9a38:  add.w   r4, r0, #16
   0xa87e9a3c:  ldr     r4, [r0, #12]
   0xa87e9a3e:  ldr     r0, [r0, #4]
   0xa87e9a40:  blx     r4
   0xa87e9a42:  pop     {r2, r3, r4, pc}

This function calls an argument shuffling thunk via the blx r4. The thunk's code is below:

=> 0xb5b062b0:  push    {r4, r5, r6, lr}
   0xb5b062b2:  ldr.w   r12, [r0, #16]
   0xb5b062b6:  addw    r4, sp, #16
   0xb5b062ba:  addw    r5, sp, #16
   0xb5b062be:  mov     r0, r1
   0xb5b062c0:  mov     r1, r2
   0xb5b062c2:  mov     r2, r3
   0xb5b062c4:  ldr.w   r3, [r4], #4
   0xb5b062c8:  ldr.w   r6, [r4], #4
   0xb5b062cc:  str.w   r6, [r5], #4
   0xb5b062d0:  str.w   r12, [sp, #12]
   0xb5b062d4:  pop     {r4, r5, r6, pc}

This thunk replaces the LR saved by its initial push with the value taken from [R0+16], so the pop at the end jumps to the following piece of code:
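To make the stack trick in the thunk easier to follow, here is a rough Python sketch that replays the thunk listing on a toy register file and memory map. It is only an illustration, not real ARM emulation: THUNK and STUB are the addresses from the disassembly, while DELEGATE, STACK_TOP, the argument values, and the saved LR are made-up placeholders.

```python
# Toy replay of the shuffle thunk; a sketch, not real ARM emulation.
THUNK = 0xB5B062B0      # thunk address (from the listing); the IL stub left it in r4
STUB = 0xB59B9F10       # value stored at [r0 + 16] (from the listing)
DELEGATE = 0x1000       # hypothetical object address held in r0
STACK_TOP = 0x8000      # hypothetical sp on entry to the thunk


def run_thunk(regs, mem):
    regs, mem = dict(regs), dict(mem)
    # push {r4, r5, r6, lr}: lowest-numbered register goes to the lowest address
    regs["sp"] -= 16
    for i, name in enumerate(("r4", "r5", "r6", "lr")):
        mem[regs["sp"] + 4 * i] = regs[name]
    regs["r12"] = mem[regs["r0"] + 16]      # ldr.w r12, [r0, #16]
    regs["r4"] = regs["sp"] + 16            # addw r4, sp, #16
    regs["r5"] = regs["sp"] + 16            # addw r5, sp, #16
    # mov r0, r1 / mov r1, r2 / mov r2, r3: shift the register arguments down
    regs["r0"], regs["r1"], regs["r2"] = regs["r1"], regs["r2"], regs["r3"]
    regs["r3"] = mem[regs["r4"]]            # ldr.w r3, [r4], #4
    regs["r4"] += 4
    regs["r6"] = mem[regs["r4"]]            # ldr.w r6, [r4], #4
    regs["r4"] += 4
    mem[regs["r5"]] = regs["r6"]            # str.w r6, [r5], #4
    regs["r5"] += 4
    mem[regs["sp"] + 12] = regs["r12"]      # str.w r12, [sp, #12]: overwrite saved LR
    # pop {r4, r5, r6, pc}: r4-r6 are restored, pc takes the overwritten slot
    for i, name in enumerate(("r4", "r5", "r6", "pc")):
        regs[name] = mem[regs["sp"] + 4 * i]
    regs["sp"] += 16
    return regs


entry = {"r0": DELEGATE, "r1": 1, "r2": 2, "r3": 3, "r4": THUNK,
         "r5": 0x55, "r6": 0x66, "lr": 0xA87E9A43, "sp": STACK_TOP}
memory = {DELEGATE + 16: STUB, STACK_TOP: 4, STACK_TOP + 4: 5}
out = run_thunk(entry, memory)
print(hex(out["pc"]), hex(out["r4"]))
```

In this model the thunk uses r4 only as a scratch pointer and then restores it, so whatever the caller left in r4 is still there when control reaches the code at [R0+16].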

=> 0xb59b9f10:  ldr.w   r12, [pc, #8]   ; 0xb59b9f1c
   0xb59b9f14:  ldr.w   pc, [pc]        ; 0xb59b9f18

The values at the pc and pc + 8 are as follows:

(gdb) x/2dx 0xb59b9f18
0xb59b9f18:     0xb66f2ced      0x0000000c

So this piece of code jumps to 0xb66f2ced, which is the ResolveWorkerAsmStub asm helper. And now we are coming to the culprit. As I've already said, this asm helper expects R4 to contain the indirection cell address. But as you can see, the argument shuffling thunk didn't touch R4 and so we get the R4 that came from the DomainNeutralILStubClass.IL_STUB_SecureDelegate_Invoke. And as you can see, R4 was used to jump to the argument shuffling thunk so it contains its address.
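As a side check on the jump target quoted above, the literal addresses in the two-instruction stub can be recomputed by hand. The sketch below assumes the usual Thumb rule that a PC-relative load sees the PC as the instruction address plus 4, rounded down to a word boundary; the helper names are hypothetical, and the memory values are the two words from the gdb dump.

```python
def thumb_pc(instr_addr):
    """PC value seen by a PC-relative load in Thumb state: address + 4, word aligned."""
    return (instr_addr + 4) & ~3


def literal_addr(instr_addr, offset):
    """Address read by 'ldr Rt, [pc, #offset]' located at instr_addr."""
    return thumb_pc(instr_addr) + offset


# The two words shown by gdb at 0xb59b9f18.
literals = {0xB59B9F18: 0xB66F2CED, 0xB59B9F1C: 0x0000000C}

r12_src = literal_addr(0xB59B9F10, 8)   # ldr.w r12, [pc, #8]
pc_src = literal_addr(0xB59B9F14, 0)    # ldr.w pc, [pc]

r12_value = literals[r12_src]
branch_value = literals[pc_src]
is_thumb = bool(branch_value & 1)       # bit 0 set means the target is Thumb code
code_addr = branch_value & ~1           # address of the target's first instruction

print(hex(r12_src), hex(pc_src), hex(code_addr), is_thumb)
```

The computed literal addresses match the ones gdb annotates in the disassembly (0xb59b9f1c and 0xb59b9f18), and the odd target value 0xb66f2ced is a Thumb-mode pointer to code at 0xb66f2cec.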

So I believe this is a JIT codegen bug. If you look at the generated code of DomainNeutralILStubClass.IL_STUB_SecureDelegate_Invoke, you can see that at 0xa87e9a38 the indirection cell address was loaded into R4, but the very next instruction overwrote it with the address that the blx calls a bit later.
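The clobbering can be illustrated with a small Python model of the last four instructions of the IL stub before the blx. This is a sketch under assumptions: the object addresses are invented, and the "r12" variant only shows what the registers would look like if the call target were kept in some other scratch register; it is not taken from the actual fix in coreclr.

```python
THUNK = 0xB5B062B0   # call target loaded from [r0 + 12] (from the listing)
OUTER = 0x1000       # hypothetical address of the delegate passed in r0
INNER = 0x2000       # hypothetical address loaded from [OUTER + 20]
TARGET_OBJ = 0x3000  # hypothetical value loaded from [INNER + 4]

memory = {OUTER + 20: INNER, INNER + 12: THUNK, INNER + 4: TARGET_OBJ}


def regs_at_call(target_reg):
    """Register state at the blx when the call target is loaded into target_reg."""
    regs = {"r0": OUTER}
    regs["r0"] = memory[regs["r0"] + 20]         # ldr   r0, [r0, #20]
    regs["r4"] = regs["r0"] + 16                 # add.w r4, r0, #16 (indirection cell)
    regs[target_reg] = memory[regs["r0"] + 12]   # ldr   <target_reg>, [r0, #12]
    regs["r0"] = memory[regs["r0"] + 4]          # ldr   r0, [r0, #4]
    return regs


as_generated = regs_at_call("r4")    # what the JIT emitted: target overwrites r4
with_scratch = regs_at_call("r12")   # assumed alternative: target in another register

print(hex(as_generated["r4"]), hex(with_scratch["r4"]))
```

With the generated sequence, R4 holds the thunk address at the call; with the call target in a different register, R4 would still hold the indirection cell address that ResolveWorkerAsmStub expects.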

janvorli commented 7 years ago

cc: @dotnet/jit-contrib

jkotas commented 7 years ago

Here is the problem: https://github.com/dotnet/coreclr/blob/3297fd43b6d78c025e3befa3b6242229deaa9094/src/jit/codegenlegacy.cpp#L18667

jkotas commented 7 years ago

Also, R4 is loaded as EA_PTRSIZE in the line above. Instead, it should be loaded as EA_BYREF.

mi-hol commented 7 years ago

@janvorli @jkotas great find. How close are you to fixing the root cause?

janvorli commented 7 years ago

@mi-hol I am just building coreclr with a fix so that I can test it with powershell on my RPI3. So I think I will probably send out PR with the fix later today.

janvorli commented 7 years ago

I have confirmed that a fix at the place @jkotas pointed out resolves the PowerShell crash. PowerShell started correctly, and the couple of basic commands I tried worked.

janvorli commented 7 years ago

Fixed by dotnet/coreclr#13922