dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

.NET 7 - Segmentation Fault on BeagleBone Black (arm32) #78657

Open Limb opened 1 year ago

Limb commented 1 year ago

I'm getting a Segmentation Fault after installing .NET 7.0 SDK (v7.0.100) - Linux Arm32 on a BeagleBone Black running Debian 10.4. I tried both an existing upgrade from 6.0, as well as a fresh SD card image.

Running just the dotnet command works fine, but any other command (even dotnet --version) results in a segmentation fault.

debian@beaglebone:~$ dotnet

Usage: dotnet [options]
Usage: dotnet [path-to-application]

Options:
  -h|--help         Display help.
  --info            Display .NET information.
  --list-sdks       Display the installed SDKs.
  --list-runtimes   Display the installed runtimes.

path-to-application:
  The path to an application .dll file to execute.
debian@beaglebone:~$ dotnet --version
Segmentation fault

Under strace dotnet --version, the command seems to output properly (I am not a strace expert by any means, but I am seeing a proper "write" call in the output):

clock_gettime(CLOCK_MONOTONIC, {tv_sec=1948, tv_nsec=283537198}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=1948, tv_nsec=285314775}) = 0
cacheflush(0xb55e7948, 0xb55e7960, 0)   = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=1948, tv_nsec=290090158}) = 0
cacheflush(0xab7cab64, 0xab7cabd0, 0)   = 0
ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(69, "7.0.100", 77.0.100)                 = 7
write(69, "\n", 1
)                      = 1
unlink("/tmp/dotnet-diagnostic-13773-193303-socket") = 0
futex(0x216f248, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x216f208, FUTEX_WAKE_PRIVATE, 1) = 1
unlink("/tmp/clr-debug-pipe-13773-193303-in") = 0
unlink("/tmp/clr-debug-pipe-13773-193303-out") = 0
write(4, "\3", 1)                       = 1
clock_gettime(CLOCK_MONOTONIC, {tv_sec=1948, tv_nsec=318814829}) = 0
munmap(0xb6ca3000, 225874)              = 0
exit_group(0)                           = ?
+++ exited with 0 +++
dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 1 year ago

Tagging subscribers to this area: @vitek-karas, @agocke, @vsadov See info in area-owners.md if you want to be subscribed.

Issue Details
Author: Limb
Assignees: marcpopMSFT
Labels: `area-Host`, `untriaged`
Milestone: -
marcpopMSFT commented 1 year ago

Routing to runtime as a likely issue in the runtime or host itself.

elinor-fung commented 1 year ago

@Limb Would it be possible for you to get/share a core dump of the crash?

Limb commented 1 year ago

@elinor-fung Hopefully I created the core dump successfully, please see core

elinor-fung commented 1 year ago

Seems to be in Microsoft.TemplateEngine.Cli.dll:

[0x0]   0xb556c001!+   0xbebcc000   0xabb0b04b   
[0x1]   Microsoft_TemplateEngine_Cli + 0x5c04b!Microsoft_TemplateEngine_Cli+0x5c04b   0xbebcc000   0xabb09ca1   
[0x2]   Microsoft_TemplateEngine_Cli + 0x5aca1!Microsoft_TemplateEngine_Cli+0x5aca1   0xbebcc050   0xabaec809   
[0x3]   Microsoft_TemplateEngine_Cli + 0x3d809!Microsoft_TemplateEngine_Cli+0x3d809   0xbebcc068   0xae8333a9   
[0x4]   dotnet_ae760000 + 0xd33a9!dotnet_ae760000+0xd33a9   0xbebcc080   0xae82e42b   
[0x5]   dotnet_ae760000 + 0xce42b!dotnet_ae760000+0xce42b   0xbebcc0b8   0x0   

@dotnet/dotnet-diag I'm failing at getting managed stacks from the dump. Can someone tell me what I'm missing? I'm running a docker container with linux arm32 and getting this:

(lldb) sosstatus
Target OS: LINUX Architecture: Arm ProcessId: 6323 (0x18B3)
#0 .NET Core runtime at 00000000B677F000 size 0046E2B9 index ad7c731feb7243bba7f5fa9b0f931444aaad01e0
    Runtime module path: /home/issues/libcoreclr.so
    Runtime module directory: /home/issues
    DAC: /home/issues/libmscordaccore.so

Current symbol store settings:
-> Cache: /root/.dotnet/symbolcache
-> Server: https://msdl.microsoft.com/download/symbols/ Timeout: 4 RetryCount: 0
GC memory usage for managed SOS components: 628,552 bytes
(lldb) clrstack
Failed to load data access module, 0x80004002
Can not load or initialize libmscordaccore.so. The target runtime may not be initialized.

For more information see https://go.microsoft.com/fwlink/?linkid=2135652
ClrStack  failed
michaldobrodenka commented 1 year ago

I think that dump might not be complete or doesn't contain managed info. I'm not an expert on this, but I had problems with core dumps some time ago (also on arm32).

image

It usually means:

but i may be wrong :)

tommcdon commented 1 year ago

> @dotnet/dotnet-diag I'm failing at getting managed stacks from the dump. Can someone tell me what I'm missing?

> I think that dump might not be complete or doesn't contain managed info. I'm not expert on this, but I had problems with core dumps some time ago (also on arm32).

Hi @elinor-fung and @michaldobrodenka! It does indeed look like the dump is incomplete. For more details on crash dump generation see here. I suggest collecting a full dump by setting DOTNET_DbgMiniDumpType=4. Hope this helps!
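As a rough sketch of the suggestion above: DOTNET_DbgEnableMiniDump and DOTNET_DbgMiniDumpType are the documented .NET environment variables for crash dumps, with type 4 requesting a full dump. The helper name full_dump_env is made up for illustration, and the dump location is left at the runtime default.

```shell
#!/bin/sh
# Hypothetical helper: prints the environment settings that ask the .NET
# runtime to write a full dump when the process crashes.
full_dump_env() {
  printf '%s\n' \
    'DOTNET_DbgEnableMiniDump=1' \
    'DOTNET_DbgMiniDumpType=4'
}

# Run the failing command with those settings, only if dotnet is installed.
if command -v dotnet >/dev/null 2>&1; then
  # Word splitting of the helper output is intended here.
  env $(full_dump_env) dotnet --version || echo "dotnet exited with status $?"
fi
```

The settings only apply to the process they are passed to, so they do not need to be exported for the whole session.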

elinor-fung commented 1 year ago

Thanks, @michaldobrodenka and @tommcdon.

@Limb could you try collecting a full dump per https://github.com/dotnet/runtime/issues/78657#issuecomment-1325159100?

Limb commented 1 year ago

@elinor-fung here is the core dump using DOTNET_DbgMiniDumpType=4

Limb commented 1 year ago

I will add that, by chance, I ran the dotnet --version command again after generating the dump and it worked, but executing the command once more resulted in a segmentation fault. So the fault is intermittent rather than happening on every run.

debian@beaglebone:~$ dotnet --version
7.0.100
debian@beaglebone:~$ dotnet --version
Segmentation fault (core dumped)
debian@beaglebone:~$ dotnet --version
Segmentation fault (core dumped)
debian@beaglebone:~$ dotnet --version
Segmentation fault (core dumped)
debian@beaglebone:~$ dotnet --version
7.0.100
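Because the crash is intermittent, a single run says little. A minimal sketch for estimating how often it happens is below; count_failures is a hypothetical helper name, and the run count of 20 is arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print how many runs failed.
# Usage: count_failures N command [args...]
count_failures() {
  runs=$1
  shift
  failed=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard the command's own output; only the exit status matters here.
    "$@" >/dev/null 2>&1 || failed=$((failed + 1))
    i=$((i + 1))
  done
  echo "$failed"
}

# Only attempt the real measurement when dotnet is installed.
if command -v dotnet >/dev/null 2>&1; then
  echo "failed runs out of 20: $(count_failures 20 dotnet --version)"
fi
```

The same helper can be pointed at a workaround variant of the command to compare failure counts before and after.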
elinor-fung commented 1 year ago

Thanks for the new dump. I'm still unable to get at the managed stacks for some reason.

cc @mikem8361 @hoyosjs who are far more useful than me at looking at dumps from linux arm32 - would one of you be able to take a look?

hoyosjs commented 1 year ago

I've tried looking at this a couple of times, but the dumps are in a pretty bad state. We can't even load the DAC, and trying to attach the debugger and set a breakpoint ends up in hangs and stack overflows, even with LLDB 14. I can only really see:

(lldb) bt all
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV
  * frame #0: 0xbebcc000 0xb556c000
    frame #1: 0xbebcc000 0xabb0b04a
    frame #2: 0xbebcc030 0xabb09ca0
    frame #3: 0xbebcc058 0xabaec808
    frame #4: 0xbebcc070 0xae8333a8
    frame #5: 0xbebcc088 0xae82e42a
    frame #6: 0xbebcc0c0 0xb6a52eee libcoreclr.so`CallDescrWorkerInternal + 54 at unixasmmacrosarm.inc:662
  thread #2, stop reason = signal 0
    frame #0: 0xb6767cf0 0xb6c70784 libc.so.6`__libc_start_main(main=(libc.so.6`__strptime_internal + 9317 at strptime_l.c:1110:8), argc=-1233748704, argv=0x00000074, init=0x00000000, fini=0xb67684b0, rtld_fini=0x9c98e400, stack_end=0xb6768440) + 376 at libc-start.c:333
  thread #3, stop reason = signal 0
    frame #0: 0xb5dfea00 0xb6c70784 libc.so.6`__libc_start_main(main=(libc.so.6`__strptime_internal + 9317 at strptime_l.c:1110:8), argc=-1254095296, argv=0x00000074, init=0x00000000, fini=0x00000000, rtld_fini=0x00000000, stack_end=0x00000000) + 376 at libc-start.c:333
  thread #4, stop reason = signal 0
    frame #0: 0xb53fe9c8 0xb6f24524 libpthread.so.0`check_add_mapping(name=0x00000000, namelen=18121792, fd=131072, existing=0x00000000) + 48 at sem_open.c:71
  thread #5, stop reason = signal 0
    frame #0: 0xb4bfdb40 0xb6f24524 libpthread.so.0`check_add_mapping(name="\x02", namelen=128, fd=0, existing=0x00000000) + 48 at sem_open.c:71
  thread #6, stop reason = signal 0
    frame #0: 0xb13fb7e0 0xb6f24524 libpthread.so.0`check_add_mapping(name="", namelen=128, fd=0, existing=0xb13fb828) + 48 at sem_open.c:71
  thread #7, stop reason = signal 0
    frame #0: 0xaf403b40 0xb6f24524 libpthread.so.0`check_add_mapping(name="", namelen=128, fd=0, existing=0xaf403b88) + 48 at sem_open.c:71

Loading the DAC gives CORDBG_E_MISSING_DEBUGGER_EXPORTS:

 Error: 0 : CreateRuntime FAILED: Microsoft.Diagnostics.Runtime.ClrDiagnosticsException: Failure loading DAC: CreateDacInstance failed 0x80131c4f
   at Microsoft.Diagnostics.Runtime.DacLibrary..ctor(DataTarget dataTarget, String dacPath, UInt64 runtimeBaseAddress)
   at Microsoft.Diagnostics.Runtime.ClrInfo.ConstructRuntime(String dac)
   at Microsoft.Diagnostics.Runtime.ClrInfo.CreateRuntime(String dacPath, Boolean ignoreMismatch)
   at Microsoft.Diagnostics.DebugServices.Implementation.Runtime.CreateRuntime()
 Error: 0 : CLRDataCreateInstance FAILED 80131C4F
 Information: 0 : DataTargetWrapper.Destroy
Failed to load data access module, 0x80004002

The GNU Hash comes back empty (it shouldn't, since it should be able to map in the elf file and read from it, cc @mikem8361). Even after fixing up the addresses, things are so corrupt I can only get this:

(lldb) clrstack -f
OS Thread Id: 0x18b3 (1)
Child SP       IP Call Site
BEBCC000 B556C000
BEBCC000 ABB0B04A
BEBCC030 ABB09CA0
BEBCC058 ABAEC808
BEBCC070 AE8333A8
BEBCC088 AE82E42A
BEBCC0C0 B6A52EEE libcoreclr.so!CallDescrWorkerInternal + 54 at /__w/1/s/src/coreclr/pal/inc/unixasmmacrosarm.inc:664
BEBCC88C          [DynamicHelperFrame: bebcc88c]
BEBCC900 AE83153C dotnet.dll!/home/debian/dotnet/sdk/7.0.100/dotnet.dll!Unknown + 140 <- r2r md: 06000BDA
BEBCC9D8 AE831166 dotnet.dll!/home/debian/dotnet/sdk/7.0.100/dotnet.dll!Unknown + 486

If I look at the Frame's TransitionBlock I see:

(TransitionBlock *) $4 = 0xbebcc8cc {
   = {
    m_calleeSavedRegisters = (r4 = 0xb6a5367f, r5 = 0x00000000, r6 = 0xb14085d0, r7 = 0xb141f89c, r8 = 0xb3400228, r9 = 0xb141f8c8, r10 = 0x00000000, r11 = 0xbebcc9c8, r14 = 0xae83153d)
     = (r4 = 0xb6a5367f, r5 = 0x00000000, r6 = 0xb14085d0, r7 = 0xb141f89c, r8 = 0xb3400228, r9 = 0xb141f8c8, r10 = 0x00000000, r11 = 0xbebcc9c8, m_ReturnAddress = 0xae83153d)
  }
  m_argumentRegisters = {
    r = ([0] = 0xb141f8c8, [1] = 0x00035848, [2] = 0x000e5db5, [3] = 0xae865d8d)
  }
}

The main issue is that something is also busted with the metadata, so I can't see names. The return address doesn't give me anything in the jit frame and the disassembly around it makes little sense. @elinor-fung you might be more familiar with DynamicHelper, what might be useful to check?

hoyosjs commented 1 year ago

Using ILSpy I see the last two frames have md tokens that are Microsoft.DotNet.Cli.Program.Main -> Microsoft.DotNet.Cli.Program.ProcessArgs. ~They are both r2r, but the disassembly is also odd for ARM.~ (The oddity was LLDB failing to detect it was thumbv2 assembly). Maybe running with DOTNET_ReadyToRun=0 might help?

The assembly right before the call is:

d152e: 49 f6 6a 74     movw    r4, #40810
d1532: c0 f2 11 04     movt    r4, #17
d1536: 7c 44           add     r4, pc
d1538: 23 68           ldr     r3, [r4]
d153a: 98 47           blx     r3

with fixups it's

   0xae83100e   movw   r4, #0xc3b6
   0xae831012   movt   r4, #0x11
   0xae831016   add    r4, pc
   0xae831018   ldr    r1, [r4]
   0xae83101a   blx    r1

The assembly doesn't explain the next frames in the stack though, so not sure I am interpreting this quite right.

JustSuperHuman commented 1 year ago

I'm wondering if anyone has an update on this. Does it happen w/ the .NET 8 beta?

hoyosjs commented 1 year ago

I don't think anyone is actively working on this. I am not sure anything has changed here.

Limb commented 1 year ago

Sorry, I got sidetracked dealing with this. I will run the additional requested tests and also try this with .NET 8 and report back this weekend.

On Fri, Mar 10, 2023, at 6:33 PM, Juan Hoyos wrote:

I don't think anyone is actively working on this. I am not sure anything has changed here.


Limb commented 1 year ago

Some good and bad news to report. The issue still happens with SDK 8.0.100-preview.1.

However, running commands with DOTNET_ReadyToRun=0 as recommended by @hoyosjs results in the dotnet command running properly. This is true for both .NET 7 and 8.

I was able to run dotnet --version, dotnet new console, and dotnet run without issues.
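For reference, the workaround can be applied per command rather than globally. This is only a sketch: without_r2r is a hypothetical wrapper name, while DOTNET_ReadyToRun=0 is the setting quoted in the thread, which makes the runtime skip precompiled ReadyToRun code and JIT-compile instead.

```shell
#!/bin/sh
# Hypothetical wrapper: run any command with ReadyToRun code disabled, so the
# runtime JIT-compiles methods instead of using precompiled images.
without_r2r() {
  env DOTNET_ReadyToRun=0 "$@"
}

# Try it on the command from this issue, only if dotnet is installed.
if command -v dotnet >/dev/null 2>&1; then
  without_r2r dotnet --version
fi
```

Startup is likely to be slower this way, since code that would otherwise be precompiled has to be jitted on every launch.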

JustSuperHuman commented 1 year ago

Ty for testing @Limb - Weird enough I'm testing now w/ 7.0.200 and it seems to be working without adding the ReadyToRun flag 🤔

image

JustSuperHuman commented 1 year ago

Sadly, I ran into new issues. I am able to run projects normally by launching w/ dotnet program.dll (.NET 7 and 8), but I get a segmentation fault when running from a service: systemctl start program (works fine w/ .NET 6).

Just putting it out there in case anyone finds a way around it.

Tried putting DOTNET_ReadyToRun=0 in /etc/environment but no go.
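The comment does not say why the /etc/environment attempt failed. One plausible explanation, offered here as an assumption, is that systemd does not read /etc/environment for system services, so the variable never reached the process. A sketch of setting it on the unit itself through a drop-in follows; write_r2r_dropin is a hypothetical helper, and the unit name program is taken from the systemctl start program command above.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that sets DOTNET_ReadyToRun=0
# for one unit. $1 is the drop-in directory, for example
# /etc/systemd/system/program.service.d (the real location needs root).
write_r2r_dropin() {
  dir=$1
  mkdir -p "$dir" || return 1
  cat > "$dir/override.conf" <<'EOF'
[Service]
Environment=DOTNET_ReadyToRun=0
EOF
}

# Demonstrate against a scratch directory so nothing on the system changes.
scratch=$(mktemp -d)
write_r2r_dropin "$scratch/program.service.d"
cat "$scratch/program.service.d/override.conf"

# On the real system, after writing the drop-in:
#   systemctl daemon-reload
#   systemctl restart program
```

Whether this resolves the service crash is untested here; it only ensures the variable is actually present in the service's environment.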

outlookhazy commented 1 year ago

I'm not positive this is related, but I've been running into an issue with a .net7 published application running on BBB with behavior that sounds extremely similar (inconsistent segfaults). For what it's worth, the application appears to run fine if I turn off address randomization for the process (setarch -R ./program). I recompiled under .net6 with the only change being some datetime microseconds references (that aren't supported) and the application runs correctly every time. Unfortunately that native high-precision timing support is necessary for my application.

A couple more datapoints:

I can't prove it, but it almost seems like it's more likely to reach execution of application code as the system gets loaded down. This smells like some kind of use-after-free race condition with maybe garbage collection(?)
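To check whether address randomization is in play on a given board, the kernel setting can be read directly. The sketch below assumes a Linux system; aslr_mode is a hypothetical helper name, and ./program is a placeholder binary name as in the comment above.

```shell
#!/bin/sh
# Hypothetical helper: describe the kernel ASLR setting stored in a file.
# Defaults to the real sysctl file; a path can be passed for testing.
aslr_mode() {
  file=${1:-/proc/sys/kernel/randomize_va_space}
  case "$(cat "$file" 2>/dev/null)" in
    0) echo "disabled" ;;
    1) echo "partial" ;;
    2) echo "full" ;;
    *) echo "unknown" ;;
  esac
}

echo "system-wide ASLR: $(aslr_mode)"

# Per-process workaround from the comment above (placeholder binary name).
# Older util-linux versions may require an explicit architecture argument.
#   setarch -R ./program
```

Using setarch -R only affects the launched process, which is less invasive than changing the system-wide sysctl.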

bklop commented 9 months ago

Any update to this? Will this be fixed in a future release?

We ran into this after upgrading our application from .net 6 to 8, took a while to figure out that the problem was not in our code. At least the workaround mentioned by @outlookhazy seems to work for now (disabling ASLR)...

Blackclaws commented 5 days ago

So we're still seeing this issue on the current .NET 8.0. The workaround with DOTNET_ReadyToRun=0 seems to work, but this should get fixed somehow.

michaldobrodenka commented 5 days ago

This issue fits the pattern mentioned here: https://github.com/dotnet/runtime/issues/102396#issuecomment-2358863491

It's still a small sample, but maybe there is a problem with older ARMv7 cores. The BeagleBone Black has an ARM Cortex-A8. Maybe it's connected to VFPv3 vs VFPv4.
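Anyone wanting to add a data point to that comparison could report which VFP level their board advertises. A minimal sketch, assuming an ARM Linux system where /proc/cpuinfo has a Features line listing flags such as vfpv3 or vfpv4; vfp_level is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: report the highest VFP level listed on the first
# "Features" line of a cpuinfo-style file (default: /proc/cpuinfo).
vfp_level() {
  file=${1:-/proc/cpuinfo}
  features=$(grep -m1 '^Features' "$file" 2>/dev/null)
  # Pad with spaces so each flag can be matched as a whole word.
  case " $features " in
    *" vfpv4 "*) echo "vfpv4" ;;
    *" vfpv3 "*|*" vfpv3d16 "*) echo "vfpv3" ;;
    *) echo "none reported" ;;
  esac
}

echo "VFP level: $(vfp_level)"
```

This only shows what the kernel reports; it does not by itself confirm that the crash is related to the floating-point unit.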