dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Illegal instruction in libhostfxr.so #109131

Open jpauwels opened 2 weeks ago

jpauwels commented 2 weeks ago

I'm trying to install Jellyfin on a fresh install of Debian Bookworm armhf on a Marvell Armada processor.

$ uname -a
Linux hostname 6.1.0-25-armmp #1 SMP Debian 6.1.106-3 (2024-08-26) armv7l GNU/Linux

$ cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 1 (v7l)
BogoMIPS        : 41.55
Features        : half thumb fastmult vfp edsp thumbee vfpv3 vfpv3d16 tls idivt
CPU implementer : 0x56
CPU architecture: 7
CPU variant     : 0x1
CPU part        : 0x581
CPU revision    : 1

Hardware        : Marvell Armada 370/XP (Device Tree)
Revision        : 0000
Serial          : 0000000000000000

Starting the program fails with an Illegal Instruction. Digging further, it appears that even a simple dotnet --info fails when .NET is installed with wget -O- https://dot.net/v1/dotnet-install.sh | bash /dev/stdin --channel 8.0 --install-dir /usr/local/bin.

Checking the coredump reveals that the offending instruction is in libhostfxr.so, but gives no further information.

# gdb dotnet core
GNU gdb (Debian 13.1-3) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from dotnet...
(No debugging symbols found in dotnet)
[New LWP 393]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `dotnet --info'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0xb6c86bd8 in ?? () from /usr/local/bin/host/fxr/8.0.10/libhostfxr.so
(gdb) disass
No function contains program counter for selected frame.
(gdb) where
#0  0xb6c86bd8 in ?? () from /usr/local/bin/host/fxr/8.0.10/libhostfxr.so
#1  0xb6c86bd0 in ?? () from /usr/local/bin/host/fxr/8.0.10/libhostfxr.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

A Marvell Armada is not the most common ARMv7-A configuration, since it lacks vfpv4 support, but https://github.com/dotnet/runtime/issues/9969 seems to indicate that vfpv3 should be enough. Besides, the shared library seems to be compiled with vfpv3.

# readelf -A /usr/local/bin/host/fxr/8.0.10/libhostfxr.so
Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "7-A"
  Tag_CPU_arch: v7
  Tag_CPU_arch_profile: Application
  Tag_ARM_ISA_use: Yes
  Tag_THUMB_ISA_use: Thumb-2
  Tag_FP_arch: VFPv3-D16
  Tag_ABI_align_needed: 8-byte
  Tag_ABI_align_preserved: 8-byte, except leaf SP
  Tag_ABI_HardFP_use: Deprecated
  Tag_ABI_VFP_args: VFP registers

Does anyone know what could be going on?

dotnet-policy-service[bot] commented 2 weeks ago

Tagging subscribers to this area: @vitek-karas, @agocke, @vsadov. See info in area-owners.md if you want to be subscribed.

KalleOlaviNiemitalo commented 2 weeks ago

IIRC, one can use x/i in GDB to disassemble at an arbitrary address, without needing a known function.

jpauwels commented 2 weeks ago

@KalleOlaviNiemitalo, thanks, that option indeed showed me more info.

Program received signal SIGILL, Illegal instruction.
0xb6c86bd8 in ?? () from /opt/dotnet8/host/fxr/8.0.10/libhostfxr.so
(gdb) disassem
No function contains program counter for selected frame.
(gdb) where
#0  0xb6c86bd8 in ?? () from /opt/dotnet8/host/fxr/8.0.10/libhostfxr.so
#1  0xb6c86bd0 in ?? () from /opt/dotnet8/host/fxr/8.0.10/libhostfxr.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) x/i 0xb6c86bd8
=> 0xb6c86bd8:  vcvt.f64.f32    d17, s18
(gdb) x/i 0xb6c86bd0
   0xb6c86bd0:  ldr.w   r0, [r10, #12]

If I read this correctly, the issue seems to be that d17 is accessed, whereas vfpv3d16 only provides 16 64-bit FPU registers. No idea how that instruction can get generated when Tag_FP_arch: VFPv3-D16 (see readelf output above).
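The register check described above can be sketched in a few lines of Python. This is only an illustration: the function name `registers_outside_d16` is hypothetical, and it assumes that a VFPv3-D16 register file provides d0 through d15 only, so any higher-numbered D register in a disassembled instruction would be unavailable on such a CPU.

```python
import re

# Assumption: a VFPv3-D16 FPU exposes only d0..d15 (s0..s31 alias those
# same registers, so S operands are not checked here).
D16_MAX_D = 15


def registers_outside_d16(instruction):
    """Return the D registers named in a disassembled instruction
    that a 16-register VFP register file does not provide."""
    numbers = re.findall(r"\bd(\d+)\b", instruction)
    return ["d" + n for n in numbers if int(n) > D16_MAX_D]


# The faulting instruction reported by gdb in this thread.
print(registers_outside_d16("vcvt.f64.f32    d17, s18"))
```

Applied to the instruction at 0xb6c86bd8, the sketch flags d17, which matches the reading that the binary uses a register beyond the D16 set.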

filipnavara commented 2 weeks ago

The default build configuration (https://github.com/dotnet/runtime/blob/9e59acb298c20658788567e0c6f0793fe97d37f6/docs/workflow/building/coreclr/cross-building.md#cross-compiling-coreclr-for-other-vfp-configurations) is supposed to require ARMv7-A with VFPv3 and 16 64-bit or 32 32-bit FPU registers. The instruction is indeed odd.

filipnavara commented 2 weeks ago

It seems the documentation is confusing, if not incorrect or outdated. The default build expects 32 64-bit registers: https://github.com/dotnet/runtime/blob/e70aaa8e2fba2f6caa934501f4b5373790cebe11/eng/native/configurecompiler.cmake#L750-L752

The documentation describes how to override that to build a version that supports VFPv3-D16.

Ref: https://github.com/dotnet/runtime/issues/9969

jpauwels commented 3 days ago

Okay, so it does seem that the default build is using VFPv3, which raises a few questions:

  1. Why does readelf report otherwise?
  2. According to /proc/cpuinfo, my processor has vfpv3 capabilities too, so why does it crash? Is it possible that cpuinfo reports the capabilities incorrectly? It seems the same happened in the linked issue #9969, which also includes vfpv3 (aka VFPv3-D32) in its cpuinfo.
  3. (minor) Why does the build script include both CLR_ARM_FPU_TYPE and CLR_ARM_FPU_CAPABILITY, when (as far as I understand) a value of CLR_ARM_FPU_CAPABILITY=0x3 always needs to be matched with CLR_ARM_FPU_TYPE=vfpv3-d16 and CLR_ARM_FPU_CAPABILITY=0x7 with CLR_ARM_FPU_TYPE=vfpv3? It does not look like it's happening in practice, but this could allow invalid configurations.
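On question 2, one possible reading is that the vfpv3 flag alone does not say how many registers are present. The sketch below assumes the Linux kernel's ARM hwcap names as they appear in /proc/cpuinfo, where vfpv3d16 would mark a 16-register file and vfpd32 a 32-register one. The function name `vfp_register_file` is hypothetical, and the interpretation should be checked against the kernel sources before relying on it.

```python
def vfp_register_file(features_line):
    """Classify the VFP register file from a /proc/cpuinfo 'Features' line.

    Assumption: 'vfpd32' means 32 D registers and 'vfpv3d16' means 16;
    a bare 'vfpv3' flag is treated as saying nothing about the count.
    """
    flags = set(features_line.split(":", 1)[-1].split())
    if "vfpd32" in flags:
        return "d32"
    if "vfpv3d16" in flags:
        return "d16"
    return "unknown"


# The Features line posted at the top of this thread.
line = "Features        : half thumb fastmult vfp edsp thumbee vfpv3 vfpv3d16 tls idivt"
print(vfp_register_file(line))
```

Under that assumption, the cpuinfo above lists vfpv3 together with vfpv3d16 and no vfpd32, which would be consistent with a crash on an instruction that touches d17.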

I'll try again later to build a runtime version that relies only on VFPv3-D16, but so far I haven't managed to get this working. I'll open a separate issue if needed.