dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

createdump fails on RHEL 8/arm64 with "stack smashing detected" #108023

Closed by omajid 1 month ago

omajid commented 1 month ago

Description

I am trying to run createdump against an ASP.NET Core application running on RHEL 8 on arm64/aarch64. This works flawlessly with .NET 8, but fails with .NET 9.

Reproduction Steps

$ ~/dotnet/dotnet new web
$ ~/dotnet/dotnet run
info: Microsoft.Hosting.Lifetime[14]
      Now listening on: http://localhost:5082                                  
info: Microsoft.Hosting.Lifetime[0]
      Application started. Press Ctrl+C to shut down.                          
info: Microsoft.Hosting.Lifetime[0]
      Hosting environment: Development
info: Microsoft.Hosting.Lifetime[0]
      Content root path: /home/omajid/hello

# in a separate terminal
$ ~/dotnet/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/createdump -f dump 33966
[createdump] Gathering state for process 33966 dotnet
[createdump] Writing minidump with heap to file dump
[createdump] Written 267321344 bytes (4079 pages) to core file
[createdump] Target process is alive
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)

Expected behavior

createdump writes the dump and exits cleanly.

Actual behavior

createdump reports that the core file was written, then aborts with "*** stack smashing detected ***".

Regression?

This was working in .NET 8. It was broken on both .NET 9 Preview 7 and .NET 9 RC 1.

Known Workarounds

No response

Configuration

This is the .NET 9 RC 1 SDK published by Microsoft:

$ ~/dotnet/dotnet --info
.NET SDK:
 Version:           9.0.100-rc.1.24452.12
 Commit:            81a714c6d3
 Workload version:  9.0.100-manifests.a7bf2b8f
 MSBuild version:   17.12.0-preview-24422-09+d17ec720d

Runtime Environment:
 OS Name:     rhel
 OS Version:  8
 OS Platform: Linux
 RID:         linux-arm64
 Base Path:   /home/omajid/dotnet/sdk/9.0.100-rc.1.24452.12/

.NET workloads installed:
Configured to use loose manifests when installing new manifests.
There are no installed workloads to display.

Host:
  Version:      9.0.0-rc.1.24431.7
  Architecture: arm64
  Commit:       static

.NET SDKs installed:
  9.0.100-rc.1.24452.12 [/home/omajid/dotnet/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 9.0.0-rc.1.24452.1 [/home/omajid/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 9.0.0-rc.1.24431.7 [/home/omajid/dotnet/shared/Microsoft.NETCore.App]

Other architectures found:
  None

Environment variables:
  Not set

global.json file:
  Not found

Learn more:
  https://aka.ms/dotnet/info

Download .NET:
  https://aka.ms/dotnet/download

This also reproduces with a self-built .NET 9 using the VMR/source-build.

$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux"
VERSION="8.10 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.10 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://issues.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.10
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"

This only happens on arm64/aarch64. It doesn't happen on x64.

$ uname -m
aarch64

Other information

No response

omajid commented 1 month ago

cc @tmds

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

tmds commented 1 month ago

> RHEL 8 on arm64/aarch64

We've had some issues in the past due to the 64 kB page size (like https://github.com/dotnet/runtime/issues/91864); this may be another one of those.
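
For anyone checking their own machine, here is a minimal sketch in Python of how to read the page size a process sees. It uses the standard `os.sysconf` call; the helper name `page_size` is only for illustration and does not come from the runtime. Typical x64 Linux systems report 4096.

```python
import os


def page_size() -> int:
    """Return the memory page size in bytes, as reported by the OS."""
    return os.sysconf("SC_PAGE_SIZE")


if __name__ == "__main__":
    size = page_size()
    print(f"page size: {size} bytes ({size // 1024} KiB)")
```

This should report the same value as `getconf PAGE_SIZE` in a shell.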

mikem8361 commented 1 month ago

@omajid, is there any way you could catch the stack smashing under lldb? It is going to take me a while to put together an RHEL 8 arm64 device.

omajid commented 1 month ago

$ lldb /usr/lib64/dotnet/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/createdump
(lldb) target create "/usr/lib64/dotnet/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/createdump"                                                          
Current executable set to '/usr/lib64/dotnet/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/createdump' (aarch64).
(lldb) r -f dump 34680
Process 35063 launched: '/usr/lib64/dotnet/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/createdump' (aarch64)
[createdump] Gathering state for process 34680 dotnet
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
[createdump] Writing minidump with heap to file dump
Process 35063 stopped
* thread #1, name = 'createdump', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x1000000000000)                                
    frame #0: 0x0000fffff7b22d40 libc.so.6`__GI___memset_generic + 256
libc.so.6`__GI___memset_generic:
->  0xfffff7b22d40 <+256>: dc     zva, x3
    0xfffff7b22d44 <+260>: add    x3, x3, #0x40
    0xfffff7b22d48 <+264>: subs   x2, x2, #0x40
    0xfffff7b22d4c <+268>: b.hi   0xfffff7b22d40            ; <+256>
(lldb) bt
* thread #1, name = 'createdump', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x1000000000000)                                
  * frame #0: 0x0000fffff7b22d40 libc.so.6`__GI___memset_generic + 256
    frame #1: 0x0000aaaaaaac5ad0 createdump`DumpWriter::WriteDiagInfo(unsigned long) [inlined] memset(__dest=0x0000ffffffffa298, __ch=0, __len=65496) at string_fortified.h:74:10 [opt]
    frame #2: 0x0000aaaaaaac5ac0 createdump`DumpWriter::WriteDiagInfo(this=0x0000ffffffffa280, size=<unavailable>) at dumpwriter.cpp:50:5 [opt]              
    frame #3: 0x0000aaaaaaabfbf4 createdump`DumpWriter::WriteDump(this=0x0000ffffffffa280) at dumpwriterelf.cpp:181:18 [opt]                                 
    frame #4: 0x0000aaaaaaabc5f0 createdump`CreateDump(options=0x0000ffffffffe328) at createdumpunix.cpp:89:25 [opt]
(lldb) frame select 2
frame #2: 0x0000aaaaaaac5ac0 createdump`DumpWriter::WriteDiagInfo(this=0x0000ffffffffa280, size=<unavailable>) at dumpwriter.cpp:50:5 [opt]
   47       }
   48       size_t alignment = size - sizeof(header);
   49       assert(alignment < sizeof(m_tempBuffer));
-> 50       memset(m_tempBuffer, 0, alignment);
   51       if (!WriteData(m_tempBuffer, alignment)) {
   52           return false;
   53       }
(lldb) p alignment
(size_t) 65496
(lldb) p size
error: Couldn't materialize: couldn't get the value of variable size: Could not evaluate DW_OP_entry_value.                                                  
error: errored out in DoExecute, couldn't PrepareToExecuteJITExpression
(lldb) p sizeof(header)
(unsigned long) 40
(lldb) p sizeof(m_tempBuffer)
(unsigned long) 16384

Looks like memset is zeroing 65496 bytes in m_tempBuffer, which can only hold 16384 bytes.

$ getconf PAGE_SIZE
65536
$ python3 -c 'print(65536 - 40)'
65496

Yeah, looks like a page size issue like @tmds mentioned above.
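
The arithmetic can be sketched for a few common page sizes. This is an illustration only: it takes `sizeof(header)` and `sizeof(m_tempBuffer)` from the lldb session above and assumes the `size` argument equals the system page size, which matches 65536 - 40 = 65496 here. The function names are hypothetical.

```python
HEADER_SIZE = 40          # sizeof(header), from the lldb session
TEMP_BUFFER_SIZE = 16384  # sizeof(m_tempBuffer), from the lldb session


def padding_for(page_size: int) -> int:
    """Bytes zeroed after the header, assuming size == page size."""
    return page_size - HEADER_SIZE


def overflows(page_size: int) -> bool:
    """True when the padding violates `alignment < sizeof(m_tempBuffer)`."""
    return padding_for(page_size) >= TEMP_BUFFER_SIZE


if __name__ == "__main__":
    for ps in (4096, 16384, 65536):
        verdict = "overflow" if overflows(ps) else "fits"
        print(ps, padding_for(ps), verdict)
```

Under these assumptions only the 64 KiB case exceeds the buffer, which would explain why x64 with 4 KiB pages is unaffected. The assert on line 49 of dumpwriter.cpp guards exactly this condition, but it was presumably compiled out of this optimized build, since the memset ran anyway.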

omajid commented 1 month ago

Digging a bit, it looks like @tmds's changes at https://github.com/dotnet/runtime/pull/91865 were reverted by https://github.com/dotnet/runtime/pull/95433. So the original issue in https://github.com/dotnet/runtime/issues/91864 has re-appeared.

mikem8361 commented 1 month ago

Thanks for figuring this out. Looks like I made the original fix while trying to fix an assert on macOS arm64. I'm looking into how to fix both.
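
One general way to avoid this class of overflow is to emit the zero padding in chunks no larger than the scratch buffer. The sketch below shows that idea in Python. It is not the change made in any of the PRs in this thread, and `write_zero_padding` is a hypothetical name.

```python
import io

TEMP_BUFFER_SIZE = 16384  # same capacity as m_tempBuffer in the lldb session


def write_zero_padding(out, count: int, chunk_size: int = TEMP_BUFFER_SIZE) -> int:
    """Write `count` zero bytes to `out` using a buffer of at most chunk_size bytes."""
    zeros = bytes(chunk_size)  # fixed-size scratch buffer, already zeroed
    written = 0
    while written < count:
        n = min(count - written, chunk_size)
        out.write(zeros[:n])
        written += n
    return written


if __name__ == "__main__":
    buf = io.BytesIO()
    print(write_zero_padding(buf, 65496), len(buf.getvalue()))
```

The padding length can then exceed the buffer size without any single write touching memory past the end of the buffer.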

mikem8361 commented 1 month ago

@omajid, is there any way you could validate that this fix (PR #108166) resolves your issue?

omajid commented 1 month ago

Yes, I should be able to take a VMR checkout, apply this change and see if the resulting SDK has any issues or not. Looking at it now.

tommcdon commented 1 month ago

Moving issue to 9.0.0 for backport

omajid commented 1 month ago

I can confirm this PR makes things work for me again:

$ uname -m
aarch64
$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux"
VERSION="8.10 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.10 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://issues.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.10
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"
$ ~/dotnet-sdk/dotnet --info
.NET SDK:
 Version:           9.0.100-rc.2.24474.1
 Commit:            1f747cd885
 Workload version:  9.0.100-manifests.934ebbcd
 MSBuild version:   17.12.0-preview-24469-05+1f747cd88

Runtime Environment:
 OS Name:     rhel
 OS Version:  8
 OS Platform: Linux
 RID:         rhel.8.10-arm64
 Base Path:   /home/omajid/dotnet-sdk/sdk/9.0.100-rc.2.24474.1/

.NET workloads installed:
There are no installed workloads to display.
Configured to use loose manifests when installing new manifests.

Host:
  Version:      9.0.0-rtm.24473.2
  Architecture: arm64
  Commit:       static

.NET SDKs installed:
  9.0.100-rc.2.24474.1 [/home/omajid/dotnet-sdk/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 9.0.0-rtm.24473.16 [/home/omajid/dotnet-sdk/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 9.0.0-rtm.24473.2 [/home/omajid/dotnet-sdk/shared/Microsoft.NETCore.App]

Other architectures found:
  None

Environment variables:
  Not set

global.json file:
  Not found

Learn more:
  https://aka.ms/dotnet/info

Download .NET:
  https://aka.ms/dotnet/download
$ ~/dotnet-sdk/shared/Microsoft.NETCore.App/9.0.0-rtm.24473.2/createdump 357990
[createdump] Gathering state for process 357990 dotnet
[createdump] Writing minidump with heap to file /tmp/coredump.357990
[createdump] Written 339873792 bytes (5186 pages) to core file
[createdump] Target process is alive
[createdump] Dump successfully written in 360ms

I tried again without the fix in https://github.com/dotnet/runtime/pull/108166 and it continues to crash, confirming that https://github.com/dotnet/runtime/pull/108166 is the fix.
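
As a side check, the "Written N bytes (P pages)" status lines in both createdump logs in this thread are consistent with a 64 KiB page size. The sketch below parses such a line and divides bytes by pages. It assumes the reported page count is the byte total divided by the system page size, rounded down; the thread does not say how the count is computed. `bytes_per_page` is a hypothetical helper.

```python
import re

# Matches the status line printed by createdump in the logs above.
STATUS = re.compile(r"Written (\d+) bytes \((\d+) pages\)")


def bytes_per_page(line: str) -> int:
    """Return bytes // pages from a createdump 'Written' status line."""
    m = STATUS.search(line)
    if m is None:
        raise ValueError("not a createdump 'Written' status line")
    total, pages = int(m.group(1)), int(m.group(2))
    if pages == 0:
        raise ValueError("page count is zero")
    return total // pages


if __name__ == "__main__":
    print(bytes_per_page("[createdump] Written 267321344 bytes (4079 pages) to core file"))
    print(bytes_per_page("[createdump] Written 339873792 bytes (5186 pages) to core file"))
```

Both lines give 65536, matching the `getconf PAGE_SIZE` output earlier in the thread.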

omajid commented 1 month ago

Now that https://github.com/dotnet/runtime/pull/108166 and https://github.com/dotnet/runtime/pull/108208 have been merged, I am going to close this issue.

Thanks!