dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

SIGSEGV in Linux Docker since upgrade to 6.0.10 #76921

Open dmitrykolchev opened 1 year ago

dmitrykolchev commented 1 year ago

Description

Hi!

We started getting a segmentation fault after upgrading the runtime to 6.0.10.

2022-10-12 13:28:48.208 +03:00 [DBG] [] (Td=22, User=) Microsoft.AspNetCore.Mvc.Razor.RazorViewEngine, View lookup cache miss for view '_WidgetAreaWindow' in controller 'Home'.
2022-10-12 13:28:49.062 +03:00 [DBG] [] (Td=8, User=) Orleans.Runtime.SafeTimer, Creating timer Orleans.Runtime.SafeTimerBase with dueTime=00:00:01 period=00:00:01
2022-10-12 13:28:49.081 +03:00 [INF] [] (Td=8, User=) Orleans.OutsideRuntimeClient, ---------- Initializing OutsideRuntimeClient on srv-sam5-002 at 10.0.54.45 Client Id = *cli/17a2958d ----------
2022-10-12 13:28:49.086 +03:00 [INF] [] (Td=8, User=) Orleans.OutsideRuntimeClient, ---------- Starting OutsideRuntimeClient with runtime Version='3.6.5. Commit Hash: 54382a15b653f80784520c9055614cbf429a1b16+54382a15b653f80784520c9055614cbf429a1b16 (Release).' in AppDomain=<AppDomain.Id=1, AppDomain.FriendlyName=SamApp>
2022-10-12 13:28:49.138 +03:00 [DBG] [] (Td=13, User=) Orleans.Runtime.SafeTimer, Creating timer Orleans.Runtime.AsyncTaskSafeTimer with dueTime=00:01:00 period=00:01:00
Segmentation fault
Microsoft.AspNetCore.App 6.0.10 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 6.0.10 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

Run under GDB:

2022-10-12 14:03:59.929 +03:00 [DBG] [] (Td=13, User=) Microsoft.AspNetCore.Mvc.Razor.RazorViewEngine, View lookup cache miss for view 'Components/WidgetArea/Default' in controller 'Home'.
2022-10-12 14:03:59.997 +03:00 [DBG] [] (Td=13, User=) Microsoft.AspNetCore.Mvc.Razor.RazorViewEngine, View lookup cache miss for view '_WidgetAreaWindow' in controller 'Home'.
2022-10-12 14:04:00.870 +03:00 [DBG] [] (Td=16, User=) Orleans.Runtime.SafeTimer, Creating timer Orleans.Runtime.SafeTimerBase with dueTime=00:00:01 period=00:00:01
2022-10-12 14:04:00.889 +03:00 [INF] [] (Td=16, User=) Orleans.OutsideRuntimeClient, ---------- Initializing OutsideRuntimeClient on srv-sam5-002 at 10.0.54.45 Client Id = *cli/5df74aa6 ----------
2022-10-12 14:04:00.893 +03:00 [INF] [] (Td=16, User=) Orleans.OutsideRuntimeClient, ---------- Starting OutsideRuntimeClient with runtime Version='3.6.5. Commit Hash: 54382a15b653f80784520c9055614cbf429a1b16+54382a15b653f80784520c9055614cbf429a1b16 (Release).' in AppDomain=<AppDomain.Id=1, AppDomain.FriendlyName=SamApp>
2022-10-12 14:04:00.951 +03:00 [DBG] [] (Td=8, User=) Orleans.Runtime.SafeTimer, Creating timer Orleans.Runtime.AsyncTaskSafeTimer with dueTime=00:01:00 period=00:01:00

Thread 21 ".NET ThreadPool" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffa277fe700 (LWP 19511)]
0x0000000000000000 in ?? ()
(gdb) info regsiters
Undefined info command: "regsiters".  Try "help info".
(gdb) info registers
rax            0x7ffff6d520c0   140737334550720
rbx            0x19     25
rcx            0x7ffa277fd0e4   140712381239524
rdx            0x0      0
rsi            0x0      0
rdi            0x5555559e59d0   93824997022160
rbp            0x7ffa277fd0d0   0x7ffa277fd0d0
rsp            0x7ffa277fd058   0x7ffa277fd058
r8             0x0      0
r9             0x19     25
r10            0x6      6
r11            0x0      0
r12            0x5555559e56c0   93824997021376
r13            0x7ffa277fd0e4   140712381239524
r14            0x5555559e59d0   93824997022160
r15            0x19     25
rip            0x0      0x0
eflags         0x10246  [ PF ZF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff6b10c4f in ?? () from /usr/share/dotnet/shared/Microsoft.NETCore.App/6.0.10/libcoreclr.so
#2  0x00007ffff6b11e5b in ?? () from /usr/share/dotnet/shared/Microsoft.NETCore.App/6.0.10/libcoreclr.so
#3  0x00007fff7cce537c in ?? ()
#4  0x0000000000eca806 in ?? ()
#5  0x00007ffff6d128f0 in ?? () from /usr/share/dotnet/shared/Microsoft.NETCore.App/6.0.10/libcoreclr.so
#6  0x00007ffa277fdd10 in ?? ()
#7  0x0000000000000000 in ?? ()

Reproduction Steps

All my .NET Core applications have failed to start since the image was updated.

image

Expected behavior

Applications run without faults.

Actual behavior

Getting SIGSEGV on Linux when I run the application under gdb:

2022-10-12 14:04:00.870 +03:00 [DBG] [] (Td=16, User=) Orleans.Runtime.SafeTimer, Creating timer Orleans.Runtime.SafeTimerBase with dueTime=00:00:01 period=00:00:01
2022-10-12 14:04:00.889 +03:00 [INF] [] (Td=16, User=) Orleans.OutsideRuntimeClient, ---------- Initializing OutsideRuntimeClient on srv-sam5-002 at 10.0.54.45 Client Id = *cli/5df74aa6 ----------
2022-10-12 14:04:00.893 +03:00 [INF] [] (Td=16, User=) Orleans.OutsideRuntimeClient, ---------- Starting OutsideRuntimeClient with runtime Version='3.6.5. Commit Hash: 54382a15b653f80784520c9055614cbf429a1b16+54382a15b653f80784520c9055614cbf429a1b16 (Release).' in AppDomain=<AppDomain.Id=1, AppDomain.FriendlyName=SamApp>
2022-10-12 14:04:00.951 +03:00 [DBG] [] (Td=8, User=) Orleans.Runtime.SafeTimer, Creating timer Orleans.Runtime.AsyncTaskSafeTimer with dueTime=00:01:00 period=00:01:00

Thread 21 ".NET ThreadPool" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffa277fe700 (LWP 19511)]
0x0000000000000000 in ?? ()

Regression?

No response

Known Workarounds

No response

Configuration

Distributor ID: Debian
Description:    Debian GNU/Linux 9.13 (stretch)
Release:        9.13
Codename:       stretch

Docker version 19.03.15, build 99e3ed8919

Other information

No response

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

NikolaMilosavljevic commented 1 year ago

[Triage] @dmitrykolchev , can you provide concrete repro steps?

ghost commented 1 year ago

This issue has been marked needs-author-action and may be missing some important information.

dmitrykolchev commented 1 year ago

@NikolaMilosavljevic Unfortunately I can't; it's a fairly large system. I found a workaround to get the system up and running: I publish the system in self-contained deployment mode with the linux-x64 runtime. When I use framework-dependent deployment mode with the portable target runtime, all applications fail to start without any .NET runtime exception. As I wrote above, this behavior appeared after switching to the Docker image with ASP.NET Core 6.0.10.

botinko commented 1 year ago

Hello! It looks like we hit the same problem. We were able to collect useful diagnostic data:

(lldb) bt
* thread #1, name = 'ServiceTitan.Fo', stop reason = signal SIGSEGV: invalid address (fault address: 0x8b8)
  * frame #0: 0x00007ffff75d66a4 libcoreclr.so`SVR::GCHeap::AssignHeap(alloc_context*) [inlined] SVR::GCHeap::GetHeap(n=18) at gc.cpp:44896:33
    frame #1: 0x00007ffff75d6696 libcoreclr.so`SVR::GCHeap::AssignHeap(acontext=0x000055555568d668) at gc.cpp:44889
    frame #2: 0x00007ffff75d64f7 libcoreclr.so`SVR::GCHeap::Alloc(this=<unavailable>, context=0x000055555568d668, size=8184, flags=66) at gc.cpp:43628:9
    frame #3: 0x00007ffff74a4127 libcoreclr.so`AllocateSzArray(MethodTable*, int, GC_ALLOC_FLAGS) at gchelpers.cpp:228:48
    frame #4: 0x00007ffff74a40bf libcoreclr.so`AllocateSzArray(pArrayMT=<unavailable>, cElements=1020, flags=GC_ALLOC_CONTAINS_REF | GC_ALLOC_PINNED_OBJECT_HEAP) at gchelpers.cpp:0
    frame #5: 0x00007ffff7315a48 libcoreclr.so`PinnedHeapHandleTable::AllocateHandles(unsigned int) at appdomain.cpp:150:35
    frame #6: 0x00007ffff7315a24 libcoreclr.so`PinnedHeapHandleTable::AllocateHandles(this=0x00005555556d7d60, nRequested=<unavailable>) at appdomain.cpp:454
    frame #7: 0x00007ffff7316c89 libcoreclr.so`BaseDomain::AllocateObjRefPtrsInLargeTable(this=0x0000555555674e90, nRequested=<unavailable>, ppLazyAllocate=0x00007fff7e204610) at appdomain.cpp:896:55
    frame #8: 0x00007ffff7317b25 libcoreclr.so`SystemDomain::LoadBaseSystemClasses(this=<unavailable>) at appdomain.cpp:1454:33
    frame #9: 0x00007ffff731776d libcoreclr.so`SystemDomain::Init(this=0x00007ffff79938c0) at appdomain.cpp:1266:5
    frame #10: 0x00007ffff77632ac libcoreclr.so`EEStartupHelper() at ceemain.cpp:990:33
    frame #11: 0x00007ffff77626b9 libcoreclr.so`EEStartup() [inlined] EEStartup(this=<unavailable>, p=<unavailable>)::$_0::operator()(void*) const at ceemain.cpp:1153:9
    frame #12: 0x00007ffff77625bc libcoreclr.so`EEStartup() at ceemain.cpp:1155
    frame #13: 0x00007ffff776251d libcoreclr.so`EnsureEEStarted() at ceemain.cpp:321:17
    frame #14: 0x00007ffff736085e libcoreclr.so`CorHost2::Start(this=0x00005555555a50e0) at corhost.cpp:101:14
    frame #15: 0x00007ffff7313c45 libcoreclr.so`::coreclr_initialize(exePath=<unavailable>, appDomainFriendlyName=<unavailable>, propertyCount=11, propertyKeys=<unavailable>, propertyValues=<unavailable>, hostHandle=0x00007fffffffd818, domainId=0x00007fffffffd814) at unixinterface.cpp:251:16
    frame #16: 0x00007ffff79dd66f libhostpolicy.so`coreclr_t::create(libcoreclr_path=<unavailable>, exe_path="/app/ServiceTitan.Forms.Api", app_domain_friendly_name="clrhost", properties=0x000055555558d308, inst=nullptr) at coreclr.cpp:58:10
    frame #17: 0x00007ffff79edba1 libhostpolicy.so`(anonymous namespace)::create_coreclr() at hostpolicy.cpp:74:23
    frame #18: 0x00007ffff79ed45a libhostpolicy.so`::corehost_main(argc=1, argv=0x00007fffffffddc8) at hostpolicy.cpp:426:10
    frame #19: 0x00007ffff7a46d14 libhostfxr.so`fx_muxer_t::handle_exec_host_command(std::string const&, host_startup_info_t const&, std::string const&, std::unordered_map<known_options, std::vector<std::string, std::allocator<std::string> >, known_options_hash, std::equal_to<known_options>, std::allocator<std::pair<known_options const, std::vector<std::string, std::allocator<std::string> > > > > const&, int, char const**, int, host_mode_t, bool, char*, int, int*) at fx_muxer.cpp:146:20
    frame #20: 0x00007ffff7a46be7 libhostfxr.so`fx_muxer_t::handle_exec_host_command(std::string const&, host_startup_info_t const&, std::string const&, std::unordered_map<known_options, std::vector<std::string, std::allocator<std::string> >, known_options_hash, std::equal_to<known_options>, std::allocator<std::pair<known_options const, std::vector<std::string, std::allocator<std::string> > > > > const&, int, char const**, int, host_mode_t, bool, char*, int, int*) [inlined] (anonymous namespace)::read_config_and_execute(host_command=<unavailable>, host_info=<unavailable>, app_candidate=error: summary string parsing error, opts=0x00007ffff79ed3c0, new_argc=1, new_argv=0x00007fffffffddc8, mode=<unavailable>, is_sdk_command=<unavailable>, out_buffer=<unavailable>, buffer_size=<unavailable>, required_buffer_size=<unavailable>) at fx_muxer.cpp:533
    frame #21: 0x00007ffff7a46940 libhostfxr.so`fx_muxer_t::handle_exec_host_command(host_command=<unavailable>, host_info=<unavailable>, app_candidate=<unavailable>, opts=<unavailable>, argc=<unavailable>, argv=<unavailable>, argoff=1, mode=apphost, is_sdk_command=<unavailable>, result_buffer=0x0000000000000000, buffer_size=0, required_buffer_size=0x0000000000000000) at fx_muxer.cpp:1018
    frame #22: 0x00007ffff7a45449 libhostfxr.so`fx_muxer_t::execute(host_command=error: summary string parsing error, argc=1, argv=0x00007fffffffddc8, host_info=0x00007fffffffdb90, result_buffer=0x0000000000000000, buffer_size=0, required_buffer_size=0x0000000000000000) at fx_muxer.cpp:579:18
    frame #23: 0x00007ffff7a4093b libhostfxr.so`::hostfxr_main_startupinfo(argc=1, argv=0x00007fffffffddc8, host_path="/app/ServiceTitan.Forms.Api", dotnet_root="/usr/share/dotnet", app_path="/app/ServiceTitan.Forms.Api.dll") at hostfxr.cpp:61:12
    frame #24: 0x0000555555564a25 ServiceTitan.Forms.Api`exe_start(argc=1, argv=0x00007fffffffddc8) at corehost.cpp:235:18
    frame #25: 0x0000555555564ef0 ServiceTitan.Forms.Api`main(argc=1, argv=0x00007fffffffddc8) at corehost.cpp:301:21
    frame #26: 0x00007ffff7ac3d0a libc.so.6`__libc_start_main + 234
    frame #27: 0x0000555555558d7a ServiceTitan.Forms.Api`_start + 41
(lldb) dumpstack
OS Thread Id: 0xfd3 (1)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libcoreclr.so!SVR::GCHeap::AssignHeap(alloc_context*) + 0xf4 [/__w/1/s/src/coreclr/gc/gc.cpp:44896]
Child-SP         RetAddr          Caller, Callee
00007FFFFFFFD3A0 00007ffff75d64f7 libcoreclr.so!SVR::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) + 0xd7 [/__w/1/s/src/coreclr/gc/gc.h:233], calling libcoreclr.so!SVR::GCHeap::AssignHeap(alloc_context*) [/__w/1/s/src/coreclr/gc/gc.cpp:44887]
00007FFFFFFFD3E0 00007ffff74a4127 libcoreclr.so!AllocateSzArray(MethodTable*, int, GC_ALLOC_FLAGS) + 0x137 [/__w/1/s/src/coreclr/vm/gchelpers.cpp:239]
00007FFFFFFFD440 00007ffff7315a48 libcoreclr.so!PinnedHeapHandleTable::AllocateHandles(unsigned int) + 0x1a8 [/__w/1/s/src/coreclr/vm/appdomain.cpp:0], calling libcoreclr.so!AllocateObjectArray(unsigned int, TypeHandle, int) [/__w/1/s/src/coreclr/vm/gchelpers.cpp:806]
00007FFFFFFFD480 00007ffff7316c89 libcoreclr.so!BaseDomain::AllocateObjRefPtrsInLargeTable(int, Object***) + 0xc9 [/__w/1/s/src/coreclr/vm/appdomain.cpp:0], calling libcoreclr.so!PinnedHeapHandleTable::AllocateHandles(unsigned int) [/__w/1/s/src/coreclr/vm/appdomain.cpp:385]
00007FFFFFFFD4D0 00007ffff7317b25 libcoreclr.so!SystemDomain::LoadBaseSystemClasses() + 0x1e5 [/__w/1/s/src/coreclr/vm/appdomain.cpp:1458], calling libcoreclr.so!Module::AllocateRegularStaticHandles(AppDomain*) [/__w/1/s/src/coreclr/vm/ceeload.cpp:2739]
00007FFFFFFFD4F0 00007ffff731776d libcoreclr.so!SystemDomain::Init() + 0x22d [/__w/1/s/src/coreclr/vm/threads.inl:42], calling libcoreclr.so!SystemDomain::LoadBaseSystemClasses() [/__w/1/s/src/coreclr/vm/appdomain.cpp:1390]
00007FFFFFFFD560 00007ffff77632ac libcoreclr.so!EEStartupHelper() + 0x6ac [/__w/1/s/src/coreclr/vm/ceemain.cpp:998], calling libcoreclr.so!SystemDomain::Init() [/__w/1/s/src/coreclr/vm/appdomain.cpp:1212]
00007FFFFFFFD5F0 00007ffff77626b9 libcoreclr.so!EEStartup() + 0x169 [/__w/1/s/src/coreclr/pal/inc/pal.h:4656], calling libcoreclr.so!EEStartupHelper() [/__w/1/s/src/coreclr/vm/ceemain.cpp:616]
00007FFFFFFFD660 00007ffff776251d libcoreclr.so!EnsureEEStarted() + 0x12d [/__w/1/s/src/coreclr/inc/volatile.h:182], calling libcoreclr.so!EEStartup() [/__w/1/s/src/coreclr/vm/ceemain.cpp:1137]
00007FFFFFFFD680 00007ffff736085e libcoreclr.so!CorHost2::Start() + 0x6e [/__w/1/s/src/coreclr/vm/corhost.cpp:102], calling libcoreclr.so!EnsureEEStarted() [/__w/1/s/src/coreclr/vm/ceemain.cpp:278]
00007FFFFFFFD6A0 00007ffff7313c45 libcoreclr.so!coreclr_initialize + 0x135 [/__w/1/s/src/coreclr/dlls/mscoree/unixinterface.cpp:0]
00007FFFFFFFD730 00007ffff79dd66f libhostpolicy.so!coreclr_t::create(std::string const&, char const*, char const*, coreclr_property_bag_t const&, std::unique_ptr<coreclr_t, std::default_delete<coreclr_t> >&) + 0x30f [/root/runtime/src/native/corehost/hostpolicy/coreclr.cpp:0]
00007FFFFFFFD850 00007ffff79edba1 libhostpolicy.so!(anonymous namespace)::create_coreclr() + 0x181 [/root/runtime/src/native/corehost/hostpolicy/hostpolicy.cpp:0], calling libhostpolicy.so!coreclr_t::create(std::string const&, char const*, char const*, coreclr_property_bag_t const&, std::unique_ptr<coreclr_t, std::default_delete<coreclr_t> >&) [/root/runtime/src/native/corehost/hostpolicy/coreclr.cpp:29]
00007FFFFFFFD880 00007ffff79ed45a libhostpolicy.so!corehost_main + 0x9a [/root/runtime/src/native/corehost/hostpolicy/hostpolicy.cpp:0], calling libhostpolicy.so!(anonymous namespace)::create_coreclr() [/root/runtime/src/native/corehost/hostpolicy/hostpolicy.cpp:48]
00007FFFFFFFD960 00007ffff7a46d14 libhostfxr.so!fx_muxer_t::handle_exec_host_command(std::string const&, host_startup_info_t const&, std::string const&, std::unordered_map<known_options, std::vector<std::string, std::allocator<std::string> >, known_options_hash, std::equal_to<known_options>, std::allocator<std::pair<known_options const, std::vector<std::string, std::allocator<std::string> > > > > const&, int, char const**, int, host_mode_t, bool, char*, int, int*) + 0x714 [/root/runtime/src/native/corehost/fxr/fx_muxer.cpp:0]
00007FFFFFFFDA90 00007ffff7a45449 libhostfxr.so!fx_muxer_t::execute(std::string, int, char const**, host_startup_info_t const&, char*, int, int*) + 0x299 [/root/runtime/src/native/corehost/fxr/fx_muxer.cpp:579], calling libhostfxr.so!fx_muxer_t::handle_exec_host_command(std::string const&, host_startup_info_t const&, std::string const&, std::unordered_map<known_options, std::vector<std::string, std::allocator<std::string> >, known_options_hash, std::equal_to<known_options>, std::allocator<std::pair<known_options const, std::vector<std::string, std::allocator<std::string> > > > > const&, int, char const**, int, host_mode_t, bool, char*, int, int*) [/root/runtime/src/native/corehost/fxr/fx_muxer.cpp:1001]
00007FFFFFFFDB30 00007ffff7a5d5a5 libhostfxr.so!trace::setup() + 0x35 [/root/runtime/src/native/corehost/hostmisc/trace.cpp:26], calling libhostfxr.so!pal::getenv(char const*, std::string*) [/root/runtime/src/native/corehost/hostmisc/pal.unix.cpp:848]
00007FFFFFFFDB70 00007ffff7a4093b libhostfxr.so!hostfxr_main_startupinfo + 0xab [/root/runtime/src/native/corehost/fxr/hostfxr.cpp:0], calling libhostfxr.so!fx_muxer_t::execute(std::string, int, char const**, host_startup_info_t const&, char*, int, int*) [/root/runtime/src/native/corehost/fxr/fx_muxer.cpp:556]
00007FFFFFFFDBE0 0000555555564a25 ServiceTitan.Forms.Api!exe_start(int, char const**) + 0x415 [/root/runtime/src/native/corehost/corehost.cpp:0]
00007FFFFFFFDC50 0000555555559215 ServiceTitan.Forms.Api!trace::setup() + 0x35 [/root/runtime/src/native/corehost/hostmisc/trace.cpp:26], calling ServiceTitan.Forms.Api!pal::getenv(char const*, std::string*) [/root/runtime/src/native/corehost/hostmisc/pal.unix.cpp:848]
00007FFFFFFFDC90 0000555555564ef0 ServiceTitan.Forms.Api!main + 0x90 [/root/runtime/src/native/corehost/corehost.cpp:301], calling ServiceTitan.Forms.Api!exe_start(int, char const**) [/root/runtime/src/native/corehost/corehost.cpp:97]
00007FFFFFFFDCD0 00007ffff7ac3d0a libc.so.6!__libc_start_main + 0xea
00007FFFFFFFDDA0 0000555555558d7a ServiceTitan.Forms.Api!_start + 0x29, calling ServiceTitan.Forms.Api!__libc_start_main
# dotnet --info
.NET SDK (reflecting any global.json):
 Version:   6.0.402
 Commit:    6862418796

Runtime Environment:
 OS Name:     debian
 OS Version:  11
 OS Platform: Linux
 RID:         debian.11-x64
 Base Path:   /usr/share/dotnet/sdk/6.0.402/

global.json file:
  Not found

Host:
  Version:      6.0.10
  Architecture: x64
  Commit:       5a400c212a

.NET SDKs installed:
  6.0.402 [/usr/share/dotnet/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 6.0.10 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 6.0.10 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

Download .NET:
  https://aka.ms/dotnet-download

Learn about .NET Runtimes and SDKs:
  https://aka.ms/dotnet/runtimes-sdk-info
# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    4
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7452 32-Core Processor
Stepping:                        0
CPU MHz:                         2345.606
BogoMIPS:                        4691.21
Hypervisor vendor:               Microsoft
Virtualization type:             full
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        8 MiB
L3 cache:                        64 MiB
NUMA node0 CPU(s):               0-7
NUMA node1 CPU(s):               8-15
NUMA node2 CPU(s):               16-23
NUMA node3 CPU(s):               24-31

App may crash at startup or after some activity.

We use COMPlus_GCHeapCount=8 and COMPlus_GCHeapHardLimitPercent=0x5A

After unsetting COMPlus_GCHeapCount, the issue went away.
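The workaround above can be sketched as a small launcher wrapper. This is only an illustration, not something posted in the thread: the function name `run_without_heap_count` is made up, and it clears both the `COMPlus_` and `DOTNET_` spellings of the setting because both appear in this issue.

```shell
#!/bin/sh
# Sketch: launch a command with any explicit GC heap count cleared.
# run_without_heap_count is a hypothetical helper name.
run_without_heap_count() {
    # Use a subshell so the caller's environment is left untouched.
    (
        unset COMPlus_GCHeapCount DOTNET_GCHeapCount
        "$@"
    )
}

# Example with a stand-in command that reports what the child process sees.
COMPlus_GCHeapCount=8
export COMPlus_GCHeapCount
run_without_heap_count sh -c 'echo "heap count: ${COMPlus_GCHeapCount:-unset}"'
```

In a container, the real application command would take the place of the `sh -c` stand-in. Other GC settings such as COMPlus_GCHeapHardLimitPercent are passed through unchanged.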

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

Issue Details
Author: dmitrykolchev
Assignees: -
Labels: `area-GC-coreclr`, `needs-further-triage`
Milestone: -

mangod9 commented 1 year ago

Hi @botinko @dmitrykolchev, did you start hitting this issue after moving from 6.0.9 to 6.0.10, or from a previous major version? Would you be able to share a dump privately so we can investigate? Thanks.

botinko commented 1 year ago

@mangod9 It started happening after upgrading from the latest .NET 5 to 6.0.10. I cannot provide a dump, because the core dump generated by the CLR (via COMPlus_DbgEnableMiniDump) doesn't contain the needed data: it shows all threads in SIGABRT and I can't find the problematic stack. I got all the information above by running my app under lldb. Maybe it's possible to create a dump from the lldb session, but I couldn't find out how. The dump also contains sensitive data. I think it will be possible to create a repro, but it will require additional work. I am still able to reproduce the issue on our staging environment and gather the needed data.

mangod9 commented 1 year ago

Looks like the issue is happening on startup, judging from the stack you provided. So if you create a simple hello world app, does the issue repro in that container (and on that hardware)? Also, it looks like it's failing to find a heap. Are you possibly running on hardware with multiple NUMA nodes, and are you restricting CPUs for the container?

botinko commented 1 year ago

Model name:                      AMD EPYC 7452 32-Core Processor
NUMA node0 CPU(s):               0-7
NUMA node1 CPU(s):               8-15
NUMA node2 CPU(s):               16-23
NUMA node3 CPU(s):               24-31

For this pod we don't set a CPU limit, but we do set COMPlus_GCHeapCount=8. I also found a very similar issue: https://github.com/dotnet/runtime/issues/67008

mangod9 commented 1 year ago

Ok thanks. Yeah this seems to be a dupe of https://github.com/dotnet/runtime/issues/67008. Looks like there are only 8 heaps per your config but there is a discrepancy where the GC is still trying to find Heap 18. Guessing if you restrict the CPUs to 8 on the container it might work around the issue.

dmitrykolchev commented 1 year ago

@mangod9

> hi @botinko @dmitrykolchev did you start hitting this issue after moving from 6.0.9 to 6.0.10

We have no issues with 6.0.9 or any previous release of the .NET 6 runtime. This problem started on October 11, 2022, when the Docker image was updated to 6.0.10. We test nightly builds every day, so I know for sure the date when the applications started to crash.

mangod9 commented 1 year ago

Looking through the changes in 6.0.10, I don't see anything that stands out as a likely cause. Since you are observing that all applications fail when deployed as framework-dependent, perhaps you observe the same behavior for a simple web app? We will try to repro as well with that Docker image.

mangod9 commented 1 year ago

@dmitrykolchev, we haven't been able to repro it locally. Are you able to share a dump or a container with a repro? Thanks.

Iliya-usov commented 1 year ago

Hi! It looks like we have a similar issue with Server GC on Linux. We set DOTNET_GCHeapCount=2 and DOTNET_GCNoAffinitize=1. Unsetting DOTNET_GCHeapCount fixes the problem.

stack.txt

(lldb) bt all
* thread #1, stop reason = signal SIGSEGV
  * frame #0: 0x00007f43fe30aaa4 libcoreclr.so`SVR::gc_heap::balance_heaps_uoh(alloc_context*, unsigned long, int) [inlined] SVR::GCHeap::GetHeap(n=12) at gc.cpp:44894:33
    frame #1: 0x00007f43fe30aa96 libcoreclr.so`SVR::gc_heap::balance_heaps_uoh(acontext=<unavailable>, alloc_size=<unavailable>, generation_num=4) at gc.cpp:17324:24
    frame #2: 0x00007f43fe30adfb libcoreclr.so`SVR::gc_heap::allocate_more_space(acontext=0x00007fff8256b7e0, size=4120, flags=66, alloc_generation_number=4) at gc.cpp:17440:30
    frame #3: 0x00007f43fe3357dd libcoreclr.so`SVR::gc_heap::allocate_uoh_object(this=0x000055eb09bbcbb0, jsize=<unavailable>, flags=66, gen_number=<unavailable>, alloc_bytes=0x000055eb09b9ca10) at gc.cpp:39367:11
    frame #4: 0x00007f43fe3392d8 libcoreclr.so`SVR::GCHeap::Alloc(this=<unavailable>, context=<unavailable>, size=4120, flags=66) at gc.cpp:43651:34
    frame #5: 0x00007f43fe207017 libcoreclr.so`AllocateSzArray(MethodTable*, int, GC_ALLOC_FLAGS) at gchelpers.cpp:228:48
    frame #6: 0x00007f43fe206faf libcoreclr.so`AllocateSzArray(pArrayMT=<unavailable>, cElements=512, flags=GC_ALLOC_CONTAINS_REF | GC_ALLOC_PINNED_OBJECT_HEAP) at gchelpers.cpp:0
    frame #7: 0x00007f43fe078a48 libcoreclr.so`PinnedHeapHandleTable::AllocateHandles(unsigned int) at appdomain.cpp:150:35
    frame #8: 0x00007f43fe078a24 libcoreclr.so`PinnedHeapHandleTable::AllocateHandles(this=0x000055eb09abde10, nRequested=<unavailable>) at appdomain.cpp:454:23
    frame #9: 0x00007f43fe2948b6 libcoreclr.so`GlobalStringLiteralMap::AddStringLiteral(EEStringData*) [inlined] PinnedHeapHandleBlockHolder::PinnedHeapHandleBlockHolder(this=<unavailable>, pOwner=<unavailable>, nCount=1) at appdomain.hpp:593:26

You can get the coredump here https://drive.google.com/file/d/1-suS-vhS8RE9jJZf8ek-AfH69msm8CXY/view?usp=share_link

mangod9 commented 1 year ago

ok, thanks. Yeah the multi-NUMA + DOTNET_GCHeapCount is understood. Looks like the original issue is probably different though.
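To check quickly whether a host matches the combination described here (an explicit GC heap count on a machine with more than one NUMA node), a check along the following lines might help. It is a rough heuristic written for this thread, not a runtime diagnostic: the function name `check_numa_heap_count` is hypothetical, and it only looks at the `NUMA node(s):` line of `lscpu`-style output.

```shell
#!/bin/sh
# Sketch: flag an explicit GC heap count on a multi-NUMA machine.
# Reads lscpu-style text on stdin; prints "risky" or "ok".
# check_numa_heap_count is a hypothetical helper name.
check_numa_heap_count() {
    heap_count="$1"   # configured GCHeapCount value, or empty if not set
    # Extract the number after "NUMA node(s):", stripping padding.
    numa_nodes=$(awk -F: '/^NUMA node\(s\)/ { gsub(/[ \t]/, "", $2); print $2 }')
    if [ -n "$heap_count" ] && [ "${numa_nodes:-1}" -gt 1 ]; then
        echo "risky"
    else
        echo "ok"
    fi
}

# Example using values reported earlier in this thread (4 NUMA nodes, 8 heaps).
printf 'CPU(s):                          32\nNUMA node(s):                    4\n' |
    check_numa_heap_count 8
```

On a live host the input would come from `lscpu` itself, for example `lscpu | check_numa_heap_count "$DOTNET_GCHeapCount"`. An "ok" result does not rule out the crash, since the original report in this issue may have a different cause.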

loop-evgeny commented 12 months ago

We're getting crashes like this with GCHeapCount set to 1-9 without setting GCNoAffinitize but with ServerGarbageCollection=true on certain servers (but not others). Reproducible with a trivial console EXE like this:

heapcount.csproj:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net6.0</TargetFramework>
    <ServerGarbageCollection>true</ServerGarbageCollection>
  </PropertyGroup>
</Project>

Program.cs:

System.Console.WriteLine("Hello, World!");

I build a self-contained EXE in my dev VM with /usr/bin/dotnet publish -c Release --self-contained -r linux-x64 -o bin/published and upload the output to several servers. On 2 of them it crashes, like this:

evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=2 ./heapcount 
Segmentation fault (core dumped)
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=1 ./heapcount 
Hello, World!
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=1 ./heapcount 
Hello, World!
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=1 ./heapcount 
Hello, World!
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=1 ./heapcount 
Hello, World!
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=2 ./heapcount 
Segmentation fault (core dumped)
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=2 ./heapcount 
Segmentation fault (core dumped)
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=2 ./heapcount 
Segmentation fault (core dumped)
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=3 ./heapcount 
Segmentation fault (core dumped)
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=4 ./heapcount 
Hello, World!
Segmentation fault (core dumped)
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=4 ./heapcount 
Hello, World!
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=4 ./heapcount 
Hello, World!
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=4 ./heapcount 
Hello, World!
Segmentation fault (core dumped)
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=5 ./heapcount 
Hello, World!
evgeny@medusa:~/heapcount$ DOTNET_GCHeapCount=5 ./heapcount 
Hello, World!
Segmentation fault (core dumped)

Core file: heapcount-2-segfault.zip

On this particular machine (medusa) it never seems to crash with GCHeapCount=1, always with GCHeapCount=2, sometimes with 4. On another it usually crashes with GCHeapCount=1. On another it does not crash for any GCHeapCount I've tried.
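
The manual runs above can be automated with a small loop. The following is a minimal sketch, not part of the original report: the function name `sweep_heapcount` and its argument order are made up for illustration, and it assumes a segfault shows up as exit status 139 (128 + SIGSEGV), which is what a POSIX shell reports for a child killed by signal 11. The values are passed to `DOTNET_GCHeapCount` unchanged; the GC configuration docs describe that environment variable as a hex value, so it is worth checking how a setting such as 10 is actually interpreted.

```shell
# Run a binary several times for each DOTNET_GCHeapCount value and
# report how many of those runs ended in a segmentation fault.
# Usage (hypothetical): sweep_heapcount ./heapcount 10 1 2 3 4 5
sweep_heapcount() (
    binary=$1   # path to the published executable
    runs=$2     # number of runs per heap count
    shift 2     # remaining arguments are the heap counts to try
    for count in "$@"; do
        crashes=0
        i=0
        while [ "$i" -lt "$runs" ]; do
            status=0
            DOTNET_GCHeapCount=$count "$binary" >/dev/null 2>&1 || status=$?
            # 139 = 128 + 11 (SIGSEGV); other failures are not counted
            if [ "$status" -eq 139 ]; then
                crashes=$((crashes + 1))
            fi
            i=$((i + 1))
        done
        printf 'GCHeapCount=%s crashes=%d/%d\n' "$count" "$crashes" "$runs"
    done
)
```

Because the crash is intermittent for some values (for example 4 and 5 above), a larger run count gives a more trustworthy picture than a single attempt per value.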

Machines where it crashes are an Intel Xeon 6256 and an AMD EPYC 7302, with 512 GB RAM each. A machine on which it doesn't crash is a Xeon E5-1650 with 256 GB RAM. All are running Ubuntu 22.04.3. No Docker involved.

Build machine's dotnet --info:

.NET SDK:
 Version:   7.0.402
 Commit:    791db8e2d8

Runtime Environment:
 OS Name:     linuxmint
 OS Version:  20
 OS Platform: Linux
 RID:         linux-x64
 Base Path:   /usr/share/dotnet/sdk/7.0.402/

Host:
  Version:      7.0.12
  Architecture: x64
  Commit:       4a824ef37c

.NET SDKs installed:
  6.0.415 [/usr/share/dotnet/sdk]
  7.0.402 [/usr/share/dotnet/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 3.1.32 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 6.0.23 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 7.0.12 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 3.1.32 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.23 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 7.0.12 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

Other architectures found:
  None

Environment variables:
  Not set

global.json file:
  Not found

loop-evgeny commented 11 months ago

Ping @mangod9 (not sure if you get notifications for all comments on this issue)

mangod9 commented 11 months ago

hey @loop-evgeny, so I assume this only repros on machines with multiple NUMA nodes? Have you checked with .NET 7?

loop-evgeny commented 11 months ago

@mangod9 Not according to lscpu. That reports NUMA node(s): 1 on both the servers on which I've seen the crash (as well as on those where it doesn't crash).
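
For anyone comparing machines, the NUMA node count can be pulled out of `lscpu` output with a one-line filter. This is only a sketch: the helper name `numa_nodes` is hypothetical, and it assumes the English `NUMA node(s):` label quoted above, so forcing the C locale when calling `lscpu` is advisable.

```shell
# Print the NUMA node count from lscpu-style text read on stdin.
# Usage (hypothetical): LC_ALL=C lscpu | numa_nodes
numa_nodes() {
    # Match only the "NUMA node(s):" summary line, not "NUMA node0 CPU(s):"
    awk -F: '/^NUMA node\(s\)/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}
```

The filter prints nothing if the label is missing, so an empty result means the line was not found rather than that the machine has zero nodes.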

I have not tried with .NET 7, but just tried .NET 8 RC 1 a few times and have not seen a crash, so it seems like this might be fixed!

mangod9 commented 11 months ago

yeah we made some fixes related to this in .NET 7. If this is blocking we can look into porting back to 6, but .NET 8 which is LTS should be released next month.

loop-evgeny commented 11 months ago

We've seen it crash reliably with DOTNET_GCHeapCount from 2 to 6, sometimes with DOTNET_GCHeapCount from 7 to 9 and so far never with DOTNET_GCHeapCount=10, so it's not blocking us immediately, but without understanding the problem, I'm a bit concerned that it may yet start crashing on new servers or under new circumstances. Do you have some idea of what triggers it and how we can be sure to avoid it on .NET 6?

loop-evgeny commented 11 months ago

Just found that on another server, with an Intel Xeon Gold 6210U CPU (still 1 NUMA node), it crashes with a heap count of up to 13. It seems to work with 14. That perfectly demonstrates what I was concerned about above. We can, of course, set it to 14, or 15, or 20, but how do we know what value is safe?