dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

Segmentation fault during app startup on .NET 6 #107876

Open blushingpenguin opened 3 days ago

blushingpenguin commented 3 days ago

Description

Running an app in a docker container on microk8s 1.30 fails with a segmentation fault. The segmentation fault does not occur if the RAM limit is removed from the container.

Reproduction Steps

I can reproduce this with several of our (similar) applications, which are web APIs, but I don't have a good isolated reproduction.

Expected behavior

Does not crash

Actual behavior

Output from the app:

...some startup messages...
[2024-09-16 13:58:08 DBG] MassTransit.Transports.BusDepot Starting bus instances: IBus
[2024-09-16 13:58:08 DBG] MassTransit Starting bus: rabbitmqs://hoppy.rabbitmq.svc.cluster.local/dev
Segmentation fault (core dumped)

Collecting a core dump and then generating a backtrace with lldb:

(lldb) bt
* thread #1, name = 'Vendeq.Jobs.Api', stop reason = signal SIGSEGV
  * frame #0: 0x00007be9fb2a6bbb libcoreclr.so`SVR::GCHeap::Alloc(this=<unavailable>, context=0x00007be8500d9168, size=152, flags=2) at gc.cpp:43631:47
    frame #1: 0x00007be9fb174877 libcoreclr.so`AllocateSzArray(MethodTable*, int, GC_ALLOC_FLAGS) at gchelpers.cpp:228:48
    frame #2: 0x00007be9fb17480f libcoreclr.so`AllocateSzArray(pArrayMT=<unavailable>, cElements=16, flags=GC_ALLOC_CONTAINS_REF) at gchelpers.cpp:0
    frame #3: 0x00007be9fafe5e23 libcoreclr.so`ThreadStaticHandleTable::AllocateHandles(unsigned int) at appdomain.cpp:524:35
    frame #4: 0x00007be9fafe5e07 libcoreclr.so`ThreadStaticHandleTable::AllocateHandles(this=0x00007be8140017f0, nRequested=16) at appdomain.cpp:610:19
    frame #5: 0x00007be9fb0ed42d libcoreclr.so`ThreadStatics::AllocateAndInitTLM(ModuleIndex, ThreadLocalBlock*, Module*) [inlined] ThreadLocalBlock::AllocateStaticFieldObjRefPtrs(this=0x00007be8500d9558, nRequested=16, ppLazyAllocate=0x00007be814000f00) at threadstatics.cpp:358:55
    frame #6: 0x00007be9fb0ed3ef libcoreclr.so`ThreadStatics::AllocateAndInitTLM(ModuleIndex, ThreadLocalBlock*, Module*) [inlined] ThreadLocalBlock::AllocateThreadStaticHandles(this=0x00007be8500d9558, pModule=<unavailable>, pThreadLocalModule=0x00007be814000ef0) at threadstatics.cpp:326:9
    frame #7: 0x00007be9fb0ed3e2 libcoreclr.so`ThreadStatics::AllocateAndInitTLM(index=(m_dwIndex = 0), pThreadLocalBlock=0x00007be8500d9558, pModule=<unavailable>) at threadstatics.cpp:651:24
    frame #8: 0x00007be9fb1906f1 libcoreclr.so`JIT_GetGCThreadStaticBase_Helper(pMT=0x00007be981fb0508) at jithelpers.cpp:1760:46
    frame #9: 0x00007be981684d35
    frame #10: 0x00007be9816695e2
    frame #11: 0x00007be9fb2f0657 libcoreclr.so`CallDescrWorkerInternal at unixasmmacrosamd64.inc:850
    frame #12: 0x00007be9fb12614e libcoreclr.so`DispatchCallSimple(unsigned long*, unsigned int, unsigned long, unsigned int) at callhelpers.cpp:67:5
    frame #13: 0x00007be9fb1260f5 libcoreclr.so`DispatchCallSimple(pSrc=<unavailable>, numStackSlotsToCopy=<unavailable>, pTargetAddress=<unavailable>, dwDispatchCallSimpleFlags=<unavailable>) at callhelpers.cpp:220:9
    frame #14: 0x00007be9fb13ecd2 libcoreclr.so`ThreadNative::KickOffThread_Worker(ptr=<unavailable>) at comsynchronizable.cpp:157:5
    frame #15: 0x00007be9fb0eac0a libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchInner(pCallState=<unavailable>) at threads.cpp:7321:5
    frame #16: 0x00007be9fb0eac08 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7365:9
    frame #17: 0x00007be9fb0eabc2 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchOuter(this=<unavailable>, pParam=<unavailable>)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const at threads.cpp:7523:13
    frame #18: 0x00007be9fb0eabc2 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7525:9
    frame #19: 0x00007be9fb0eab53 libcoreclr.so`ManagedThreadBase_DispatchOuter(pCallState=0x00007be9129ffd20) at threads.cpp:7549:5
    frame #20: 0x00007be9fb0eb20d libcoreclr.so`ManagedThreadBase::KickOff(void (*)(void*), void*) [inlined] ManagedThreadBase_FullTransition(pTarget=<unavailable>, args=<unavailable>, filterType=ManagedThread)(void*), void*, UnhandledExceptionLocation) at threads.cpp:7569:5
    frame #21: 0x00007be9fb0eb1f5 libcoreclr.so`ManagedThreadBase::KickOff(pTarget=<unavailable>, args=<unavailable>)(void*), void*) at threads.cpp:7604:5
    frame #22: 0x00007be9fb13eda7 libcoreclr.so`ThreadNative::KickOffThread(pass=0x00007be8500d9110) at comsynchronizable.cpp:228:9
    frame #23: 0x00007be9fb484b0e libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x00007be8500db610) at thread.cpp:1862:16
    frame #24: 0x00007be9fb804ac3 libc.so.6`___lldb_unnamed_symbol3481 + 755
    frame #25: 0x00007be9fb896850 libc.so.6`___lldb_unnamed_symbol3866 + 11

Regression?

No response

Known Workarounds

Removing or increasing the container RAM limit seems to avoid the problem.
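
For example, a sketch of the adjusted block (the raised value is illustrative, not an exact tested figure):

    resources:
      limits:
        memory: 512Mi   # illustrative; raising or removing this limit avoids the crash
      requests:
        memory: 250Mi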

Configuration

Microsoft.AspNetCore.App 6.0.33 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 6.0.33 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

Running in a Docker container based on ubuntu:22.04, on microk8s 1.30, on a server running Ubuntu 24.04 (amd64).

Other information

No response

dotnet-policy-service[bot] commented 3 days ago

Tagging subscribers to this area: @mangod9. See info in area-owners.md if you want to be subscribed.

mangod9 commented 3 days ago

Hello @blushingpenguin, thanks for reporting the issue. Can you please clarify what config you are specifying for the memory limits? Also, .NET 6 will be going out of support soon, so could you check whether this repros on 8 for you?

blushingpenguin commented 3 days ago

@mangod9 I can't check on .NET 8 unfortunately, as the app isn't (yet) compatible. For memory:

    resources:
      limits:
        memory: 300Mi
      requests:
        memory: 250Mi
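
For context, if the documented container default applies here (when a container memory limit is set and no explicit value is configured, the GC heap hard limit defaults to the larger of 20 MB and 75% of that limit), the block above works out to roughly:

    # assumed default: heap hard limit = max(20 MB, 0.75 * container memory limit)
    # memory: 300Mi  ->  ~225 MiB effective GC heap hard limit
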
mangod9 commented 3 days ago

Ok, thanks. Is this a regression in a 6.0 servicing release, or is this a new application that is being tried with memory limits? Would it be possible to share a dump of the failure?

blushingpenguin commented 3 days ago

@mangod9 I could share a dump privately. The regression came from adding newer nodes to the cluster and migrating workloads to them (16 cores / 32 threads vs 6 cores / 12 threads). I suspect the difference in core count is involved somewhere -- the base OS of the new servers is Ubuntu 24.04 vs 22.04, but the containers are exactly the same, and they crash on some nodes and not others.
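
If the per-core behavior of server GC is the variable (by default it creates one GC heap per logical processor visible to the process, so the 32-thread nodes start with far more heaps than the 12-thread ones), one way to test that, as a guess rather than a confirmed diagnosis, would be to pin the heap count:

    env:
      - name: DOTNET_GCHeapCount   # value is parsed as hex, but 4 reads the same either way
        value: "4"

Whether the heap count is actually the trigger here is only an assumption.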

I've tried playing around with GC settings in a container, and the two that make a difference are:

DOTNET_GCHeapHardLimit=10048576

If the value is set below 10 MB (tried in 1 MB increments) then I get a segfault again, but anything higher works. Setting the container resource limit to 330Mi also works.

When setting DOTNET_gcServer=0, things work without limits (probably not so interesting).
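
For reference, a sketch of passing these through the pod spec as container environment variables (values illustrative; note that per the GC configuration docs, numeric GC values supplied via environment variables are read as hexadecimal, whereas runtimeconfig.json values are decimal):

    env:
      - name: DOTNET_gcServer
        value: "0"              # workstation GC
      - name: DOTNET_GCHeapHardLimit
        value: "0xC800000"      # illustrative 200 MiB hard limit, expressed in hex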