dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Process not responsive dump indicates garbage collection #110350

Open tornie2 opened 1 day ago

tornie2 commented 1 day ago

Description

After upgrading to .NET 9, random processes freeze and become completely unresponsive. The processes run as Windows services on a Windows VM.

I have a dump file that I could send to you. I would rather not make it public, as it probably contains passwords.

Analyzing the dump indicates a possible problem in the garbage collector

0:000> !analyze -v


KEY_VALUES_STRING: 1

Key  : Analysis.CPU.mSec
Value: 1484

Key  : Analysis.Elapsed.mSec
Value: 5300

Key  : Analysis.IO.Other.Mb
Value: 0

Key  : Analysis.IO.Read.Mb
Value: 1

Key  : Analysis.IO.Write.Mb
Value: 1

Key  : Analysis.Init.CPU.mSec
Value: 781

Key  : Analysis.Init.Elapsed.mSec
Value: 120611

Key  : Analysis.Memory.CommitPeak.Mb
Value: 223

Key  : Analysis.Version.DbgEng
Value: 10.0.27725.1000

Key  : Analysis.Version.Description
Value: 10.2408.27.01 amd64fre

Key  : Analysis.Version.Ext
Value: 1.2408.27.1

Key  : CLR.Engine
Value: CORECLR

Key  : CLR.Version
Value: 9.0.24.52809

Key  : Failure.Bucket
Value: BREAKPOINT_80000003_coreclr.dll!WKS::GCHeap::WaitUntilGCComplete

Key  : Failure.Hash
Value: {54e9a6da-d4d0-d004-574b-4219b46bdb8d}

Key  : Failure.Source.FileLine
Value: 265

Key  : Failure.Source.FilePath
Value: D:\a\_work\1\s\src\coreclr\gc\gcee.cpp

Key  : Failure.Source.SourceServerCommand
Value: raw.githubusercontent.com/dotnet/runtime/9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3/src/coreclr/gc/gcee.cpp

Key  : Timeline.OS.Boot.DeltaSec
Value: 896327

Key  : Timeline.Process.Start.DeltaSec
Value: 17922

Key  : WER.OS.Branch
Value: rs5_release

Key  : WER.OS.Version
Value: 10.0.17763.1

Key  : WER.Process.Version
Value: 1.0.0.0

FILE_IN_CAB: SmfHaircuts.Service-2024-12-03-YB6213.DMP

NTGLOBALFLAG: 0

APPLICATION_VERIFIER_FLAGS: 0

EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 0000000000000000
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 0

FAULTING_THREAD: 00000f6c

PROCESS_NAME: SmfHaircuts.Service.dll

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.

EXCEPTION_CODE_STR: 80000003

STACK_TEXT:
0000008c61d7e028 00007ffcbdba0f33 : 0000000000000000 000002be90a8dab0 000002be90a8d9f0 0000008c61d7e170 : ntdll!NtWaitForSingleObject+0x14
0000008c61d7e030 00007ffb364f1c30 : 0000000000000000 00004612d1730f35 0000000000000000 0000000000000284 : KERNELBASE!WaitForSingleObjectEx+0x93
0000008c61d7e0d0 00007ffb36416915 : 0000000000000000 0000008c61d7e2d0 0000000000000804 0000008c61d7e1b0 : coreclr!WKS::GCHeap::WaitUntilGCComplete+0x30
0000008c61d7e100 00007ffb364e8328 : 00007ffad68130c0 0000000000000000 0000000000000000 0000027df97fc570 : coreclr!Thread::RareDisablePreemptiveGC+0x9d
0000008c61d7e190 00007ffb3659ea2d : 00007ffad68130c0 0000000000000000 000002be906cedf0 0000000100000000 : coreclr!JIT_ReversePInvokeEnterRare2+0x18
0000008c61d7e1c0 00007ffad7e5b718 : 0000000000000004 0000008c61d7e260 0000000000000000 00007ffcc166598d : coreclr!JIT_ReversePInvokeEnterTrackTransitions+0x9d13d
0000008c61d7e1f0 0000000000000004 : 0000008c61d7e260 0000000000000000 00007ffcc166598d 0000000000000000 : 0x00007ffad7e5b718
0000008c61d7e1f8 0000008c61d7e260 : 0000000000000000 00007ffcc166598d 0000000000000000 0000008c61d7e1f0 : 0x4
0000008c61d7e200 0000000000000000 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : 0x0000008c61d7e260

STACK_COMMAND: ~0s; .ecxr ; kb

FAULTING_SOURCE_LINE: D:\a\_work\1\s\src\coreclr\gc\gcee.cpp

FAULTING_SOURCE_FILE: D:\a\_work\1\s\src\coreclr\gc\gcee.cpp

FAULTING_SOURCE_LINE_NUMBER: 265

FAULTING_SOURCE_SRV_COMMAND: https://raw.githubusercontent.com/dotnet/runtime/9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3/src/coreclr/gc/gcee.cpp

FAULTING_SOURCE_CODE:
No source found for 'D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp'

SYMBOL_NAME: coreclr!WKS::GCHeap::WaitUntilGCComplete+30

MODULE_NAME: coreclr

IMAGE_NAME: coreclr.dll

FAILURE_BUCKET_ID: BREAKPOINT_80000003_coreclr.dll!WKS::GCHeap::WaitUntilGCComplete

OS_VERSION: 10.0.17763.1

BUILDLAB_STR: rs5_release

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

IMAGE_VERSION: 9.0.24.52809

FAILURE_ID_HASH: {54e9a6da-d4d0-d004-574b-4219b46bdb8d}

Followup: MachineOwner

Reproduction Steps

Not possible. It happens randomly.

Expected behavior

Not freezing

Actual behavior

Process completely unresponsive

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

dotnet-policy-service[bot] commented 1 day ago

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

mangod9 commented 1 day ago

Does this happen during startup?

This feels similar to https://github.com/dotnet/runtime/issues/105780. Are you able to try disabling the new GC mode by setting DOTNET_GCDynamicAdaptationMode=0?

The fix for this issue should be included in the Jan servicing release for 9.0.

tornie2 commented 22 hours ago

It is not during startup. Usually the processes run for days before this happens.

I have added this to the csproj of the exe for the process where we have seen this most often. Will this work as a temporary fix until the fix is released?

<ItemGroup>
    <RuntimeHostConfigurationOption Include="DOTNET_GCDynamicAdaptationMode" Value="0" />
</ItemGroup>
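
For comparison, the .NET garbage collector configuration documentation lists three ways to set this knob: the environment variable DOTNET_GCDynamicAdaptationMode, the runtimeconfig.json key System.GC.DynamicAdaptationMode, and the MSBuild property GarbageCollectionAdaptationMode. The following is a minimal sketch of the two project-file forms, assuming those documented names apply to the .NET 9 build in use. Whether the item above, which uses the environment-variable name as a runtimeconfig key, is picked up by the runtime is not confirmed in this thread.

```xml
<!-- Sketch only: either option should be sufficient on its own. -->

<!-- Option 1: the documented MSBuild property (0 = disabled). -->
<PropertyGroup>
    <GarbageCollectionAdaptationMode>0</GarbageCollectionAdaptationMode>
</PropertyGroup>

<!-- Option 2: the documented runtimeconfig.json key, written via a host configuration item. -->
<ItemGroup>
    <RuntimeHostConfigurationOption Include="System.GC.DynamicAdaptationMode" Value="0" />
</ItemGroup>
```

After a build, the generated runtimeconfig.json next to the executable should show the setting under configProperties, which is a quick way to check which key was actually emitted.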

mangod9 commented 21 hours ago

If it's not during startup it could be a different issue. If it has been reproing frequently then yeah disabling DOTNET_GCDynamicAdaptationMode would be worth a try. If you are able to share a dump privately that would help in confirming if it's the same issue.

tornie2 commented 10 hours ago

I hope you got an e-mail with a link to the dump file