dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.93k stars 4.64k forks source link

Deadlock in GC #105780

Open jhudsoncedaron opened 1 month ago

jhudsoncedaron commented 1 month ago

Description

Active worker thread is hung up with a diagnostic call stack that looks something like this:

ntdll.dll!syscall_gate KERNELBASE.dll!WaitForMultipleObjects ??? CORINFO_HELP_NEWARR_1_VC inlined!StringBuilder.ctor Cedaron.Common.HL7.DLL!Cedaron.Common.HL7.HL7Segment.HL7Decode

All managed stacks as follows:

Main thread:
    ntdll.dll!00007ff9df400d24()    Unknown
    KERNELBASE.dll!00007ff9dc9d464c()   WaitForSingleObject
    [Managed to Native Transition]  
    System.Private.CoreLib.dll!System.Threading.Monitor.Wait(object obj, int millisecondsTimeout) Line 156  C#
    System.Private.CoreLib.dll!System.Threading.ManualResetEventSlim.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 561   C#
    System.Private.CoreLib.dll!System.Threading.Tasks.Task.SpinThenBlockingWait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 3072    C#
    System.Private.CoreLib.dll!System.Threading.Tasks.Task.InternalWaitCore(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 3007    C#
    System.Private.CoreLib.dll!System.Threading.Tasks.Task.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 2768    C#
    System.Private.CoreLib.dll!System.Threading.Tasks.Task.Wait(int millisecondsTimeout) Line 2728  C#
    [Waiting on Async Operation, double-click or press enter to view Async Call Stacks] 
    System.IO.FileSystem.Watcher.dll!System.IO.FileSystemWatcher.WaitForChanged(System.IO.WatcherChangeTypes changeType, int timeout) Line 631  C#
    Cedaron.Interface.Engine.dll!Cedaron.Interface.Engine.Program.Main(string[] args) Line 300  C#

Background task scheduler thread:
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    [Managed to Native Transition]  
    Cedaron.Common.dll!Cedaron.Common.Portable.MultiWaiter.Win32WaitForAnyObject(nint[] waithandle, int tv_sec, int tv_msec)    Unknown
    Cedaron.Common.dll!Cedaron.Common.Portable.MultiWaiter.WaitForAnyRead(System.Collections.Generic.IList<Cedaron.Common.Either<System.Net.Sockets.Socket, Cedaron.Common.Portable.IMultiWaitable>> waitfor, int tv_sec, int tv_msec)  Unknown
    Cedaron.Interface.Engine.dll!Cedaron.Interface.Engine.InterfaceEngine.BackgroundJobRunner()

Background keepalive thread:
    ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    [Managed to Native Transition]  
    System.Private.CoreLib.dll!System.Threading.Thread.Sleep(int millisecondsTimeout) Line 367  C#
    Cedaron.Interface.FormBuilder.dll!Cedaron.Interface.FormBuilder.APIConfiguration.RefreshToken() Unknown

.NET TP Gate:
    ntdll.dll!00007ff9df400d24()    Unknown
    KERNELBASE.dll!00007ff9dc9d464c()   WaitForSingleObject
    [Managed to Native Transition]  
>   System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOneNoCheck(int millisecondsTimeout) Line 128 C#
    System.Private.CoreLib.dll!System.Threading.PortableThreadPool.GateThread.GateThreadStart() Line 48 C#

.NET ThreadPool IO:
    ntdll.dll!00007ff9df4030c4()    Unknown
    KERNELBASE.dll!00007ff9dcab2724()   Unknown
    [Managed to Native Transition]  
>   System.Private.CoreLib.dll!System.Threading.PortableThreadPool.IOCompletionPoller.Poll() Line 188   C#

.NET Timer:
    ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    [Managed to Native Transition]  
>   System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOneNoCheck(int millisecondsTimeout) Line 128 C#
    System.Private.CoreLib.dll!System.Threading.TimerQueue.TimerThread() Line 87    C#

Connection Listener:
    ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    [Managed to Native Transition]  
    Cedaron.Common.dll!Cedaron.Common.Portable.MultiWaiter.Win32WaitForAnyObject(nint[] waithandle, int tv_sec, int tv_msec)    Unknown
    Cedaron.Common.dll!Cedaron.Common.Portable.MultiWaiter.WaitForAnyRead(System.Collections.Generic.IList<Cedaron.Common.Either<System.Net.Sockets.Socket, Cedaron.Common.Portable.IMultiWaitable>> waitfor, int tv_sec, int tv_msec)  Unknown
    Cedaron.HL7.Client.dll!Cedaron.HL7.Client.SocketTransport.GetConnected(System.Action monitor, int monitorinterval)  Unknown
>   Cedaron.Interface.Engine.dll!Cedaron.Interface.Receivers.TCP.Cedaron.Interface.Declarations.IInterfaceReceiver.ReceiveLoop()
    Cedaron.Interface.Engine.dll!Cedaron.Interface.Engine.Interface.ReceiveLoop()

.NET Finalizer:
>   ntdll.dll!00007ff9df400d24()    Unknown
    KERNELBASE.dll!00007ff9dc9d464c()   WaitForSingleObject
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET Server GC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

Only Active Worker Thread:
    ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    [Managed to Native Transition]  
>   Cedaron.Common.HL7.dll!Cedaron.Common.HL7.HL7Segment.HL7Decode(string data, int offset, int length, char msh1, string msh2, bool translateunknownesc) Line 391  C#
    Cedaron.Common.HL7.dll!Cedaron.Common.HL7.HL7Segment.HL7Segment(string msgdata, int offset, int length, char msh1, string msh2, bool translateunknownesc) Line 300  C#
    Cedaron.Common.HL7.dll!Cedaron.Common.HL7.HL7Message.HL7Message(byte[] message, int offset, int length, bool translateunknownesc) Line 523  C#
    Cedaron.Interface.FormBuilder.dll!Cedaron.Interface.FormBuilder.ADT.UpdateManualPromoteRecords()
    Cedaron.Interface.FormBuilder.dll!Cedaron.Interface.FormBuilder.ADT.ProcessADT()
    Cedaron.Interface.FormBuilder.dll!Cedaron.Interface.FormBuilder.ADT.Cedaron.Interface.Declarations.IMessageProcessor.ProcessMessage()
    Cedaron.Interface.FormBuilder.dll!Cedaron.Interface.FormBuilder.DatabaseArchiver.Cedaron.Interface.Declarations.IMessageProcessor.ProcessMessage()
    Cedaron.Interface.Engine.dll!Cedaron.Interface.Engine.Interface.ProcessMessage()
    Cedaron.Interface.Engine.dll!Cedaron.Interface.Receivers.TCP.Cedaron.Interface.Declarations.IInterfaceReceiver.ReceiveLoop.AnonymousMethod__0(Cedaron.HL7.Client.HL7BlockingStream stream)
    Cedaron.Interface.Engine.dll!Cedaron.Interface.Receivers.TCP.Cedaron.Interface.Declarations.IInterfaceReceiver.ReceiveLoop()
    Cedaron.Interface.Engine.dll!Cedaron.Interface.Engine.Interface.ReceiveLoop()

.NET Server GC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET Server GC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET Debugger:
>   ntdll.dll!00007ff9df400d24()    Unknown
    KERNELBASE.dll!00007ff9dc9d464c()   WaitForSingleObject
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET Server GC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET Server GC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET Event Pipe:
    [Waiting on a lock, load symbols for ntdll.dll to show thread lock information] 
>   ntdll.dll!00007ff9df403cc4()    Unknown
    ntdll.dll!00007ff9df379f09()    Unknown
    ntdll.dll!00007ff9df3774c2()    Unknown
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET Server GC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET Server GC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET Server GC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET TP Worker:
    ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    [Managed to Native Transition]  
>   System.Private.CoreLib.dll!System.Threading.LowLevelLifoSemaphore.WaitForSignal(int timeoutMs) Line 140 C#
    System.Private.CoreLib.dll!System.Threading.LowLevelLifoSemaphore.Wait(int timeoutMs, bool spinWait) Line 65    C#
    System.Private.CoreLib.dll!System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() Line 100    C#

.NET BGC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET BGC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET BGC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET BGC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET BGC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET BGC:
>   ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

.NET TP Worker:
    ntdll.dll!00007ff9df400254()    Unknown
    KERNELBASE.dll!00007ff9dc9ad71e()   WaitForMultipleObjects
    [Managed to Native Transition]  
>   System.Private.CoreLib.dll!System.Threading.LowLevelLifoSemaphore.WaitForSignal(int timeoutMs) Line 140 C#
    System.Private.CoreLib.dll!System.Threading.LowLevelLifoSemaphore.Wait(int timeoutMs, bool spinWait) Line 65    C#
    System.Private.CoreLib.dll!System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() Line 100    C#
    [Native to Managed Transition]  
    kernel32.dll!00007ff9dde74cb0() Unknown
    ntdll.dll!00007ff9df3deceb()    Unknown

Reproduction Steps

No controlled reproduction available. We currently get the issue about three times per two weeks.

Expected behavior

new StringBuilder() does not hang forever.

Actual behavior

new StringBuilder() calls WaitForMultipleObjects, which does not return

Regression?

Yes; Issue is less than three months old.

Known Workarounds

No response

Configuration

.NET Runtime 8.0.7 via dotnet publish -r win-x64 OS: Windows Server Core; probably 2022

Other information

We have a full memory dump but can't release it publicly. This is a production instance and PHI is on the stack.

jkotas commented 1 month ago

Is this with regular CoreCLR with a JIT or with native AOT? RhpNewArray is a native AOT specific method, but the stacktraces you have shared look like regular CoreCLR with a JIT.

jhudsoncedaron commented 1 month ago

@jkotas : I'm sorry I don't understand the difference. When I did a code search for CORINFO_HELP_NEWARR_1_VC ; it seems to be mapped to RhpNewArray by RyuJIT. The environment is whatever dotnet publish -r win-x64 spits out.

jkotas commented 1 month ago

cc @VSadov

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

mangod9 commented 1 month ago

Hi @jhudsoncedaron, you mention that this started occurring recently, so do you observe that this is a regression in 8.0.7?

jhudsoncedaron commented 1 month ago

@mangod9 : Barring exotic hypotheses, yes; however we jumped straight from 8.0.5 to 8.0.7 and are not sure of the exact start date of the problem. Support was slow to report the issue to engineering. In addition, we just turned on gcDynamic a couple of months ago.

However exotic hypotheses are on the table; the data being processed as seen in the dump is trying to allocate about 5,000 StringBuilder objects in 50ms.

mangod9 commented 1 month ago

Hey @jhudsoncedaron, are you able to privately share a dump to investigate further?

jhudsoncedaron commented 1 month ago

@mangod9 : I will ask the guy capable of approving immediately. If that doesn't work, there's other ways of getting necessary information from a dump.

jhudsoncedaron commented 1 month ago

@mangod9 : Dump upload is not happening. So it comes down to tell me what to look for and I can look for it.

jhudsoncedaron commented 1 month ago

So I feel the need to explain a few things.

We are not sitting idle on this. At once other things are happening. 1) We are trying to switch that one instance to workstation GC to see if it clears it (this will hit production first). 2) We are optimizing the allocation-hot loop (in fact the very one with the blocked thread) to allocate less stuff. This should make it happen less often. 3) We looked at the wait handle code; and found a very unlikey pathway that could corrupt wait handle references. This requires two billion exceptions to be thrown from the same finally block where the only possible exception is OutOfMemoryException allocating a List enumerator. On triggering the fault it would overflow the reference counter inside SafeWaitHandle; but I don't see how it could cause this particular fault. (Fix: replace that particular foreach loop with a counted loop). 4) We are trying to find a way to reproduce this in a dev environment. It's not looking promising. 5) The dump contains Medical Records data. There's no way any production dump is getting uploaded anywhere until all other options are exhausted. I am reasonably certain a paid support case will happen first.

jkotas commented 1 month ago

Could you please load the dump into windbg debugger and the stacktraces of all threads from windbg? The stacktraces above seems to be from Visual Studio and they are missing low-level details that we need to diagnose this.

Also, could you please run !threads command in windbg and share its output?

jhudsoncedaron commented 1 month ago

Output of !threads and ~e!dumpstacks

Unfortunately !stacks is broken. What's it mean "PsActiveProcessHead!"?

stacks-lightredacted.txt

jkotas commented 1 month ago

Could you please share the output of ~*k and !sos.threads?

~*k has more precise stacktraces. dumpstack prints approximate stacktraces and it is hard to tell what's exactly going on from those.

!sos.threads is explicitly invoking threads command from sos debugger extension. The !threads output that you have shared is from built-in windbg !threads command. It is missing .NET runtime specific information. If the sos extension is not present in your windbg, it probably means that your windbg is too old. Try installing latest windbg from https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/debugger-download-tools .

jhudsoncedaron commented 1 month ago

@jkotas : You're in luck, both windbg and sos.dll were downloaded and installed anew (for the first time on this SSD) today. Here's the output of your additional commands. I didn't find anything worth redacting this time around.

stacks2-full.txt

jkotas commented 1 month ago

stacks2-full.txt

This shows that all threads are waiting for GC to finish, but there does not seem to be any thread making progress to finish the GC.

One possible explanation of this situation is use-after-free bug with Win32 HANDLEs somewhere else in the program. If some other part of the program inadvertently waited on the HANDLE used by the GC, it can explain why the GC gets stuck like this.

It may be useful to find the thread that's holding the ThreadStore lock that the .NET EventPipe thread is trying to acquire. It may give us some clues. Could you please find the OwningThread from the CRTICAL_SECTION argument passed in into RtlEnterCriticalSection?

   1  Id: 1568.1be8 Suspend: 0 Teb: 000000cf`a93fd000 Unfrozen ".NET EventPipe"
 # Child-SP          RetAddr               Call Site
00 000000cf`a987f558 00007ff9`df379f09     ntdll!NtWaitForAlertByThreadId+0x14
01 000000cf`a987f560 00007ff9`df3774c2     ntdll!RtlpEnterCriticalSectionContended+0x339
02 000000cf`a987f680 00007ff9`d52f8c6a     ntdll!RtlEnterCriticalSection+0x42
03 000000cf`a987f6b0 00007ff9`d5364498     coreclr!CrstBase::Enter+0x5a [D:\a\_work\1\s\src\coreclr\vm\crst.cpp @ 328] 
04 (Inline Function) --------`--------     coreclr!ThreadStore::Enter+0x10 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 5179] 
05 000000cf`a987f6e0 00007ff9`d5368e0b     coreclr!ThreadSuspend::LockThreadStore+0x90 [D:\a\_work\1\s\src\coreclr\vm\threadsuspend.cpp @ 1894] 
06 (Inline Function) --------`--------     coreclr!ThreadStore::LockThreadStore+0x5 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 5201] 
07 (Inline Function) --------`--------     coreclr!StateHolder<&ThreadStore::LockThreadStore,&ThreadStore::UnlockThreadStore>::Acquire+0x5 [D:\a\_work\1\s\src\coreclr\inc\holder.h @ 354] 
08 (Inline Function) --------`--------     coreclr!StateHolder<&ThreadStore::LockThreadStore,&ThreadStore::UnlockThreadStore>::{ctor}+0x5 [D:\a\_work\1\s\src\coreclr\inc\holder.h @ 340] 
09 000000cf`a987f710 00007ff9`d5368631     coreclr!ThreadStore::AddThread+0x1f [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 5227] 
0a 000000cf`a987f740 00007ff9`d5658c20     coreclr!SetupUnstartedThread+0x59 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 879] 
0b 000000cf`a987f770 00007ff9`d565b411     coreclr!ep_rt_thread_create+0x100 [D:\a\_work\1\s\src\coreclr\vm\eventing\eventpipe\ep-rt-coreclr.h @ 855] 
0c 000000cf`a987f830 00007ff9`d5655030     coreclr!sample_profiler_enable+0x45 [D:\a\_work\1\s\src\native\eventpipe\ep-sample-profiler.c @ 204] 
jhudsoncedaron commented 1 month ago

0:001> !locks

CritSec +36c3c6f0 at 000002be36c3c6f0 WaiterWoken No LockCount 1 RecursionCount 1 OwningThread 1bf0 EntryCount 0 ContentionCount 42 *** Locked

Scanned 48 critical sections

Oh good there's exactly one held critical section; must be this one. OwningThread is 1bf0 which is a .NET Server GC thread.

Confirmed: 1be8 coreclr!CrstBase::Enter this->mCriticalSection:

LockCount -6
RecursionCount 1
OwningThread 0x1bf0
LockSemaphore 0xffffffffffffffff
SpinCount 0x200007cf

I think the event pipe is a red herring. We don't use it. This thread appears to be from when we tried to run dotnet-stacks on the already deadlocked process.

This shows that all threads are waiting for GC to finish, but there does not seem to be any thread making progress to finish the GC.

Which was my original analysis of the stacks from Visual Studio.

One possible explanation of this situation is use-after-free bug with Win32 HANDLEs somewhere else in the program.

That particular explanation would require the HANDLE that was used after it was freed to have been allocated before the GC's event HANDLE. I am having a hard time believing this as I would expect the GC to eagerly allocate its own resources at process startup.

Correct me if I'm wrong but I think this is testable hypothesis. If there's an early handle in managed code that is being used after it's freed, while being allocated before the GC runs the first time, I should be able to move the problem to a different handle by putting the following at the top of main: GC.Collect(); GC.WaitForPendingFinalizers();.

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

jkotas commented 1 month ago

I have noticed that stacks2-full.txt shows 8x .NET Server GC threads, but only 6x .NET BGC threads. Did we lose or fail to create two .NET BGC threads?

This looks like GC threading issue. cc @dotnet/gc

jkotas commented 1 month ago

@jhudsoncedaron You can test whether the problem is related to background GC (BGC) by trying to disable it: https://learn.microsoft.com/en-us/dotnet/core/runtime-config/garbage-collector#background-gc

jhudsoncedaron commented 1 month ago

@jkotas : Yes I noticed that too. Since the number was even I didn't think much of it.

jhudsoncedaron commented 1 month ago

Current status of the running code has been set to: Server GC ON DATAS ON Background GC OFF (confirmed by dump)

Now we wait, I guess. If it doesn't recur it was likely a fault in background GC.

markples commented 1 month ago

SVR threads are created during initialization, but BGC threads are created on demand and DATAS is on, so the 6 vs 8 could be normal behavior. However, the BGC join object is created during initialization. Disabling DATAS (but leaving BGC enabled) would provide another interesting data point on this. Also GC traces will probably be useful in determining context for any possible GC issues here.

markples commented 1 month ago

DATAS - DOTNET_GCDynamicAdaptationMode or System.GC.DynamicAdaptationMode

mangod9 commented 1 month ago

@jhudsoncedaron, did the issue not reproduce without BGC? Also since this is 8, assume you have explicitly enabled DATAS?

jhudsoncedaron commented 1 month ago

@mangod9 : Correct, DATAS is enabled explicitly.

I have not had a report that it reproduced again after setting BGC off; however not enough time has elapsed to be sure. Note that business day assumption is not valid; so we do not yet have 50% confidence.

The way the math works out, we need two more weeks for 99% confidence (assuming normal distribution), which is an amazing coincidence because I will be on vacation the next two weeks. Operations is monitoring the situation; however it is likely this thread will not be updated until I return.

A mitigation is scheduled to be installed Tuesday Evening 8/6. This mitigation reduces the number of allocations required in the allocation-hot loop (the stack that locked up allocating a StringBuilder) by removing the "small substrings are cheap" assumption. The number of allocations was reduced by almost two-thirds.

mangod9 commented 1 month ago

Perhaps also trying with DATAS off but BGC on would be helpful. We have made quite a few changes to DATAS in 9, but wondering if it has shown reasonable improvements for your app since you have enabled it in 8.

jhudsoncedaron commented 1 month ago

@mangod9 : That's next on the list to try; however this entails switching back to workstation GC. Trying to enable server GC without DATAS will invariably lock up the server. The only way this server works is with the ability to shuffle memory between running processes on demand.

jhudsoncedaron commented 2 weeks ago

Status Update: No more incidents occurred after turning off background GC.

mangod9 commented 2 weeks ago

so just to confirm its with DATAS enabled but BGC disabled?

jhudsoncedaron commented 2 weeks ago

@mangod9 : Correct; Current status is DATAS enabled, BGC disabled, no more deadlocks.

Maoni0 commented 2 weeks ago

@jhudsoncedaron and I got in contact via email and will go from there.

Maoni0 commented 1 week ago

I did some debugging with @jhudsoncedaron yesterday (thanks so much for agreeing to do the debugging session!) and GC did try to create 7 BGC threads but one of the BGC threads is simply not there yet we do have a Thread object for it, and the thread creation call returned success. so I'm still looking at that part. this caused the deadlock since we are supposed to have 7 threads joining but only got 6.

cshung commented 2 days ago

@jhudsoncedaron

The instrumented binary is ready and can be accessed at https://github.com/Maoni0/clrgc/tree/main/issues/105780. Detailed instructions on how to use it can be found at https://github.com/Maoni0/clrgc/blob/main/README.md.

Please let me know if you need any further assistance or information regarding this.