dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.84k stars 4.62k forks source link

`ThreadPoolWorkQueue..ctor` randomly will never return #104322

Open colejohnson66 opened 1 month ago

colejohnson66 commented 1 month ago

Description

Periodically, I've experienced a deadlock on program start through ThreadPoolWorkQueue..ctor. According to the debugger, that constructor is stuck at RefreshLoggingEnabled, and it never continues. dotnet-dump provided this information, implying it's stuck in a "dynamic helper frame".

Loading core dump: .\dump_20240702_143359.dmp ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.
> threads                                                                                                                                                                                                                   
*0 0x0BD4 (3028)
 1 0x6600 (26112)
 2 0x7554 (30036)
 3 0x6B30 (27440)
 4 0x283C (10300)
 5 0x7D70 (32112)
> clrstack -all                                                                                                                                                                                                             
OS Thread Id: 0xbd4
        Child SP               IP Call Site
000000D131D7C148 00007ff913170af4 [HelperMethodFrame_1OBJ: 000000d131d7c148] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000D131D7C2A0 00007FF85A9F791A System.Diagnostics.Tracing.EventSource.Initialize(System.Guid, System.String, System.String[]) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 1619]
000000D131D7C350 00007FF85AA04D39 System.Diagnostics.Tracing.FrameworkEventSource..ctor() [/_/src/coreclr/System.Private.CoreLib/System.Private.CoreLib.Generators/Generators.EventSourceGenerator/FrameworkEventSource.g.cs @ 12]
000000D131D7C3A0 00007FF85AA04D87 System.Diagnostics.Tracing.FrameworkEventSource..cctor() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/FrameworkEventSource.cs @ 14]
000000D131D7D530 00007ff87532b9e3 [DynamicHelperFrame: 000000d131d7d530]
000000D131D7D690 00007FF85A8F2883 System.Threading.TimerQueueTimer.Change(UInt32, UInt32) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Timer.cs @ 559]
000000D131D7D720 00007FF85A8F2776 System.Threading.TimerQueueTimer..ctor(System.Threading.TimerCallback, System.Object, UInt32, UInt32, Boolean) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Timer.cs @ 516]
000000D131D7D770 00007FF85A8F3451 System.Threading.Timer.TimerSetup(System.Threading.TimerCallback, System.Object, UInt32, UInt32, Boolean) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Timer.cs @ 905]    
000000D131D7D7E0 00007FF85A8F3253 System.Threading.Timer..ctor(System.Threading.TimerCallback, System.Object, System.TimeSpan, System.TimeSpan) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Timer.cs @ 861]
000000D131D7D840 00007FF8157B5B0A Microsoft.Data.Sqlite.SqliteConnectionFactory..ctor() [/_/src/Microsoft.Data.Sqlite.Core/SqliteConnectionFactory.cs @ 28]
000000D131D7D8F0 00007FF8157B58AC Microsoft.Data.Sqlite.SqliteConnectionFactory..cctor() [/_/src/Microsoft.Data.Sqlite.Core/SqliteConnectionFactory.cs @ 12]
000000D131D7E658 00007ff87532b9e3 [HelperMethodFrame: 000000d131d7e658]
000000D131D7E760 00007FF8157B57C3 Microsoft.Data.Sqlite.SqliteConnection.set_ConnectionString(System.String) [/_/src/Microsoft.Data.Sqlite.Core/SqliteConnection.cs @ 151]
000000D131D7E7F0 00007FF8157B4669 Microsoft.Data.Sqlite.SqliteConnection..ctor(System.String) [/_/src/Microsoft.Data.Sqlite.Core/SqliteConnection.cs @ 120]
000000D131D7E830 00007FF8157B44BF <removed>.Logging.Logger.OpenDatabaseCore(System.String, Boolean)
000000D131D7E8D0 00007FF8157B438B <removed>.Logging.Logger.OpenDatabaseReadWrite(System.String, Boolean ByRef)
000000D131D7E920 00007FF8157B428F <removed>.Logging.Logger.Start(System.String)
000000D131D7E990 00007FF8157B3C6E <removed>.Program+<Main>d__0.MoveNext()
000000D131D7EAA0 00007FF85AC1209A System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs @ 38]
000000D131D7EB10 00007FF8157B3B35 <removed>.Program.Main(System.String[])
000000D131D7EB80 00007FF8157B3A34 <removed>.Program.<Main>(System.String[])
OS Thread Id: 0x7d70
        Child SP               IP Call Site
000000D13297FA80 00007ff913170af4 [DebuggerU2MCatchHandlerFrame: 000000d13297fa80]
OS Thread Id: 0x6600
        Child SP               IP Call Site
000000D13237DC20 00007ff913173b94 [DynamicHelperFrame: 000000d13237dc20]
000000D13237DD80 00007FF85A8EE7D9 System.Threading.ThreadPoolWorkQueue..ctor() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 426]
000000D13237DE60 00007FF85A8E025E System.Threading.ThreadPool..cctor() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 1354]
000000D13237F000 00007ff87532b9e3 [DynamicHelperFrame: 000000d13237f000]
000000D13237F160 00007FF85A8E005A System.Threading.ThreadPool.get_CompletedWorkItemCount() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.Windows.cs @ 204]
000000D13237F190 00007FF85AA077CA System.Diagnostics.Tracing.RuntimeEventSource+<>c.<OnEventCommand>b__35_9() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs @ 110]
000000D13237F1C0 00007FF85AA0567F System.Diagnostics.Tracing.IncrementingPollingCounter.UpdateMetric() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/IncrementingPollingCounter.cs @ 57]
000000D13237F220 00007FF85A9EDCC8 System.Diagnostics.Tracing.CounterGroup.ResetCounters() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 206]
000000D13237F290 00007FF85A9ED9FE System.Diagnostics.Tracing.CounterGroup.EnableTimer(Single) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 158]
000000D13237F2F0 00007FF85A9ED74F System.Diagnostics.Tracing.CounterGroup.OnEventSourceCommand(System.Object, System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 73]
000000D13237F370 00007FF85A9F9AD9 System.Diagnostics.Tracing.EventSource.DoCommand(System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2656]
000000D13237F3E0 00007FF85A9F98A2 System.Diagnostics.Tracing.EventSource.SendCommand(System.Diagnostics.Tracing.EventListener, System.Diagnostics.Tracing.EventProviderType, Int32, System.Diagnostics.Tracing.EventCommand, 
Boolean, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords, System.Collections.Generic.IDictionary`2<System.String,System.String>) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2574]
000000D13237F460 00007FF85A9FED8B System.Diagnostics.Tracing.EventSource+OverrideEventProvider.OnControllerCommand(System.Diagnostics.Tracing.ControllerCommand, System.Collections.Generic.IDictionary`2<System.String,System.String>, Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2438]
000000D13237F4C0 00007FF85A9F01BF System.Diagnostics.Tracing.EventPipeEventProvider.HandleEnableNotification(System.Diagnostics.Tracing.EventProvider, Byte*, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 58]
000000D13237F510 00007FF85A9F51A7 System.Diagnostics.Tracing.EventProviderImpl.ProviderCallback(System.Diagnostics.Tracing.EventProvider, Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1310]
000000D13237F570 00007FF85A9F027E System.Diagnostics.Tracing.EventPipeEventProvider.Callback(Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*, Void*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 68]
> syncblk                                                                                                                                                                                                                   
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
    1 0000020A0011AEC8            3         1 0000020A3F0F6130 6600   1   0000020a41c0b7f8 System.Collections.Generic.List`1[[System.WeakReference`1[[System.Diagnostics.Tracing.EventSource, System.Private.CoreLib]], System.Private.CoreLib]]
-----------------------------
Total           1
CCW             0
RCW             0
ComClassFactory 0
Free            0

Reproduction Steps

Unsure, but when it happens, restarting the program/debug session will usually fix it.

Expected behavior

Program starts the thread pool queue just fine and no deadlock occurs.

Actual behavior

Deadlock due to the lock never being released.

Regression?

No response

Known Workarounds

No response

Configuration

Other information

No response

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

colejohnson66 commented 1 month ago

I'm not sure if I'm doing this right, but dumping the memory at the IP of the "dynamic helper frame" shows repeated 32 byte blocks of memory:

> db 7ff913173b94
00007ff913173b94: c3 cd 2e c3 0f 1f 84 00 00 00 00 00 4c 8b d1 b8  ............L...
00007ff913173ba4: e1 01 00 00 f6 04 25 08 03 fe 7f 01 75 03 0f 05  ......%.....u...
00007ff913173bb4: c3 cd 2e c3 0f 1f 84 00 00 00 00 00 4c 8b d1 b8  ............L...
00007ff913173bc4: e2 01 00 00 f6 04 25 08 03 fe 7f 01 75 03 0f 05  ......%.....u...
00007ff913173bd4: c3 cd 2e c3 0f 1f 84 00 00 00 00 00 4c 8b d1 b8  ............L...
00007ff913173be4: e3 01 00 00 f6 04 25 08 03 fe 7f 01 75 03 0f 05  ......%.....u...
00007ff913173bf4: c3 cd 2e c3 0f 1f 84 00 00 00 00 00 4c 8b d1 b8  ............L...
00007ff913173c04: e4 01 00 00 f6 04 25 08 03 fe 7f 01 75 03 0f 05  ......%.....u...

C3 is RET, yet it never returns? It looks like the IP is inside ntdll.dll?

> lm
... trimmed ...
00007FF9130D0000 00217000 C:\Windows\System32\ntdll.dll
Total image size: 72019968
> sos u 7ff913173b94
Unmanaged code
mangod9 commented 1 month ago

Hey @colejohnson66, are you able to share a repro for this issue?

colejohnson66 commented 1 month ago

We unfortunately do not have anything we could share, and I haven’t narrowed anything down. I still have the dump if you’d like anything from it, and I can do some more debugging if necessary.

I wonder if #104457 is related? It being intermittent would be an indication of some race condition.

KalleOlaviNiemitalo commented 1 month ago

OS thread 0x6600 has locked EventListener.EventListenersLock, and I think it is waiting for the static constructor of FrameworkEventSource.

OS thread 0xbd4 has entered the static constructor of FrameworkEventSource, and is waiting to lock EventListener.EventListenersLock.

I wonder what caused EventPipeEventProvider.Callback to be called. Had you connected some diagnostic tools to the process before the deadlock occurred?

EventSource.SendCommand has code to enqueue commands if the EventSource hasn't finished initializing yet. Could the deadlock be fixed by making RuntimeEventSource also enqueue commands until FrameworkEventSource finishes initializing?

Alternatively, could the ThreadPool counters implemented in RuntimeEventSource.OnEventCommand return 0 if ThreadPool hasn't been initialized yet? Rather than wait for ThreadPool and FrameworkEventSource to be initialized.

KalleOlaviNiemitalo commented 1 month ago

I wonder if https://github.com/dotnet/runtime/issues/104457 is related?

No, that's for TraceSource, and this deadlock is with EventSource.

colejohnson66 commented 1 month ago

This only started recently, so maybe Rider changed how something in the backend in the 2024.2 EAPs? I'm not ruling that possibility out.

My reproduction steps are: click the "debug" button, observe the debugger start, but not the GUI, click "pause" to investigate, then see the deadlock. Every time, it's the same stack trace. After a few times of this happening, I used dotnet-dump instead of pausing, and I got the same stack trace - the one above.

As for program execution, we have an async Task Main() (giving an async state machine), and the very first thing it does is start our SQLite-based logger. This calls into Microsoft.Data.SQLite to open the database. That stack trace on OS thread 0xBD4 is the first thing our program executes:

internal static class Program
{
    [STAThread]
    public static async Task Main(string[] args)
    {
        Logger.Instance.Start("Logs.db");

I'll see if the callback parameters have any information next time this happens.

colejohnson66 commented 1 month ago

I can't access the stack variables in the debugger:

An IL variable is not available at the current native IP. (0x80131304). The error code is CORDBG_E_IL_VAR_NOT_AVAILABLE, or 0x80131304.

What I can see is that the event counter being incremented in threadpool-completed-items-count. This then runs the static constructor of ThreadPool, which constructs the ThreadPoolWorkQueue.

Also, if the Timer used by SQLiteConnectionFactory manages to win the race to initialize the thread pool, the program starts fine. It has this stack trace:

new ThreadPoolWorkQueue()
static ThreadPool()
TimerQueue.SetTimer()
TimerQueue.EnsureTimerFiresBy()
TimerQueueTimer.Change()
new TimerQueueTimer()
Timer.TimerSetup()
new Timer()
new SqliteConnectionFactory()
static SqliteConnectionFactory()
SqliteConnection.set_ConnectionString()
new SqliteConnection()
Logger.OpenDatabaseCore()
Logger.OpenDatabaseReadWrite()
Logger.Start()
async Program.Main()
AsyncMethodBuilderCore.Start<{removed}.Program.<Main>d__0>()
Program.Main()
Program.<Main>()
KalleOlaviNiemitalo commented 1 month ago

This may be a duplicate of https://github.com/dotnet/runtime/issues/93175.

dotnet-policy-service[bot] commented 14 hours ago

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti See info in area-owners.md if you want to be subscribed.