DataDog / dd-trace-dotnet

.NET Client Library for Datadog APM
https://docs.datadoghq.com/tracing/
Apache License 2.0
444 stars 141 forks source link

`Kernel32!Pss*Snapshot` Thread Capture Counts are not the same as Active Thread Counts for `runtime.dotnet.threads.count` in DD Tracer v2.49+ #6172

Open bchavez opened 2 days ago

bchavez commented 2 days ago

Hello,

I noticed there was a change in DD Tracer for .NET in v2.49 here:

Prior to PR #5304 being merged (and DD Tracers < v4.29 releases), I think the DD Runtime metric for runtime.dotnet.threads.count was more closely related to Process.Threads.Count.

After PR #5304, was merged (and DD Tracers >= v4.29 releases), I think the DD Runtime metric for runtime.dotnet.threads.count on Windows 8.1+ platforms now come from Kernel32!Pss*Snapshot.

Lastly, there is a gate/guard that prevents Kernel32!Pss*Snapshot from P/Invoke calls below, which can create further discontinuity when looking at the same apps running on different platforms on Linux vs Windows 8.1+ platforms:

https://github.com/DataDog/dd-trace-dotnet/blob/a200693cd84f3bbf8c1c13867b323e9909b56043/tracer/src/Datadog.Trace/RuntimeMetrics/RuntimeMetricsWriter.cs#L318-L322

The source of this discontinuity seems to be related to how these actual thread counts are sourced in the NT Kernel. That is, the "Threads Captured" from calling Kernel32!Pss*Snapshot can be different from the number reported by Process.Threads.Count (eg "Active Threads").

In other words, I don't think we should assume that when calling Kernel32!Pss*Snapshot that the PSS_THREAD_INFORMATION.ThreadsCaptured is somewhat equivalent to the previous reporting of runtime.dotnet.threads.count which previously used Process.Threads.Count "Active Threads" to report threading metrics:

https://github.com/DataDog/dd-trace-dotnet/blob/a200693cd84f3bbf8c1c13867b323e9909b56043/tracer/src/Datadog.Trace/RuntimeMetrics/ProcessSnapshotRuntimeInformation.cs#L93

https://github.com/DataDog/dd-trace-dotnet/blob/a200693cd84f3bbf8c1c13867b323e9909b56043/tracer/src/Datadog.Trace/RuntimeMetrics/RuntimeMetricsWriter.cs#L343

Basically, what I think is missing in PR #5304 is filtering the snapshot against PSS_THREAD_FLAGS_NONE here:

Ultimately, if we filter the "Captured Threads" from Kernel32!Pss*Snapshot against PSS_THREAD_FLAGS_NONE then we will more closely align DD metric runtime.dotnet.threads.count with Process.Threads.Count (eg "Active Threads"). To filter threads by PSS_THREAD_FLAGS_NONE we'd likely need to walk the snapshot handle and apply filtering & count and simply not take threadCount = threadInformation.ThreadsCaptured at face value like we do now.

Otherwise, the PSS_THREAD_INFORMATION.ThreadsCaptured we are sourcing from Kernel32!Pss*Snapshot will include both PSS_THREAD_FLAGS_NONE and PSS_THREAD_FLAGS_TERMINATED (which are not "active threads" from the NT Kernel's perspective).

Supporting Evidence

With some reverse engineering and NT Kernel debugging, we find the following supporting evidence below.

Let's consider the following simple .NET program:

using System.Diagnostics;
using static System.Console;

namespace HelloWorld
{
   internal class Program
   {
      static void Main(string[] args)
      {
         using var p = Process.GetCurrentProcess();
         WriteLine($"Hello, World! My PID={p.Id}");
         ReadLine();
      }
   }
}

We run the program above that prints out the PID of the HelloWorld.exe program. Additionally, we have two "monitoring" programs that use the Kernel32!Pss*Snapshot API:

Pointing Monitor.exe and SnapshotCPP.exe to HelloWorld.exe; and running procexp64 and Task Manager with Threads column, we find the following:

Image

We observe:

So right out of the gate, we are finding a slight discontinuity of the actual number of threads for the given process. Kernel32!Pss*Snapshot "Captured Threads" vs Process.Threads.Count.

Let's dig a bit more, let's try to find where Kernel32!Pss*Snapshot and procexp64 and Task Manager with Threads column are getting these thread counts...

πŸš€ πŸ’ A small trip into Kernel space ring0

Breaking WinDbg into the NT Kernel; our target for posterity:

Image

!process 0 0 we find HelloWorld.exe:

Image

kd> dt nt!_EPROCESS ffffc48741c30080:

Image

Image

And we find the _EPROCESS.ActiveThreads = 4 value field at offset 0x5f0;

Cool. So maybe a good guess is that the NT Kernel is sourcing its reporting of "Active Threads" aka Process.Threads.Count from the _EPROCESS NT Kernel structure. Well, we have accounted for 4 threads, but we are still short 1 terminated thread. Where is it?

What about this ONE PSS_THREAD_FLAGS_TERMINATED thread TID: 4348; where in the NT Kernel is this accounting information being kept?

🌊 β›΅ Traversing _EPROCESS.ThreadListHead

The adjacent field to ActiveThreads //0x5f0 is ThreadListHead //0x5e0:

Image

Again, best guess that _EPROCESS.ThreadListHead is a list of _ETHREAD structures and the _ETHREAD.ThreadListEntry is at field offset 0x4e8; so we need to subtract 0x4e8 from the first _EPROCESS.ThreadListHead.FLink to get the base address of the first _ETHREAD then walk the thread list from this base address with dt.

Image

kd> dx -id 0,0,ffffc48741c30080 -r1 (*((ntkrnlmp!_LIST_ENTRY *)0xffffc48741c30660))
(*((ntkrnlmp!_LIST_ENTRY *)0xffffc48741c30660))                 [Type: _LIST_ENTRY]
    [+0x000] Flink            : 0xffffc4874188c568 [Type: _LIST_ENTRY *]
    [+0x008] Blink            : 0xffffc4874188b568 [Type: _LIST_ENTRY *]
kd> ? 0xffffc4874188c568 - 0x4e8
Evaluate expression: -65389777600384 = ffffc487`4188c080
kd> dt nt!_ETHREAD Cid ffffc487`4188c080
   +0x478 Cid : _CLIENT_ID
kd> dx -id 0,0,ffffc48741c30080 -r1 (*((ntkrnlmp!_CLIENT_ID *)0xffffc4874188c4f8))
(*((ntkrnlmp!_CLIENT_ID *)0xffffc4874188c4f8))                 [Type: _CLIENT_ID]
    [+0x000] UniqueProcess    : 0x864 [Type: void *]
    [+0x008] UniqueThread     : 0x10f8 [Type: void *]
kd> ? 0x864
Evaluate expression: 2148 = 00000000`00000864
kd> ? 0x10f8
Evaluate expression: 4344 = 00000000`000010f8

Image

So it looks like we are in the right area if we can identify the first _ETHREAD.Cid // Client ID; matching PID=2148 and TID=4344 (alive; Flags:0).

Now let's enumerate all the _ETHREADs now that we have a base _ETHREAD address calculated at: ffffc487``4188c080:

kd> dt nt!_ETHREAD Cid -l ThreadListEntry.FLink ffffc487`4188c080
ThreadListEntry.FLink at 0xffffc487`4188c080
---------------------------------------------
   +0x478 Cid : _CLIENT_ID
   +0x4e8 ThreadListEntry :  [ 0xffffc487`40b8e568 - 0xffffc487`41c30660 ]

ThreadListEntry.FLink at 0xffffc487`40b8e080
---------------------------------------------
   +0x478 Cid : _CLIENT_ID
   +0x4e8 ThreadListEntry :  [ 0xffffc487`417c7568 - 0xffffc487`4188c568 ]

ThreadListEntry.FLink at 0xffffc487`417c7080
---------------------------------------------
   +0x478 Cid : _CLIENT_ID
   +0x4e8 ThreadListEntry :  [ 0xffffc487`40bc4568 - 0xffffc487`40b8e568 ]

ThreadListEntry.FLink at 0xffffc487`40bc4080
---------------------------------------------
   +0x478 Cid : _CLIENT_ID
   +0x4e8 ThreadListEntry :  [ 0xffffc487`4188b568 - 0xffffc487`417c7568 ]

ThreadListEntry.FLink at 0xffffc487`4188b080
---------------------------------------------
   +0x478 Cid : _CLIENT_ID
   +0x4e8 ThreadListEntry :  [ 0xffffc487`41c30660 - 0xffffc487`40bc4568 ]
kd> dx -id 0,0,ffffc48741c30080 -r1 (*((ntkrnlmp!_CLIENT_ID *)0xffffc4874188c4f8))
(*((ntkrnlmp!_CLIENT_ID *)0xffffc4874188c4f8))                 [Type: _CLIENT_ID]
    [+0x000] UniqueProcess    : 0x864 [Type: void *]
    [+0x008] UniqueThread     : 0x10f8 [Type: void *]
kd> dx -id 0,0,ffffc48741c30080 -r1 (*((ntkrnlmp!_CLIENT_ID *)0xffffc48740b8e4f8))
(*((ntkrnlmp!_CLIENT_ID *)0xffffc48740b8e4f8))                 [Type: _CLIENT_ID]
    [+0x000] UniqueProcess    : 0x864 [Type: void *]
    [+0x008] UniqueThread     : 0x53c [Type: void *]
kd> dx -id 0,0,ffffc48741c30080 -r1 (*((ntkrnlmp!_CLIENT_ID *)0xffffc487417c74f8))
(*((ntkrnlmp!_CLIENT_ID *)0xffffc487417c74f8))                 [Type: _CLIENT_ID]
    [+0x000] UniqueProcess    : 0x864 [Type: void *]
    [+0x008] UniqueThread     : 0xe78 [Type: void *]
kd> dx -id 0,0,ffffc48741c30080 -r1 (*((ntkrnlmp!_CLIENT_ID *)0xffffc48740bc44f8))
(*((ntkrnlmp!_CLIENT_ID *)0xffffc48740bc44f8))                 [Type: _CLIENT_ID]
    [+0x000] UniqueProcess    : 0x864 [Type: void *]
    [+0x008] UniqueThread     : 0x1628 [Type: void *]
kd> dx -id 0,0,ffffc48741c30080 -r1 (*((ntkrnlmp!_CLIENT_ID *)0xffffc4874188b4f8))
(*((ntkrnlmp!_CLIENT_ID *)0xffffc4874188b4f8))                 [Type: _CLIENT_ID]
    [+0x000] UniqueProcess    : 0x864 [Type: void *]
    [+0x008] UniqueThread     : 0x10fc [Type: void *]
kd> ? 0x10f8
Evaluate expression: 4344 = 00000000`000010f8
kd> ? 0x53c
Evaluate expression: 1340 = 00000000`0000053c
kd> ? 0xe78
Evaluate expression: 3704 = 00000000`00000e78
kd> ? 0x1628
Evaluate expression: 5672 = 00000000`00001628
kd> ? 0x10fc
Evaluate expression: 4348 = 00000000`000010fc

Image

And we've identified all 5 threads, alive: TID: 4344, TID: 1340, TID: 3704, TID: 5672 and the terminated TID: 4348 (dead) thread.

Lets compare both _ETHREAD alive and a dead threads to understand what the difference is; we'll dump the following TIDs:

Image

Image

Our first indication that something is drastically different is _ETHREAD.CrossThreadFlags.Terminated = 1 at offset 0x510 for the dead thread TID: 4348. Our second indication is that _ETHREAD.KernelStackReference = 0 at offset 0x55c for the dead thread TID: 4348. I am only guessing that since this TID: 4348 dead thread has no kernel stack reference, that the kernel-side callstack of this thread is no longer allocated and is gone and there's no chance in the world for this thread to 'revive' and can no longer make any more NT system service calls/transitions via syscall.

Image

Image

πŸ”Ž πŸ‘» Zooming Out and Getting out of whack

The problem can get so bad in heavily loaded and larger complex systems that I was able to create Trouble.exe that greatly exacerbates the discontinuity/discrepancy between Kernel32!Pss*Snapshot "Thread Capture" counts and Process.Threads.Count; see below:

Image

πŸ—Ί 🧭 Guidence and Direction, and Background Context

If we want DataDog Tracer's runtime.dotnet.threads.count to maintain closer parity with previous DD Tracer < v4.29 behavior and some semblance of parity/consistency across Linux and Windows thread reporting and operating-system monitoring/tool reporting, before PR #5304 was introduced, then we'll need to augment PR #5304 with additional code to ::PssWalkSnapshot( PSS_WALK_THREADS ) and filter/count for threads with PSS_THREAD_ENTRY.Flags = PSS_THREAD_FLAGS_NONE to get an accurate count of "active threads" only.

Background Context: We had a problem with DataDog Tracer after a DD Tracer update at my work (>= 2.49+) that lead us to believe there was a serious threading problem with one of our apps and such strange behavior between Windows vs Linux, and we weren't sure why. So I spent a few Sundays investigating this super fun problem and topic; and now we know why. πŸ™

Hope this helps anyone who is interested in the internals of DD Tracer runtime.dotnet.threads.count and Kernel32!Pss*Snapshot APIs on Windows 8.1+.

Thank you,
Brian Chavez

kevingosse commented 6 hours ago

Thanks for the extensive research! We're running short for the next release, but I will look into it for the one after.