dotnet / diagnostics

This repository contains the source code for various .NET Core runtime diagnostic tools and documents.

Single File module scan takes a very long time (symbol lookup) #3737

Open Moonfish opened 1 year ago

Moonfish commented 1 year ago

Background and Motivation

I regularly use SOS to diagnose issues (particularly from user-submitted crash dumps), but also in live debugging sessions. The particular software application in my case is a large, complex hybrid composed of both managed and native components. Quite often, hundreds of native DLLs will be loaded into the process. As soon as !pe or !clrstack is invoked, it appears that these commands first blindly (attempt to) load symbols for every DLL that's loaded in the process, regardless of whether it's managed or not. Loading hundreds of symbol files can be excruciatingly slow, and it doesn't seem to matter whether they are locally cached, so there is a huge delay for each dmp file that needs to be examined.

In native debugging generally, symbols are more intelligently loaded on demand. In addition, there would seem to be little value in loading symbols for native DLLs to support a managed debugging extension.

This behavior was definitely not always present in SOS, though I'm not sure just when it was introduced. If for some reason, full symbols are actually necessary, why not let the user decide? I'd rather see a warning than wait for all symbols to load, especially if none of them will actually appear in the stack or be relevant to the managed issue at hand. Assuming that all managed applications are entirely composed of managed assemblies is naive in the extreme.

Please let me know if there is an existing work-around for the problem described.

mikem8361 commented 1 year ago

I'm assuming you are using windbg to debug these crash dumps (but most of this applies to lldb for linux/macos dumps too). The SOS commands themselves only load managed assembly PDBs on demand (i.e. !clrstack only downloads them for the frames it displays, and only if .lines has been enabled). The PDBs are cached, so when the command is run again or another session opens the same dump, the PDBs are loaded from the cache.

What you might be seeing is that when SOS looks for the runtime module (i.e. coreclr.dll on Windows), it needs to enumerate all the native modules, which, depending on how windbg symbol loading is set up, may attempt to download the native PDB for each native module. I think that is the default for windbg, but there is a way to disable it; I'll have to look into that. The native module PDBs are also cached on disk, but it still takes a while to load them.

mikem8361 commented 1 year ago

It looks like my theory about what is happening to you may be incorrect. I tried it locally with the newest version of cdb (same engine as windbg) with !sym noisy enabled and don't see it loading every native module PDB.

mikem8361 commented 1 year ago

Of course my test involved a small test app with not that many modules (either native or managed).

leculver commented 1 year ago

Maybe related to #675?

Moonfish commented 1 year ago

> I'm assuming you are using windbg to debug these crash dumps (but most of this applies to lldb for linux/macos dumps too). The SOS commands themselves only load managed assembly PDBs on demand (i.e. !clrstack only downloads them for the frames it displays, and only if .lines has been enabled). The PDBs are cached, so when the command is run again or another session opens the same dump, the PDBs are loaded from the cache.
>
> What you might be seeing is that when SOS looks for the runtime module (i.e. coreclr.dll on Windows), it needs to enumerate all the native modules, which, depending on how windbg symbol loading is set up, may attempt to download the native PDB for each native module. I think that is the default for windbg, but there is a way to disable it; I'll have to look into that. The native module PDBs are also cached on disk, but it still takes a while to load them.

Yes, this is with CDB, and I think it might only be happening with crash dumps. Since I regularly analyze dmps where the process contains more than 400 loaded DLLs, it's quite impactful. I attached a second debugger while I was seeing the attempts to load symbols for each and every DLL. I'm guessing that either GetOffsetByName enters a loop which enumerates all symbols, or it's being called repeatedly with symbols that are not managed...

 # Child-SP          RetAddr               Call Site
00 0000000e`50cf5f98 00007ffe`5449306e     ntdll!NtWaitForSingleObject+0x14
01 0000000e`50cf5fa0 00007ffe`42204ea8     KERNELBASE!WaitForSingleObjectEx+0x8e
02 0000000e`50cf6040 00007ffe`421ffaff     wininet!CPendingSyncCall::HandlePendingSync_AppHangIsAppBugForCallingWinInetSyncOnUIThread+0xe0
03 0000000e`50cf6070 00007ffe`421b5556     wininet!INTERNET_HANDLE_OBJECT::HandlePendingSync+0x33
04 0000000e`50cf60a0 00007ffe`42167876     wininet!HttpWrapSendRequest+0x8db36
05 0000000e`50cf6240 00007ffe`11cd3dcd     wininet!HttpSendRequestW+0x96
06 0000000e`50cf6340 00007ffe`11cd396c     symsrv!StoreWinInet::request+0x4d
07 0000000e`50cf6380 00007ffe`11cd30e9     symsrv!StoreWinInet::fileinfo+0x7c
08 0000000e`50cf6440 00007ffe`11ccd2d4     symsrv!StoreWinInet::get+0x5d9
09 0000000e`50cf66a0 00007ffe`11cccc1d     symsrv!StoreHTTP::open+0x44
0a 0000000e`50cf6f30 00007ffe`11cc106e     symsrv!StoreHTTP::find+0x2cd
0b 0000000e`50cf6ff0 00007ffe`11cc246b     symsrv!cascade+0x156
0c 0000000e`50cf7460 00007ffe`11cc4bea     symsrv!<lambda_21fedf2ed308b0d16cbc6e424e59ba56>::operator()+0x49f
0d 0000000e`50cf7760 00007ffe`11cc2696     symsrv!Debugger::Utils::ConvertException<<lambda_21fedf2ed308b0d16cbc6e424e59ba56> >+0x12
0e 0000000e`50cf77a0 00007ffe`11cc1bd2     symsrv!SymbolServerByIndexAndChecksumsW+0xc6
0f 0000000e`50cf7840 00007ffe`035967c4     symsrv!SymbolServerW+0x162
10 0000000e`50cf78b0 00007ffe`0357ad13     dbghelp!symsrvGetFile+0x2fc
11 0000000e`50cf8170 00007ffe`0357c708     dbghelp!diaLocatePdb+0x5ab
12 0000000e`50cf8ed0 00007ffe`035a967e     dbghelp!diaGetPdb+0x2f8
13 0000000e`50cf91a0 00007ffe`035a79ea     dbghelp!GetDebugData+0x2e2
14 0000000e`50cf9430 00007ffe`035a7d8e     dbghelp!modloadWorker+0x2c6
15 0000000e`50cf9930 00007ffe`035a76d6     dbghelp!modload+0x102
16 0000000e`50cf9bd0 00007ffe`0359a922     dbghelp!LoadSymbols+0x4c6
17 0000000e`50cf9ce0 00007ffe`0359c53d     dbghelp!ModLoop+0x9e
18 0000000e`50cf9d70 00007ffe`035a1315     dbghelp!EnumSymbols+0xfd
19 0000000e`50cfbd70 00007ffd`c1e88d66     dbghelp!SymEnumSymbolsExW+0x55
1a 0000000e`50cfbdd0 00007ffd`c1e89183     dbgeng!EnumSymbolInfoRaw+0x2d2
1b 0000000e`50cfc1d0 00007ffd`c1d39fa4     dbgeng!GetOffsetFromSym+0x34b
1c 0000000e`50cfc550 00007ffd`c1d39ec3     dbgeng!DebugClient::GetOffsetByNameWide+0x94
1d 0000000e`50cfc5a0 00007ffd`ee476a0e     dbgeng!DebugClient::GetOffsetByName+0xd3
1e 0000000e`50cfc840 00007ffd`61fd6339     sos!DbgEngServices::GetOffsetBySymbol+0x15e [D:\a\_work\1\s\src\SOS\Strike\dbgengservices.cpp @ 439] 

Moonfish commented 1 year ago

Dug a little more, and it's now clear to me what's going on:

I've pasted the managed portion of the stack (obtained by attaching to windbg during repro). SOS is repeatedly calling GetOffsetBySymbol, passing in a guessed string. It's doing this to find a symbol called DotNetRuntimeInfo as part of its initialization. It appears that in the case of crash dumps, it doesn't know which DLL that particular symbol lies within, so it takes the list of loaded DLLs, appends "!DotNetRuntimeInfo" to each, and passes that to the function; not a big deal if you only have a few DLLs loaded, but terrible when you have hundreds of them plus a slow symbol server. The underlying debugger engine dutifully tries to load symbols for whatever DLL is in front of the bang (if it isn't already loaded) so that it can hunt for DotNetRuntimeInfo.

You can see this by using bp dbgeng!DebugClient::GetOffsetByName "da @rdx;g"

Managed Stack

000000eac68fc910 00007ffdd6da23f9 DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr, Int32, Byte*, UInt64 ByRef)
000000eac68fc9d0 00007ffdd6da22fc SOS.DebuggerServices.GetOffsetBySymbol(Int32, System.String, UInt64 ByRef)
000000eac68fca30 00007ffdd6da2201 SOS.Extensions.ModuleServiceFromDebuggerServices+ModuleFromDebuggerServices.TryGetSymbolAddressInner(System.String, UInt64 ByRef)
000000eac68fca70 00007ffdd6d9b71b Microsoft.Diagnostics.DebugServices.Implementation.Module.Microsoft.Diagnostics.DebugServices.IExportSymbols.TryGetSymbolAddress(System.String, UInt64 ByRef)
000000eac68fcae0 00007ffdd6d9ae9f Microsoft.Diagnostics.DebugServices.Implementation.RuntimeService.Microsoft.Diagnostics.Runtime.IExportReader.TryGetSymbolAddress(UInt64, System.String, UInt64 ByRef)
000000eac68fcb30 00007ffdd6d96cc4 Microsoft.Diagnostics.Runtime.DataTarget.GetOrCreateClrVersions()
000000eac68fce00 00007ffdd6d93c11 Microsoft.Diagnostics.DebugServices.Implementation.RuntimeService.EnumerateRuntimes()
000000eac68fce60 00007ffdd6d9307f Microsoft.Diagnostics.DebugServices.Implementation.ContextService.GetCurrentRuntime()
000000eac68fcec0 00007ffdd6d92b99 Microsoft.Diagnostics.DebugServices.Implementation.ServiceProvider.GetService(System.Type)
000000eac68fcf20 00007ffdd6d92b0b Microsoft.Diagnostics.DebugServices.ServiceProviderExtensions.GetService[[System.__Canon, mscorlib]](System.IServiceProvider)
000000eac68fcf60 00007ffdd6d92f8b SOS.Hosting.TargetWrapper.GetRuntime(IntPtr, IntPtr*)
000000eac68fcfb0 00007ffdd6d92e71 DomainBoundILStubClass.IL_STUB_ReversePInvoke(Int64, Int64)

Example output from the breakpoint listed above. This is just a tiny subset of the DLLs (none of which are managed) that ultimately get loaded in the hunt for that symbol.

(6670.2d60): CLR exception - code e0434352 (first chance)
(6670.2d60): CLR exception - code e0434352 (first chance)
000002a3`8e0f5660  "NITFCoreLib!DotNetRuntimeInfo"
(6670.2d60): CLR exception - code e0434352 (first chance)
(6670.2d60): CLR exception - code e0434352 (first chance)
000002a3`8e0f56f0  "NITFRasterDB!DotNetRuntimeInfo"
(6670.2d60): CLR exception - code e0434352 (first chance)
(6670.2d60): CLR exception - code e0434352 (first chance)
000002a3`8e0f56f0  "Map!DotNetRuntimeInfo"
(6670.2d60): CLR exception - code e0434352 (first chance)
(6670.2d60): CLR exception - code e0434352 (first chance)
000002a3`b0f07370  "CartoXProvider!DotNetRuntimeInfo"
000002a3`b0f07390  ""
(6670.2d60): CLR exception - code e0434352 (first chance)
(6670.2d60): CLR exception - code e0434352 (first chance)
000002a3`b0f076b0  "MapServerLayer!DotNetRuntimeInfo"
000002a3`b0f076d0  ""
(6670.2d60): CLR exception - code e0434352 (first chance)
(6670.2d60): CLR exception - code e0434352 (first chance)
000002a3`8e0f53c0  "MappingRTC!DotNetRuntimeInfo"
(6670.2d60): CLR exception - code e0434352 (first chance)
(6670.2d60): CLR exception - code e0434352 (first chance)
000002a3`8e0f5600  "GpToolbox!DotNetRuntimeInfo"
(6670.2d60): CLR exception - code e0434352 (first chance)
(6670.2d60): CLR exception - code e0434352 (first chance)
000002a3`8e0f5600  "FileGDB!DotNetRuntimeInfo"
(6670.2d60): CLR exception - code e0434352 (first chance)
(6670.2d60): CLR exception - code e0434352 (first chance)
000002a3`8e0f5600  "OptimizerLib!DotNetRuntimeInfo"
(6670.2d60): CLR exception - code e0434352 (first chance)
(6670.2d60): CLR exception - code e0434352 (first chance)
000002a3`8df9f9a0  "GpTinFunctions!DotNetRuntimeInfo"
000002a3`8df9f9c0  ""
(6670.2d60): CLR exception - code e0434352 (first chance)
(6670.2d60): CLR exception - code e0434352 (first chance)
000002a3`8df9fc60  "EditingService!DotNetRuntimeInfo"

I believe the associated source is located here: \diagnostics\src\SOS\Strike\platform\runtimeimpl.cpp

@mikem8361 The change was added 4/9/2020 to support single-file exe debugging with SOS (commit f40388e4). I searched for that symbol manually and couldn't locate it in any of our assemblies, so I'm thinking maybe this doesn't apply to non-single-file applications? Support for the single-file case shouldn't result in decreased performance for everyone else.


/**********************************************************************\
 * Search all the modules in the process for the single-file host
\**********************************************************************/
static HRESULT GetSingleFileInfo(ITarget* target, PULONG pModuleIndex, PULONG64 pModuleAddress, RuntimeInfo** ppRuntimeInfo)
{
    _ASSERTE(pModuleIndex != nullptr);
    _ASSERTE(pModuleAddress != nullptr);

    // No debugger service instance means that SOS is hosted by dotnet-dump,
    // which does runtime enumeration in CLRMD. We should never get here.
    IDebuggerServices* debuggerServices = GetDebuggerServices();
    if (debuggerServices == nullptr) {
        return E_NOINTERFACE;
    }

    ULONG loaded, unloaded;
    HRESULT hr = g_ExtSymbols->GetNumberModules(&loaded, &unloaded);
    if (FAILED(hr)) {
        return hr;
    }

    const char* symbolName = "DotNetRuntimeInfo";
    for (ULONG index = 0; index < loaded; index++)
    {
    . . .
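
The elided loop body appears to be where the damage happens. Based on the breakpoint output above, it evidently builds a "<module>!DotNetRuntimeInfo" name for every loaded module and asks the debugger engine to resolve it. Purely as an illustrative sketch (written against the raw dbgeng API rather than SOS's internal services, so the helper calls here are my guess, not the actual source), the body of that loop amounts to something like:

    for (ULONG index = 0; index < loaded; index++)
    {
        // Illustrative sketch only -- not the actual SOS source. Get the short
        // module name for this index (e.g. "NITFCoreLib").
        char moduleName[MAX_PATH];
        hr = g_ExtSymbols->GetModuleNames(index, 0, NULL, 0, NULL, moduleName, ARRAYSIZE(moduleName), NULL, NULL, 0, NULL);
        if (FAILED(hr)) {
            continue;
        }

        // Build "<module>!DotNetRuntimeInfo" and ask dbgeng to resolve it.
        // This is the per-module lookup that forces the engine to load (or
        // download) symbols for every DLL in the dump.
        char qualifiedName[MAX_PATH + 32];
        sprintf_s(qualifiedName, "%s!%s", moduleName, symbolName);

        ULONG64 address = 0;
        if (SUCCEEDED(g_ExtSymbols->GetOffsetByName(qualifiedName, &address)) && address != 0)
        {
            // Found the single-file host; record it and stop scanning.
            *pModuleIndex = index;
            // ... read the RuntimeInfo structure at 'address' ...
            break;
        }
    }
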
Moonfish commented 1 year ago

I'd like to build SOS from source locally, but am having some difficulties with the build environment/configuration. What I'm after is a version suitable for use with Windows, x64 release, and dotnet 6. I'm currently using VS2022 and have cmake installed. I'm only interested in SOS (the native Strike part) and not the other utilities.

Any assistance with the appropriate arguments to pass to build.cmd would be appreciated.

leculver commented 1 year ago

build -c release will generate an x64 release version.

Any SOS will debug .NET 6 dumps, but SOS itself is compiled against the desktop framework, not .NET 6.

Moonfish commented 1 year ago

Thanks @leculver

As a temporary measure to get us going, I simply altered GetSingleFileInfo() to return E_FAIL (we don't need to handle single exe cases). After this change, the performance difference is huge--essentially instantaneous--and our batch dmp processing times have accordingly dropped dramatically. Not sure if there is a better way to find the RuntimeInfo than by scanning all loaded DLLs, but that approach is a significant issue for users like us.
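
For reference, the entire change was just short-circuiting the function before it starts enumerating modules; roughly (this is my local hack, not a proposed fix):

static HRESULT GetSingleFileInfo(ITarget* target, PULONG pModuleIndex, PULONG64 pModuleAddress, RuntimeInfo** ppRuntimeInfo)
{
    // Local workaround: we never ship single-file apps, so skip the
    // per-module DotNetRuntimeInfo export scan entirely. Callers treat the
    // failure as "no single-file runtime found" and carry on.
    return E_FAIL;
}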

Thanks again.

leculver commented 1 year ago

@Moonfish Thanks for the info/investigation! Maybe this issue should be titled: "Single File module scan takes a very long time" or something similar. Is it ok if I rename/repurpose this thread?

@mikem8361 The underlying issue here is scanning every file for an export. There can be thousands of modules loaded, and we'll try to touch the symbol server for every one of them, especially in non-full dumps (most folks don't collect full dumps).

Is there a way for us to know which module is the "main" or startup module? We could either:

  1. Only scan the main module for the export. Requires us to actually know the "main" module
  2. Only scan for singlefile exports when we don't find another runtime (see the sketch after this list). This means SingleFile debugging will not work in the SxS case where a singlefile program ends up loading another CLR at the same time. (COM activation of Desktop CLR C# code would be a good example.)
  3. Ensure the single-file dac always puts the export table into the dump so we never request it from the symbol server. Then redesign the startup path of ClrMD to not request the export table from the symbol server. The drawback is single-file debugging doesn't work until .Net 8.
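
For option 2, I'm picturing an ordering roughly like the following sketch (hypothetical helper names, just to show where the single-file scan would move):

HRESULT FindRuntime(ITarget* target, PULONG pModuleIndex, PULONG64 pModuleAddress, RuntimeInfo** ppRuntimeInfo)
{
    // Cheap path first: look for a conventionally named runtime module
    // (coreclr.dll / clr.dll / libcoreclr.so) in the debugger's module list.
    // FindRuntimeModuleByName is a hypothetical helper for this sketch.
    HRESULT hr = FindRuntimeModuleByName(target, pModuleIndex, pModuleAddress);
    if (SUCCEEDED(hr))
    {
        return hr;
    }

    // Only when no conventional runtime was found do we pay for the
    // per-module single-file export scan (which may hit the symbol server).
    return GetSingleFileInfo(target, pModuleIndex, pModuleAddress, ppRuntimeInfo);
}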

I have heard complaints about perf from single-file from other places too. I don't think we can leave it as-is long term.

Thoughts?

Moonfish commented 1 year ago

Feel free to rename issue as appropriate.

mikem8361 commented 1 year ago

> @mikem8361 The underlying issue here is scanning every file for an export. There can be thousands of modules loaded, and we'll try to touch the symbol server for every one of them, especially in non-full dumps (most folks don't collect full dumps).
>
> Is there a way for us to know which module is the "main" or startup module? We could either:

For Windows exes and MacOS binaries we can, but not for ELF Linux binaries.

> 1. Only scan the main module for the export. Requires us to actually know the "main" module

Obviously this would be best. Doesn't work for Linux.

> 2. Only scan for singlefile exports when we don't find another runtime. This means SingleFile debugging will not work in the SxS case where a singlefile program ends up loading another CLR at the same time. (COM activation of Desktop CLR C# code would be a good example.)

That would only happen on Windows so it would work for Linux and not necessary for OSX.

> 3. Ensure the single-file dac always puts the export table into the dump so we never request it from the symbol server. Then redesign the startup path of ClrMD to not request the export table from the symbol server. The drawback is single-file debugging doesn't work until .Net 8.

The intent was that the DotNetRuntimeInfo export and the export table itself are always part of the core dump (createdump makes sure of that for Linux/MacOS). For Windows, it would have to be part of the DAC, which I don't think it is now.

But all of that is moot if we go through all of the other modules in the process looking at their export tables, which might cause those DLLs to be downloaded if their export tables weren't in the dump, even when the single-file app exe's or the coreclr module's are.

> I have heard complaints about perf from single-file from other places too. I don't think we can leave it as-is long term.
>
> Thoughts?

Not any good ones :).

mikem8361 commented 4 months ago

@Moonfish, I was looking at this issue again and noticed that the reason the C++ GetSingleFileInfo is being called is that you don't have a .NET Core runtime or SDK installed in the standard place under C:\Program Files\dotnet, or somehow you don't have the desktop Framework installed. SOS now has a lot of C# managed code, which includes better code to inspect modules for the DotNetRuntimeInfo single-file export. The C++ GetSingleFileInfo is the fallback function if it can't find a runtime to host SOS's C# code. The C# code to find the export doesn't call windbg. It may still need to download the DLL if the dump doesn't have the export table for the module, but that should be rare.
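
For context, finding the export without going through windbg's symbol machinery only requires walking the module's PE export directory out of the dump's memory. Conceptually it amounts to something like the sketch below (simplified, not the actual managed implementation; ReadTargetMemory stands in for whatever reads the dump's virtual memory):

#include <windows.h>
#include <cstring>

// Hypothetical helper: reads 'size' bytes of the target's virtual memory
// (i.e. the dump) starting at 'address' into 'buffer'.
bool ReadTargetMemory(ULONG64 address, void* buffer, size_t size);

// Look up a named export (e.g. "DotNetRuntimeInfo") by walking the PE export
// directory of a module image mapped in the dump. No symbol files involved;
// it only fails if the export directory pages aren't present in the dump.
bool FindExportRva(ULONG64 moduleBase, const char* exportToFind, DWORD* rvaOut)
{
    IMAGE_DOS_HEADER dos;
    if (!ReadTargetMemory(moduleBase, &dos, sizeof(dos)) || dos.e_magic != IMAGE_DOS_SIGNATURE)
        return false;

    IMAGE_NT_HEADERS64 nt;
    if (!ReadTargetMemory(moduleBase + dos.e_lfanew, &nt, sizeof(nt)))
        return false;

    const IMAGE_DATA_DIRECTORY& dir = nt.OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
    if (dir.VirtualAddress == 0)
        return false;                       // module exports nothing

    IMAGE_EXPORT_DIRECTORY exports;
    if (!ReadTargetMemory(moduleBase + dir.VirtualAddress, &exports, sizeof(exports)))
        return false;

    for (DWORD i = 0; i < exports.NumberOfNames; i++)
    {
        // Read the i-th export name RVA, then the name string itself.
        DWORD nameRva = 0;
        ReadTargetMemory(moduleBase + exports.AddressOfNames + i * sizeof(DWORD), &nameRva, sizeof(nameRva));

        char name[64] = {};
        ReadTargetMemory(moduleBase + nameRva, name, sizeof(name) - 1);
        if (strcmp(name, exportToFind) == 0)
        {
            // Map the name index to its ordinal, then to the export's RVA.
            WORD ordinal = 0;
            ReadTargetMemory(moduleBase + exports.AddressOfNameOrdinals + i * sizeof(WORD), &ordinal, sizeof(ordinal));
            ReadTargetMemory(moduleBase + exports.AddressOfFunctions + ordinal * sizeof(DWORD), rvaOut, sizeof(DWORD));
            return true;
        }
    }
    return false;
}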

I'm going to continue to investigate what we can do to make this faster, but you may want to update to the latest SOS and install a .NET 6 or greater runtime on the machine where you are debugging the dump.