Open dmex opened 3 years ago
Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.
Author: | dmex |
---|---|
Assignees: | - |
Labels: | `area-Diagnostics-coreclr`, `untriaged` |
Milestone: | - |
CC @agocke for insight on what might be different when specifying the runtime identifier. If I recall, that would be packaging the app in a single-file bundle under .net5, correct?
CC @BruceForstall in case there's some difference in how codegen operates under this configuration, though I am not aware of one.
@dmex what version of the SDK did you use to build the target app? (run dotnet --info
to get all the details) Do you see this same behavior under 3.1 or 6.0 with the same configuration?
CC @BruceForstall in case there's some difference in how codegen operates under this configuration
I guess it depends on exactly how the configurations are different, from the JIT perspective. Is this x64 code in all cases, or is one x86 and one x64? Are there different optimization switches? Stack walking presumably works fine when done from the CLR itself, so the question is what does StackWalk require that isn't being generated? Maybe unwind info isn't reported to the OS? I think the CLR does that lazily.
Do you see this same behavior under 3.1 or 6.0 with the same configuration?
I was able to reproduce the same behavior with .NET Core 2.0, 3.0, 3.1, 5.0 and 6.0. @josalem
Versions prior to 5.0 looked completely valid just with unresolved frames but a more careful examination showed all addresses were actually invalid, so I didn't notice all versions had the same issue and thought it was a regression.
Maybe unwind info isn't reported to the OS?
Thanks for that tip.
I took a closer look at the unwind table and traced the problem to the SymGetModuleBase64 function and the SymFunctionTableAccess64 function. These two functions internally lookup the unwind information from the process and return 0 when hitting a .NET core CLR frame - dbghelp.dll!DbhStackServices::GetFunctionEntry
doesn't seem to validate the return result of 0 indicating an error and continues walking the stack with an invalid state.
That's one problem but I found the actual issue... An almost identical bug was reported and fixed with #7675 that affected the MiniDumpWriteDump function and the MiniDumpAuxiliaryDlls
registry key.
The KnownFunctionTableDlls
key is missing the required version of mscordaccore.dll (distributed with the application) and without that it prevents SymGetModuleBase64/SymFunctionTableAccess64 from unwinding stack information.
I manually added a registry key to KnownFunctionTableDlls with the path to that version of mscordaccore.dll with the application (e.g. D:\Projects\ConsoleApp6\bin\Debug\net5.0\win-x64\mscordaccore.dll
):
This fixed StackWalk64/StackWalkEx and also fixed DbgShell which uses the dbgeng.dll COM debugger interface:
Tracing some of the windbg calls shows they're not calling either of those dbghelp functions and providing their own versions to the stack walk functions and they walk the modules and manually load mscordaccore.dll from the application' directory which is why they don't have the same problem as dbgeng.dll/DbgShell and dbghelp.dll/Process Hacker when walking thread stacks.
Should #7675 be re-opened since this issue is the exact same problem reported in that issue? It seems someone forgot to implement the same fix for MiniDumpWriteDump in the SymGetModuleBase64/SymFunctionTableAccess64 functions?
We could write our own implementation of SymFunctionTableAccess64 that does the same as windbg and look for the binary containing the MINIDUMP_AUXILIARY_PROVIDER resource and use that for loading the function tables (we're already testing this and it works) but I don't think dbgeng clients like DbgShell (not to mention everyone) can provide those callbacks (or do the required signature checks for security), so they'll still be stuck with broken stack enumeration until these functions are updated?
Maybe @josalem @noahfalk @tommcdon know whether Core needs a new design to support the SymGetModuleBase64/SymFunctionTableAccess64 functions similar to the fix in #7675 -- presumably Core can't use the registry key KnownFunctionTableDlls.
Sorry it looks like that original ping back in September was while I was out on sabbatical and I missed it. Thanks @dmex for the great investigation! This will probably take some further investigation on how best to resolve the issue, but my initial thoughts on solutions in order of preference:
This path ideally completes debugger/OS support for dynamic function tables in a way that lets any dynamic code generator participate without needing to jump through hoops of distributing custom dlls like mscordaccore. It also avoids substantial annoyances for debugging tools trying to properly acquire these dlls and validate that it would be safe to run code provided by a 3rd party.
Add a new resource, similar to the existing DUMP_HELPER resource whose purpose is to provide an alternative lookup mechanism for KnownFunctionTableDlls. The advantage of using a new resource is that nothing guarantees that the dll implementing dump generation logic needs to be the same dll as the one implementing function table inspection logic. For CLR we happened to use the same dll for both because it was convenient, but its a fairly arbitrary choice. Some runtimes might also want to support just one of these extensibility points without any implication that they must support the other. Once the resource was present, all the places that currently do KnownFunctionTableDlls lookup would need to be updated to support the new resource.
Same as (2) except reuse the existing DUMP_HELPER resource rather than create a new one.
We aren't going to have time to get to this in .NET 7 so I've put it on the backlog for .NET 8 for now. For anyone needing to resolve this more quickly @dmex had some good suggestions on workarounds above.
Moving to future as there are workarounds and this doesn't fit into our priorities for .NET 8. We will reconsider in future releases as we receive feedback.
A few things from this investigation:
UnwindInfoTable::PublishUnwindInfo
in CEEJitInfo::WriteCode
before the call to PublishUnwindInfoForMethod
is enough to initialize the infrastructure once as in the patch below. However, since this means that the CodeManager now publishes to this table per compilation, and there's kernel locks in this path, we probably want to perf test if it regresses startup/heavy loading perf.
diff --git a/src/coreclr/vm/jitinterface.cpp b/src/coreclr/vm/jitinterface.cpp
index e813d6f2d27..b8535c9fd0b 100644
--- a/src/coreclr/vm/jitinterface.cpp
+++ b/src/coreclr/vm/jitinterface.cpp
@@ -10783,6 +10783,10 @@ void CEEJitInfo::WriteCode(EEJitManager * jitMgr)
#if defined(TARGET_AMD64)
// Publish the new unwind information in a way that the ETW stack crawler can find
_ASSERTE(m_usedUnwindInfos == m_totalUnwindInfos);
UnwindInfoTable::PublishUnwindInfoForMethod(m_moduleBase, m_CodeHeader->GetUnwindInfo(0), m_totalUnwindInfos);
InstallEEFunctionTable
which registers callback based unwind info. One in creation of code heaps, which perhaps could be replaced with the growable table initializations in lieu of doing it in the CEEJitInfo
path. The other place is when emitting unwind sequences in the stublinker. We'll need to keep a path the DAC/debugger can query.RtlAddFunctionTable
is not recommended according to docs. It's currently used to publish exception unwind info in ConvertedImageLayout
file-backed R2R images on windows, which doesn't seem to be used - we don't use a flat layout of r2r images on Windows. However - we probably want to turn this one to a growable table as well.@jkotas do you know if there's any prior concerns with growable function tables?
Thanks for all the info @hoyosjs! A little confusion on one part: "We only publish into the growable table if there's been an event active that gets emitted through the ETW callback". Looking at the code I don't think this requires sending an event, it just requires that the EventSource has been turned on with relevant keywords. Is there some dependency on an event getting sent that I missed?
do you know if there's any prior concerns with growable function tables?
We should expand the support to ARM64 - right now it's limited to windows x64.
Why would we want to do that? Arm64 uses different stackwalking strategy. The unwindinfo is not necessary to get good diagnostic stacktrace on Arm64.
RtlAddFunctionTable
is not recommended according to docs.
Can you share the link to the docs that says this? I think it is perfectly fine to use RtlAddFunctionTable
when the usage pattern matches the API design.
Thanks for all the info @hoyosjs! A little confusion on one part: "We only publish into the growable table if there's been an event active that gets emitted through the ETW callback". Looking at the code I don't think this requires sending an event, it just requires that the EventSource has been turned on with relevant keywords. Is there some dependency on an event getting sent that I missed?
Sorry - yes. You need to enable jit rundown rather than necessarily send an event per se.
Publishing of the growable function tables has race conditions. There are small windows where the unwind is not published correctly. These race conditions are ok for diagnostic (there are number of other corner cases where the diagnostic stackwalk does not work), but they would not be acceptable for functionality (it would not be ok for you app to crash intermittently). If we were to switch to growable function tables as the only source of unwind info, we would need to do something about this.
Agree that switching. At some point, getting the information off the lists where it's at might me interesting. However, at first I'd focus on being able to emit the . Do you fear that would be too big an overhead?
Why would we want to do that? Arm64 uses different stackwalking strategy. The unwindinfo is not necessary to get good diagnostic stacktrace on Arm64.
In general WinDBG et al use these as streams to be able to find the exception unwind sequence. If other methods are implemented is something I am not familiar with at the moment, I'd have to reach out.
Can you share the link to the docs that says this? I think it is perfectly fine to use RtlAddFunctionTable when the usage pattern matches the API design.
Looks like the push away from it is mostly related to ARM64EC which we don't support: https://learn.microsoft.com/en-us/windows/arm/arm64ec-abi#dynamically-generating-jit-compiling-arm64ec-code. I can't find another documented reason to prefer growable over non-growable. And since the exception table in r2r is constant, I don't see why we should go the growable path.
That being said, diagnostic tools for any released version of windows won't be able to analyze non-FD-msi runtimes. The only way to do it right now is via a full dump with the DAC. This was never an issue with framework. At the very least we should document the workaround for self-contained apps since most tools will end up relying on stack unwinding heuristics otherwise.
Hello,
Not sure where to post this one. There seems to be some sort of codegen(?) issue when adding the
win-x64
runtime identifier to the csproj file that breaks the StackWalk64 and StackWalkEx functions on Windows 10.For example this csproj:
This source code:
Will break both of those functions until RuntimeIdentifier is removed. It doesn't seem to break windbg or SOS since neither of them use StackWalk64/StackWalkEx but rather breaks everyone else that does use those functions. For example:
Microsoft DbgShell shows an invalid stack trace for the main thread:
Process Hacker also shows an invalid stack:
It also breaks other software but these are the only two needed for the bug report. After removing the RuntimeIdentifier from the csproj (delete:
<RuntimeIdentifier>win-x64</RuntimeIdentifier>
) and recompiling the binary, the stack enumeration using StackWalk64/StackWalkEx works perfectly fine.Microsoft DbgShell now shows the correct stack (without DAC):
Process Hacker also now shows the correct stack (with DAC):
You can see from the PH screenshots that frame 2 is
ILStubClass.IL_STUB_PInvoke
but for the binaries compiled with RuntimeIdentifier the return address for that frame was 0x1 which seems to break the stack?I rewrote the Process Hacker stack enumeration to use StackWalkEx thinking it was an issue on our end but the problem remained... I've debugged the PH code and searched the donet repo for any relevant code but couldn't find anything obvious for the behavior so I'm inclined to believe this is some sort of codegen issue?
Reproduction steps:
1) Download DbgShell: https://github.com/microsoft/DbgShell 2) Compile and execute binary with
win-x64
RuntimeIdentifier using above code. 3) DbgShell execute:Connect-Process xxxx
replacing x with the process id. 4) DbgShell switch to thread 0:cd ..\0
5) DbgShell execute:k
for thread stack. 6) Invalid stack. 7) Remove thewin-x64
RuntimeIdentifier and recompile. 8) Valid stack.Configuration
Which version of .NET is the code running on? net5.0
What OS and version, and what distro if applicable? Windows 10, 21H1, 19043.1165, x64,
What is the architecture (x64, x86, ARM, ARM64)? x64
Do you know whether it is specific to that configuration? No