App crashes with an output "Trace/Breakpoint Trap" on Linux when a P/Invoke callback is called from a native library if the dotnet debugger is attached.

walterlv commented 4 months ago

Description

Write a .NET 8 application that calls a native library using P/Invoke with a callback.
Run the app, then attach the dotnet debugger before the callback is called.
- Visual Studio Managed (.NET Core for Unix): https://learn.microsoft.com/en-us/visualstudio/debugger/remote-debugging-dotnet-core-linux-with-ssh?view=vs-2022#attach-the-debugger
- JetBrains Rider: https://www.jetbrains.com/help/rider/SSH_Remote_Debugging.html#debug-application-on-remote-machine
We'll see an output "Trace/Breakpoint Trap" and the app crashes.

Note: Not all native callbacks cause this issue so I've written a minimal reproducible example below.

Reproduction Steps

Minimal reproducible example 1:

Clone this repo: https://github.com/walterlv/Walterlv.Issues.TraceBreakpointTrap
Build the demo for a Linux machine:

dotnet publish -c debug -r linux-x64 --self-contained

Run the app, then attach the dotnet debugger.

$ ./TraceBreakpointTrapDemo
### Trace/Breakpoint Trap issue on .NET debugger ###
Please attach a dotnet debugger and use 'Set next statement'.
Trace/breakpoint trap

Reproducible example 2:

https://github.com/Haltroy/CefGlue

Expected behavior

The app should not crash when the dotnet debugger is attached.

Actual behavior

The app crashes with an output "Trace/Breakpoint Trap".

Regression?

I've only tested this on .NET 8.0.302

Known Workarounds

I've found several workarounds:

Detect if the debugger is attached and don't call the callback.
Use the "Native (GDB)" or "Native (LLDB)" debugger instead of the "Managed (.NET Core for Unix)" debugger.

Note:

The Debugger.IsAttached property cannot detect the native debugger so I added alternative options --sleep <seconds> and --skip-attach for the minimal reproducible example above.
The native debugger is very difficult to use, so I hope this issue can be fixed.
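
For reference, a ptrace-based native debugger such as GDB or LLDB can be detected on Linux by reading the TracerPid field of /proc/self/status, which is 0 when no tracer is attached. The C sketch below is not part of the reproducer, and the function names parse_tracer_pid and current_tracer_pid are hypothetical. Whether this field also reflects the managed debugger is not established in this thread.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Extracts the TracerPid value from the text of a /proc/<pid>/status file.
 * Returns -1 if the field is not present. (Hypothetical helper.) */
long parse_tracer_pid(const char *status_text)
{
    const char *key = "TracerPid:";
    const char *p = status_text;

    while ((p = strstr(p, key)) != NULL) {
        /* Only accept the key at the start of a line. */
        if (p == status_text || p[-1] == '\n') {
            /* strtol skips the tab that separates key and value. */
            return strtol(p + strlen(key), NULL, 10);
        }
        p += 1;
    }
    return -1;
}

/* Reads /proc/self/status and returns the PID of the attached tracer,
 * 0 if no tracer is attached, or -1 if the file cannot be read or parsed.
 * (Hypothetical helper.) */
long current_tracer_pid(void)
{
    char buf[4096];
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL) {
        return -1;
    }
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_tracer_pid(buf);
}
```

A program could call current_tracer_pid() before registering the callback and treat any value greater than 0 as "a native debugger is attached".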

Configuration

.NET: 8.0.302
OS:
- Ubuntu 22.04 LTS
- Debian 12
- UnionTech OS GNU/Linux 20
- Kylin V10 SP1
Architecture:
- x64
- ARM64

I didn't find any environment that doesn't have this issue.

Other information

dotnet tool install -g dotnet-sos
dotnet sos install
ulimit -c unlimited
Run echo "0x3F"> /proc/<pid>/coredump_filter after the process starts and the pid is known.
Attach the debugger and wait for the output Trace/Breakpoint Trap (core dumped).
lldb --core core TraceBreakpointTrapDemo

$ lldb --core core TraceBreakpointTrapDemo
SOS_HOSTING: Failed to find runtime directory
Unrecognized command 'setsymbolserver' because managed hosting failed or was disabled. See sethostruntime command for details.
(lldb) target create "TraceBreakpointTrapDemo" --core "core"
Core file '/home/uos/lvyi/Walterlv.Issue.TraceBreakpointTrap/core' (x86_64) was loaded.
(lldb) clrstack
OS Thread Id: 0x7ef9 (1)
        Child SP               IP Call Site
00007F4AF37DBA38 00007F4AF45F3B41 Walterlv.Issues.TraceBreakpointTrap.VolumeManager.ContextStateCallback(IntPtr, IntPtr)
(lldb) bt
* thread #1, name = 'TraceBreakpoint', stop reason = signal SIGTRAP
  * frame #0: 0x00007f4af45f3b41
    frame #1: 0x00007f4b6ba904f9 libpulse.so.0`___lldb_unnamed_symbol12$$libpulse.so.0 + 73
    frame #2: 0x00007f4b6ba93002 libpulse.so.0`___lldb_unnamed_symbol28$$libpulse.so.0 + 514
    frame #3: 0x00007f4b6ba931d2 libpulse.so.0`___lldb_unnamed_symbol29$$libpulse.so.0 + 98
    frame #4: 0x00007f4b6ba459b2 libpulsecommon-14.2.so`___lldb_unnamed_symbol101$$libpulsecommon-14.2.so + 258
    frame #5: 0x00007f4b6baa63c0 libpulse.so.0`pa_mainloop_dispatch + 672
    frame #6: 0x00007f4b6baa65cc libpulse.so.0`pa_mainloop_iterate + 60
    frame #7: 0x00007f4b6baa6670 libpulse.so.0`pa_mainloop_run + 32
    frame #8: 0x00007f4b6bab43f9 libpulse.so.0`___lldb_unnamed_symbol111$$libpulse.so.0 + 105
    frame #9: 0x00007f4b6ba51628 libpulsecommon-14.2.so`___lldb_unnamed_symbol119$$libpulsecommon-14.2.so + 88
    frame #10: 0x00007f4b73452fa3 libpthread.so.0`start_thread(arg=<unavailable>) at pthread_create.c:486
    frame #11: 0x00007f4b7305d60f libc.so.6`__GI___clone at clone.S:95
(lldb) dis
->  0x7f4af45f3b41: subq   $0x20, %rsp
    0x7f4af45f3b45: leaq   0x20(%rsp), %rbp
    0x7f4af45f3b4a: movq   %rdi, -0x8(%rbp)
    0x7f4af45f3b4e: movq   %rsi, -0x10(%rbp)
    0x7f4af45f3b52: movq   %rdx, -0x18(%rbp)
    0x7f4af45f3b56: cmpl   $0x0, 0x897d3(%rip)
    0x7f4af45f3b5d: je     0x7f4af45f3b64
(lldb)

dotnet-policy-service[bot] commented 4 months ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

tommcdon commented 4 months ago

Hi @walterlv! Thanks for reporting this bug!

I didn't find any environment that doesn't have this issue.

Do you know if this issue reproduces on Windows?

tommcdon commented 4 months ago

Do you know if this issue reproduces on Windows?

Ah, never mind this question, as the repro is specific to Linux.

Do you know if the callback/debugging issue is specific to the libpulse API (e.g. does a standalone repo that uses a callback from C++ to C# on Linux reproduce the issue)? I am curious whether something specific to libpulse is causing the problem, for example a difference in calling convention.

lindexi commented 4 months ago

@tommcdon I can reproduce this issue with @walterlv's repo on my Linux system. I'm sure it's not a libpulse bug, because I can also reproduce it with https://github.com/Haltroy/CefGlue

I could not test on Windows because I failed to run libpulse there, so I do not know whether it can be reproduced on Windows.

tommcdon commented 4 months ago

Possible duplicate of https://github.com/dotnet/runtime/issues/102767. @hoyosjs

walterlv commented 4 months ago

Thanks to my friend @kkwpsv, who helped me find out more about this issue.

@tommcdon This issue is quite different from #102767:

This issue is related to the dotnet debugger on Linux (and only on Linux).
This issue might not be related to the callback, but I can't determine whether it is.

Let's see more details here.

Run the app under a dotnet debugger (I was using the Linux version of JetBrains Rider) and let the app stop at a breakpoint.
Attach lldb to the running process.
Continue the app in the dotnet debugger.
Continue the app in the lldb debugger.

Then,

List all the threads in lldb using thread backtrace all; thread 3 (.NET EventPipe) is stopped with signal SIGTRAP.
Resume the app, and thread 3 receives signal SIGSEGV: address not mapped to object (fault address: 0xbafa13a0).

The stack traces are shown as follows:

https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs

[UnmanagedCallersOnly]
private static unsafe void Callback(byte* sourceId, int isEnabled, byte level,
    long matchAnyKeywords, long matchAllKeywords, Interop.Advapi32.EVENT_FILTER_DESCRIPTOR* filterData, void* callbackContext)
{
    EventPipeEventProvider _this = (EventPipeEventProvider)GCHandle.FromIntPtr((IntPtr)callbackContext).Target!;
    if (_this._eventProvider.TryGetTarget(out EventProvider? target))
    {
        _this.ProviderCallback(target, sourceId, isEnabled, level, matchAnyKeywords, matchAllKeywords, filterData);
    }
}

tommcdon commented 3 months ago

@hoyosjs

mdh1418 commented 3 months ago

Hi @walterlv and @lindexi,

We haven't been able to repro the exact issue from your repros yet, but the SIGSEGV for the EventPipeEventProvider callback looks eerily similar to https://github.com/dotnet/runtime/issues/80666#issuecomment-2249343314, where the _gchandle used in the callback had been freed before the callback completes.

If the dotnet debugger is hitting the same EventPipeEventProvider Callback issue, then a partial fix has already been merged through https://github.com/dotnet/runtime/pull/106040, and a second PR, https://github.com/dotnet/runtime/pull/106156, is still open.

lindexi commented 3 months ago

@mdh1418 Thank you. Which Visual Studio version and dotnet version did you use? And did you debug the application running on Linux?

Can I test a daily dotnet build that includes https://github.com/dotnet/runtime/pull/106040 ?

tommcdon commented 3 months ago

Which Visual Studio version and dotnet version did you use? And did you debug the application running on Linux?

We used the latest version of the C# extension in VS Code.

Can I test a daily dotnet build that includes https://github.com/dotnet/runtime/pull/106040 ?

Yes - the daily builds from https://github.com/dotnet/sdk/blob/main/documentation/package-table.md contain the fix.

kkwpsv commented 3 months ago

@tommcdon I tested again with https://aka.ms/dotnet/9.0.1xx/daily/dotnet-sdk-linux-x64.tar.gz. There is no SIGSEGV now, but the process still exits with SIGTRAP.

I debugged it with lldb. Here's the output:

jwilliamsonveeam commented 1 month ago

Seems like the same problem I'm seeing here: https://github.com/microsoft/DockerTools/issues/444

lindexi commented 1 month ago

@jwilliamsonveeam Sorry, https://github.com/microsoft/DockerTools/issues/444 is very long, and I'm afraid I may be missing important information in it.

jwilliamsonveeam commented 1 month ago

@lindexi I updated my last comment with a small self-contained example of a program that fails with a SIGTRAP in the native C code callback: https://github.com/microsoft/DockerTools/issues/444#issuecomment-2380066894 A zip of the whole solution is in this thread, if you have access: https://developercommunity.visualstudio.com/t/dotnet-process-silently-crashes-when-deb/10740222?

Alxe commented 1 month ago

I've run @walterlv's reproducer (Walterlv.Issues.TraceBreakpointTrap) and reproduced the issue as well.

I've been debugging a similar issue where the scenario is as follows:

A C# callback (annotated with UnmanagedFunctionPointer) is sent to a C function through P/Invoke (annotated with DllImport).
The C code is run in a thread distinct from the one that installed the C# callback.
If the debugger is attached when the C# callback is executed for the first time, the application crashes with a SIGTRAP.
If the debugger is attached after the C# callback has been executed once, the application works correctly.

Using @walterlv's reproducer as a base, I've modified it with these changes and managed to avoid the crash. The output from my execution is as follows:

$ ./artifacts/bin/Walterlv.Issues.TraceBreakpointTrap/debug/TraceBreakpointTrapDemo --skip-attach
### Trace/Breakpoint Trap issue on .NET debugger ###

Context state changed: 1
If you want to debug this demo using other debuggers (e.g. GDB, LLDB), you can use the following options:

  --sleep <seconds>  Sleep for a while before attaching debugger.
  --skip-attach      Skip attaching debugger and run directly.

Please attach a dotnet debugger and use 'Set next statement'.
Context state changed: 2
Context state changed: 3
Context state changed: 4
Context state changed: 5
Issue may not be reproduced. Exit.

In the output, changes 1 to 4 are from before the debugger is attached. Once the debugger is attached, change 5 is printed but there's no crash.

Additionally, in my own (non-shareable) projects, I've been able to use a C debugger (lldb or gdb) to manually call the callback (through a function pointer) directly from the debugger. This led to the C# application throwing the following error:

Fatal error. Invalid Program: attempted to call a UnmanagedCallersOnly method from managed code.

This error is seemingly thrown here, but I don't have a fine understanding of the dotnet runtime. However, it leads me to believe that the key is that there are two distinct threads.

janvorli commented 1 month ago

If the debugger is attached when the C# callback is executed for the first time, the application crashes with a SIGTRAP.

If the debugger is attached after the C# callback has been executed once, the application works correctly.

I think this may have revealed the culprit. The .NET runtime only handles a signal when the thread it occurred on is known to the runtime, which means the thread was either created by the runtime or has already called into the runtime. If the debugger sets a breakpoint on the UnmanagedCallersOnly-marked method and the thread hits it before it has called into the runtime and been registered as a thread that runs managed code, the SIGTRAP is not handled by the runtime's handler. Instead, the default signal handler is invoked, which terminates the process.
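
The effect described above can be illustrated outside the runtime with a plain C sketch. It is an analogy only and does not model the runtime's actual signal handling: a child process raises SIGTRAP either with a handler installed (standing in for a thread the runtime knows) or with the default disposition (standing in for an unknown thread). The function name trap_outcome is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the handler so the child can report that the trap was absorbed. */
static volatile sig_atomic_t trap_seen = 0;

static void on_trap(int signo)
{
    (void)signo;
    trap_seen = 1;
}

/* Forks a child that raises SIGTRAP on itself.
 * install_handler != 0: the child installs on_trap first.
 * install_handler == 0: the child keeps the default disposition.
 * Returns the signal number that killed the child, 0 if the child
 * was not killed by a signal, or -1 on fork/wait failure. */
int trap_outcome(int install_handler)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        struct sigaction sa;
        sigset_t set;

        memset(&sa, 0, sizeof sa);
        sigemptyset(&sa.sa_mask);
        sa.sa_handler = install_handler ? on_trap : SIG_DFL;
        sigaction(SIGTRAP, &sa, NULL);

        /* Make sure SIGTRAP is not blocked by an inherited signal mask. */
        sigemptyset(&set);
        sigaddset(&set, SIGTRAP);
        sigprocmask(SIG_UNBLOCK, &set, NULL);

        raise(SIGTRAP);
        /* Only reached when the handler absorbed the signal. */
        _exit(trap_seen ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Here trap_outcome(0) returns SIGTRAP, because the default action terminates the child (a shell reports this as "Trace/breakpoint trap"), while trap_outcome(1) returns 0, because the handler absorbs the signal and the child exits normally.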

This error is seemingly thrown here

This code is for NativeAOT; in coreclr, the error comes from here: https://github.com/dotnet/runtime/blob/008ee9f84f167cee8d07e086086e1cec724750d5/src/coreclr/vm/dllimportcallback.cpp#L187-L196

Alxe commented 1 month ago

@janvorli Hello and thanks for your input!

I'll be reviewing the ReversePInvokeBadTransition function, as I think I already added a native breakpoint there (it's an extern "C" function) and was able to hit it once.

However, I'd like to point out that the yet-unregistered thread is receiving a SIGTRAP regardless of whether I had a .NET breakpoint or not. Is there anything relevant that the debugger could be doing on thread registration? Could you share some links to code?

jwilliamsonveeam commented 1 month ago

I created a repo with my failing case: https://github.com/jwilliamsonveeam/TimerCallBackDemo. With the debugger attached, it fails with a SIGTRAP even when I have not set any breakpoints.

janvorli commented 1 month ago

The debugger can set some breakpoints on its own for its internal purposes. @tommcdon would most likely know if it can be the case here.

Alxe commented 1 month ago

@janvorli If the debugger is setting its own breakpoint (e.g. on managed-to-unmanaged transitions) and then reaching it before the thread is properly registered with the .NET runtime (e.g. on the first .NET interaction of a thread), then the SIGTRAP and subsequent crash would make sense.

@tommcdon Could you please confirm if my assumption is correct?
