dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

App crashes with an output "Trace/Breakpoint Trap" on Linux when a P/Invoke callback is called from a native library if the dotnet debugger is attached. #104459

Open walterlv opened 1 month ago

walterlv commented 1 month ago

Description

  1. Write a .NET 8 application that calls a native library using P/Invoke with a callback.
  2. Run the app, then attach the dotnet debugger before the callback is called.
  3. The app prints "Trace/Breakpoint Trap" and crashes.

Note: Not all native callbacks cause this issue, so I've written a minimal reproducible example below.

Reproduction Steps

Minimal reproducible example 1:

  1. Clone this repo: https://github.com/walterlv/Walterlv.Issues.TraceBreakpointTrap
  2. Build the demo for a Linux machine:
     $ dotnet publish -c debug -r linux-x64 --self-contained
  3. Run the app, then attach the dotnet debugger:
$ ./TraceBreakpointTrapDemo
### Trace/Breakpoint Trap issue on .NET debugger ###
Please attach a dotnet debugger and use 'Set next statement'.
Trace/breakpoint trap

Reproducible example 2:

Expected behavior

The app should not crash when the dotnet debugger is attached.

Actual behavior

The app crashes with an output "Trace/Breakpoint Trap".

Regression?

I've only tested this on .NET 8.0.302

Known Workarounds

I've found several workarounds:

  1. Detect if the debugger is attached and don't call the callback.
  2. Use the "Native (GDB)" or "Native (LLDB)" debugger instead of the "Managed (.NET Core for Unix)" debugger.

Note:

Configuration

I didn't find any environment that doesn't have this issue.

Other information

  1. dotnet tool install -g dotnet-sos
  2. dotnet sos install
  3. ulimit -c unlimited
  4. Run echo "0x3F"> /proc/<pid>/coredump_filter after the process starts and the pid is known.
  5. Attach the debugger and wait for the output Trace/Breakpoint Trap (core dumped).
  6. lldb --core core TraceBreakpointTrapDemo
$ lldb --core core TraceBreakpointTrapDemo
SOS_HOSTING: Failed to find runtime directory
Unrecognized command 'setsymbolserver' because managed hosting failed or was disabled. See sethostruntime command for details.
(lldb) target create "TraceBreakpointTrapDemo" --core "core"
Core file '/home/uos/lvyi/Walterlv.Issue.TraceBreakpointTrap/core' (x86_64) was loaded.
(lldb) clrstack
OS Thread Id: 0x7ef9 (1)
        Child SP               IP Call Site
00007F4AF37DBA38 00007F4AF45F3B41 Walterlv.Issues.TraceBreakpointTrap.VolumeManager.ContextStateCallback(IntPtr, IntPtr)
(lldb) bt
* thread #1, name = 'TraceBreakpoint', stop reason = signal SIGTRAP
  * frame #0: 0x00007f4af45f3b41
    frame #1: 0x00007f4b6ba904f9 libpulse.so.0`___lldb_unnamed_symbol12$$libpulse.so.0 + 73
    frame #2: 0x00007f4b6ba93002 libpulse.so.0`___lldb_unnamed_symbol28$$libpulse.so.0 + 514
    frame #3: 0x00007f4b6ba931d2 libpulse.so.0`___lldb_unnamed_symbol29$$libpulse.so.0 + 98
    frame #4: 0x00007f4b6ba459b2 libpulsecommon-14.2.so`___lldb_unnamed_symbol101$$libpulsecommon-14.2.so + 258
    frame #5: 0x00007f4b6baa63c0 libpulse.so.0`pa_mainloop_dispatch + 672
    frame #6: 0x00007f4b6baa65cc libpulse.so.0`pa_mainloop_iterate + 60
    frame #7: 0x00007f4b6baa6670 libpulse.so.0`pa_mainloop_run + 32
    frame #8: 0x00007f4b6bab43f9 libpulse.so.0`___lldb_unnamed_symbol111$$libpulse.so.0 + 105
    frame #9: 0x00007f4b6ba51628 libpulsecommon-14.2.so`___lldb_unnamed_symbol119$$libpulsecommon-14.2.so + 88
    frame #10: 0x00007f4b73452fa3 libpthread.so.0`start_thread(arg=<unavailable>) at pthread_create.c:486
    frame #11: 0x00007f4b7305d60f libc.so.6`__GI___clone at clone.S:95
(lldb) dis
->  0x7f4af45f3b41: subq   $0x20, %rsp
    0x7f4af45f3b45: leaq   0x20(%rsp), %rbp
    0x7f4af45f3b4a: movq   %rdi, -0x8(%rbp)
    0x7f4af45f3b4e: movq   %rsi, -0x10(%rbp)
    0x7f4af45f3b52: movq   %rdx, -0x18(%rbp)
    0x7f4af45f3b56: cmpl   $0x0, 0x897d3(%rip)
    0x7f4af45f3b5d: je     0x7f4af45f3b64
(lldb) 
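
The dump-collection steps above (setting the coredump filter once the pid is known, then opening the core in lldb) could be wrapped in a small shell helper. This is only a sketch, not the reporter's script: the function names and the PROC_ROOT variable are hypothetical and exist only so the helper can be pointed at a scratch directory. The 0x3F mask and the lldb command line are taken from the steps above.

```shell
#!/bin/sh
# Illustrative sketch. PROC_ROOT and both function names are assumptions
# made for this example; they are not part of dotnet, SOS, or lldb.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Step 4: write the coredump filter mask for a running process.
# 0x3F sets bits 0-5 of coredump_filter (see core(5)), so anonymous and
# file-backed mappings, ELF headers, and private huge pages are dumped.
set_coredump_filter() {
    pid="$1"
    mask="${2:-0x3F}"
    echo "$mask" > "$PROC_ROOT/$pid/coredump_filter"
}

# Step 6: print the lldb command line used to open the core file.
lldb_core_command() {
    core_file="$1"
    binary="$2"
    echo "lldb --core $core_file $binary"
}
```

Typical use would be to run ulimit -c unlimited, start the app, call set_coredump_filter with its pid, attach the debugger, and after the crash run the command printed by lldb_core_command core TraceBreakpointTrapDemo.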
dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

tommcdon commented 1 month ago

Hi @walterlv! Thanks for reporting this bug!

I didn't find any environment that doesn't have this issue.

Do you know if this issue reproduces on Windows?

tommcdon commented 1 month ago

Do you know if this issue reproduces on Windows?

Ah, never mind this question, as the repro is very specific to Linux.

Do you know if the callback/debugging issue is specific to the libpulse API (e.g. does a standalone repo that uses a callback from C++ to C# on Linux reproduce the issue)? I am curious whether something specific to libpulse is causing the problem, for example a difference in calling convention.

lindexi commented 1 month ago

@tommcdon I can reproduce this issue with @walterlv's repo on my Linux system. I'm sure it's not a libpulse bug, because I can also reproduce it with https://github.com/Haltroy/CefGlue

I couldn't test on Windows because I failed to run libpulse there, so I don't know whether it can be reproduced on Windows.

tommcdon commented 1 month ago

Possible duplicate of https://github.com/dotnet/runtime/issues/102767. @hoyosjs

walterlv commented 1 month ago

Thanks to my friend @kkwpsv, who helped me find out more about this issue.

@tommcdon This issue is quite different from #102767:

  1. This issue is related to the dotnet debugger on Linux (and only on Linux).
  2. This issue might not be related to the callback, but I can't tell whether it is or not.

Let's see more details here.

  1. Run the app under a dotnet debugger (I was using the Linux version of JetBrains Rider) and let it stop at a breakpoint.
  2. Attach lldb to the running process.
  3. Continue the app in the dotnet debugger.
  4. Continue the app in the lldb debugger.

Then,

  1. List all threads in lldb with thread backtrace all; thread 3 (.NET EventPipe) is stopped with signal SIGTRAP.
  2. Resume the app, and thread 3 receives signal SIGSEGV: address not mapped to object (fault address: 0xbafa13a0).
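
For anyone who wants a non-interactive snapshot of the thread states, something like the following might work. It is a sketch under assumptions: dump_thread_backtraces is a hypothetical helper name. It relies only on lldb's --batch, -p (attach to pid), and -o (run one command) options. It captures the threads at the moment of attach, so it does not replace the continue/resume sequence described above.

```shell
#!/bin/sh
# Hypothetical helper: attach lldb to a running process, print the
# backtrace of every thread (the stop reason is shown per thread),
# then detach so the process keeps running.
dump_thread_backtraces() {
    pid="$1"
    lldb --batch -p "$pid" -o "thread backtrace all" -o "detach"
}
```

Usage would be dump_thread_backtraces <pid>, with the pid of the running TraceBreakpointTrapDemo process.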

The stack traces are shown as follows:

image

image

[UnmanagedCallersOnly]
private static unsafe void Callback(byte* sourceId, int isEnabled, byte level,
    long matchAnyKeywords, long matchAllKeywords, Interop.Advapi32.EVENT_FILTER_DESCRIPTOR* filterData, void* callbackContext)
{
    EventPipeEventProvider _this = (EventPipeEventProvider)GCHandle.FromIntPtr((IntPtr)callbackContext).Target!;
    if (_this._eventProvider.TryGetTarget(out EventProvider? target))
    {
        _this.ProviderCallback(target, sourceId, isEnabled, level, matchAnyKeywords, matchAllKeywords, filterData);
    }
}
tommcdon commented 1 month ago

@hoyosjs

mdh1418 commented 2 weeks ago

Hi @walterlv and @lindexi,

We haven't been able to repro the exact issue from your repros yet, but the SIGSEGV for the EventPipeEventProvider callback looks eerily similar to https://github.com/dotnet/runtime/issues/80666#issuecomment-2249343314, where the _gchandle used in the callback had been freed before the callback completes.

If the dotnet debugger is hitting the same EventPipeEventProvider Callback issue, then there is a partial fix already merged through https://github.com/dotnet/runtime/pull/106040 and a second PR https://github.com/dotnet/runtime/pull/106156 that is open

lindexi commented 2 weeks ago

@mdh1418 Thank you. Which Visual Studio version and dotnet version did you use? And did you debug the application running on Linux?

Can I test a daily dotnet build that includes https://github.com/dotnet/runtime/pull/106040 ?

tommcdon commented 2 weeks ago

Which Visual Studio version and dotnet version did you use? And did you debug the application running on Linux?

We used the latest version of the C# extension in VS Code

Can I test a daily dotnet build that includes https://github.com/dotnet/runtime/pull/106040 ?

Yes - the daily builds from https://github.com/dotnet/sdk/blob/main/documentation/package-table.md contain the fix.

kkwpsv commented 1 week ago

@tommcdon I tested again with https://aka.ms/dotnet/9.0.1xx/daily/dotnet-sdk-linux-x64.tar.gz. There is no SIGSEGV now, but the process still exits with SIGTRAP.

I debugged it with lldb. Here's the output: image