dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.95k stars 4.65k forks source link

[Linux, amd64] ICorDebugEval::Abort are broken. #82422

Open viewizard opened 1 year ago

viewizard commented 1 year ago

Description

During netcoredbg testing with .NET SDK 7.0 (dotnet-sdk-7.0.200-linux-x64.tar.gz). We found, that tests related to evaluation abort (timeout during execution and System.Diagnostics.Debugger.NotifyOfCrossThreadDependency() usage in evaluated code) are failed. We found, that ICorDebugEval::Abort and ICorDebugEval2::RudeAbort return S_OK at call, but don't really abort evaluation in 7.0 (we also tested latest .NET SDK 3.1, 5.0, 6.0 and don't found any issues).

Reproduction Steps

Call ICorDebugEval::Abort in debugger. For example, create property with getter with System.Diagnostics.Debugger.NotifyOfCrossThreadDependency() and evaluate it. In this case runtime will call ICorDebugManagedCallback3::CustomNotification that forced debugger to use ICorDebugEval::Abort (CoreCLR debug API don't really provide any other way you could abort code execution during evaluation).

Expected behavior

  1. During evaluation call ICorDebugEval::Abort or ICorDebugEval2::RudeAbort, it will return S_OK.
  2. Evaluated code don't continue execution.
  3. ICorDebugManagedCallback::EvalComplete or ICorDebugManagedCallback::EvalException called, ICorDebugEval::GetResult() return CORDBG_S_FUNC_EVAL_ABORTED.

Actual behavior

  1. During evaluation call ICorDebugEval::Abort or ICorDebugEval2::RudeAbort, it will return S_OK.
  2. Evaluated code continue execution.
  3. After evaluated code execution ends ICorDebugManagedCallback::EvalComplete or ICorDebugManagedCallback::EvalException called, ICorDebugEval::GetResult() return S_OK. In case evaluated code never ends (for example, about deadlock), ICorDebugManagedCallback::EvalComplete or ICorDebugManagedCallback::EvalException never called.

Regression?

It works fine on Linux amd64 with .NET SDK 3.1, 5.0 and 6.0.

Known Workarounds

No response

Configuration

Ubuntu 20.04 amd64, .NET SDK 7.0 (dotnet-sdk-7.0.200-linux-x64.tar.gz).

Other information

No response

viewizard commented 1 year ago

CC @alpencolt

ghost commented 1 year ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

Issue Details
### Description During netcoredbg testing with .NET SDK 7.0 (dotnet-sdk-7.0.200-linux-x64.tar.gz). We found, that tests related to evaluation abort (timeout during execution and `System.Diagnostics.Debugger.NotifyOfCrossThreadDependency()` usage in evaluated code) are failed. We found, that `ICorDebugEval::Abort` and `ICorDebugEval2::RudeAbort` return `S_OK` at call, but don't really abort evaluation in 7.0 (we also tested latest .NET SDK 3.1, 5.0, 6.0 and don't found any issues). ### Reproduction Steps Call `ICorDebugEval::Abort` in debugger. For example, create property with getter with `System.Diagnostics.Debugger.NotifyOfCrossThreadDependency()` and evaluate it. In this case runtime will call `ICorDebugManagedCallback3::CustomNotification` that forced debugger to use `ICorDebugEval::Abort` (CoreCLR debug API don't really provide any other way you could abort code execution during evaluation). ### Expected behavior 1. During evaluation call `ICorDebugEval::Abort` or `ICorDebugEval2::RudeAbort`, it will return `S_OK`. 2. Evaluated code don't continue execution. 3. `ICorDebugManagedCallback::EvalComplete` or `ICorDebugManagedCallback::EvalException` called. ### Actual behavior 1. During evaluation call `ICorDebugEval::Abort` or `ICorDebugEval2::RudeAbort`, it will return `S_OK`. 2. Evaluated code continue execution. 3. After evaluated code execution ends `ICorDebugManagedCallback::EvalComplete` or `ICorDebugManagedCallback::EvalException` called. In case evaluated code never ends (for example, about deadlock), `ICorDebugManagedCallback::EvalComplete` or `ICorDebugManagedCallback::EvalException` never called. ### Regression? It works fine on Linux amd64 with .NET SDK 3.1, 5.0 and 6.0. ### Known Workarounds _No response_ ### Configuration Ubuntu 20.04 amd64, .NET SDK 7.0 (dotnet-sdk-7.0.200-linux-x64.tar.gz). ### Other information _No response_
Author: viewizard
Assignees: -
Labels: `area-Diagnostics-coreclr`, `untriaged`
Milestone: -
gbalykov commented 1 year ago

cc @jkotas

gbalykov commented 1 year ago

@jkotas did you have a chance to take a look?

jkotas commented 1 year ago

Looks like a regression, it may be related to #71661. @tommcdon Do you see this problem in VS debugger tests?

tommcdon commented 1 year ago

This may be related to https://github.com/dotnet/runtime/pull/72705 and if 'yes' it could be a duplicate to https://github.com/dotnet/runtime/issues/72703. I am not seeing any issues in our Visual Studio func-eval abort testing on .NET 7 or .NET 8 running against Linux and macOS.
@viewizard would it be possible to share a reproducible test case of the debuggee function failing to abort?

viewizard commented 1 year ago

I just figure out, that in case some simple test all works fine on Linux amd64 with VSCode debugger and our debugger too. For example, simple test like:

using System;

namespace test
{
    class Program
    {
        static int eval_test()
        {
            //System.Diagnostics.Debugger.NotifyOfCrossThreadDependency();
            System.Threading.Thread.Sleep(20000);
            return 5;
        }

        static void Main()
        {
            int i = 0;
            i ++;
        }
    }
}

with evaluation for eval_test() at line i++ will properly abort since timed out. In case of our debugger, it will also abort without any issues in case NotifyOfCrossThreadDependency will uncomment.

But it fails on our complex eval tests we do at lines (note, Label.Checkpoint blocks is parts related of control program, that interact with debugger, other code - debuggee program related code): https://github.com/Samsung/netcoredbg/blob/master/test-suite/MITestVariables/Program.cs#L1197 https://github.com/Samsung/netcoredbg/blob/master/test-suite/MITestEvaluate/Program.cs#L1453 https://github.com/Samsung/netcoredbg/blob/master/test-suite/VSCodeTestVariables/Program.cs#L1315 https://github.com/Samsung/netcoredbg/blob/master/test-suite/VSCodeTestEvaluate/Program.cs#L1577 all the time, Abort() return S_OK: https://github.com/Samsung/netcoredbg/blob/fb63fd9a834a3f640bc573827bf50c142ee5039b/src/debugger/evalwaiter.cpp#L241 https://github.com/Samsung/netcoredbg/blob/fb63fd9a834a3f640bc573827bf50c142ee5039b/src/debugger/evalwaiter.cpp#L183

as you could see we do a lot of evals before abort-related test, probably this is related (unfortunately I was not able to reproduce this with some simple code in VSCode). If you would like, I could provide instruction how you could use our debugger on Linux in order to run this tests.

tommcdon commented 1 year ago

Thanks for taking a look @viewizard! Would you mind explaining what GetAndCheckValue and CheckErrorAtRequest does? I'm guessing it performs a func-eval. So if I am reading the code correctly, it seems the order events are:

  1. func-eval stGetInt()
  2. func-eval stGetInt(333)
  3. func-eval getInt()
  4. func-eval TestTimeOut()

Is this the TestTimeOut() function the one that fails to Abort()? It seems it sleeps for 10 seconds then returns. How long does the test wait for the abort to occur?

https://github.com/Samsung/netcoredbg/blob/db69338cf1606d8d327de0eaaf1694d8463af135/test-suite/VSCodeTestEvaluate/Program.cs#L801-L805

If we selectively comment out different parts of the test, does the Abort eventually begin to succeed? If 'yes' then I suggest systematically un-commenting test code until we find the previous step that causes the Abort to fail. It may be that something before the func-eval abort is causing the debugger or the runtime to be destabilized and leads to the func-eval abort failure.

viewizard commented 1 year ago

Yes, GetAndCheckValue and CheckErrorAtRequest send to debugger EvaluateRequest and wait for reply, for GetAndCheckValue we compare it with ExpectedResult and ExpectedType.

it seems the order events are: func-eval stGetInt() func-eval stGetInt(333) func-eval getInt() func-eval TestTimeOut()

You absolutely right, but I was not able to reproduce abort-related issue with only this evals tests, this looks strange for me, since we do only this tests on current breakpoint stop.

Is this the TestTimeOut() function the one that fails to Abort()? It seems it sleeps for 10 seconds then returns. How long does the test wait for the abort to occur?

debugger will call Abort() after 5 sec - https://github.com/Samsung/netcoredbg/blob/fb63fd9a834a3f640bc573827bf50c142ee5039b/src/debugger/evalwaiter.cpp#L167 but Abort() do nothing and eval-related code continue executing, debugger wait for eval ends here: https://github.com/Samsung/netcoredbg/blob/fb63fd9a834a3f640bc573827bf50c142ee5039b/src/debugger/evalwaiter.cpp#L193 In case VSCodeTestEvaluate, after 10 seconds (System.Threading.Thread.Sleep(10000); ends) eval will complete (ICorDebugManagedCallback::EvalComplete() will be called with result) and debugger provide "result" = 5, that should not happens (should ends without result).

If we selectively comment out different parts of the test, does the Abort eventually begin to succeed? If 'yes' then I suggest systematically un-commenting test code until we find the previous step that causes the Abort to fail. It may be that something before the func-eval abort is causing the debugger or the runtime to be destabilized and leads to the func-eval abort failure.

Sure, I will try in monday find some related eval block.

viewizard commented 1 year ago

@tommcdon I am sorry, looks like in friday I was not careful and did a mistake in testing and analyze. I used same simple test code again:

using System;

namespace test
{
    class Program
    {
        static int TestTimeOut()
        {
            System.Threading.Thread.Sleep(10000);
            return 5;
        }

        static void Main()
        {
            System.Console.WriteLine("Hello World!");
        }
    }
}

Test sequence: 1) Set and stop at breakpoint in line System.Console.WriteLine("Hello World!"); 2) Evaluate TestTimeOut();

After 5 sec. of eval execution (at this point execution inside System.Threading.Thread.Sleep(10000) code) 1) debugger call ICorDebugController::Stop(), that return S_OK; 2) debugger call ICorDebugEval::Abort(), that return S_OK; 3) debugger call ICorDebugController::Continue(), that return S_OK (at this point we run all managed threads for sure); Now I counted seconds and looks like only Sleep() was aborted, not all evaluation code, next code execution starts immediately, not after 5 seconds 4) ICorDebugManagedCallback::EvalComplete() callback called; 5) debugger check eval status ICorDebugEval::GetResult() and get S_OK; 6) eval result (ICorDebugValue), same as should be in case abort never called;

After that I changed code of TestTimeOut() method to:

        static int TestTimeOut()
        {
            System.Threading.Thread.Sleep(10000);
            System.Threading.Thread.Sleep(10000);
            return 5;
        }

And abort now works just as should:

After 5 sec. of eval execution (at this point execution inside System.Threading.Thread.Sleep(10000) code) 1) debugger call ICorDebugController::Stop(), that return S_OK; 2) debugger call ICorDebugEval::Abort(), that return S_OK; 3) debugger call ICorDebugController::Continue(), that return S_OK (at this point we run all managed threads for sure); 4) ICorDebugManagedCallback::EvalComplete() callback called; 5) debugger check eval status ICorDebugEval::GetResult() and get CORDBG_S_FUNC_EVAL_ABORTED.

I also play around TestTimeOut() code changes and found, that eval abort broken with some simple changes like

        static int TestTimeOut()
        {
            System.Threading.Thread.Sleep(10000);
            int i = 5;
            return 5;
        }

But in the same time abort works proper with simple code moved to separate method:

        static void TestMethod()
        {
            int i = 5;
        }

        static int TestTimeOut()
        {
            System.Threading.Thread.Sleep(10000);
            TestMethod();
            return 5;
        }

Hope this helps.

tommcdon commented 1 year ago

Thanks for the details @viewizard! Adding @hoyosjs. I believe that we have no way of aborting the native call to Sleep which is a native call into the OS, and so my guess is that we trap the return back into the method and then execute the abort when the instruction pointer re-enters the runtime (when the sleep returns). My guess is that in the scenario where there is more code after the first call to sleep, the abort prevents further execution in the method. However, when there is only one call to sleep, the func-eval "successfully" completes because there is nothing to abort. This is just a guess at this point and would need investigation to confirm.

viewizard commented 1 year ago

But the point is - previously (with .NET SDK 3.1, 5.0, 6,0) all works just fine, I mean, we have same tests for all .NET SDK releases (3.1, 5.0, 6.0) with our debugger (we use same debugger binary + latest dbgshim for tests with different .NET SDK) and it fail now only for 7.0 due to abort issue. Is Sleep() code was changed in 7.0?

viewizard commented 1 year ago

I also found one more issue with abort in 7.0. In case Sleep() is huge, for examle:

        static int TestTimeOut()
        {
            System.Threading.Thread.Sleep(6000000);
            return 5;
        }

After 5 sec. of eval execution (at this point execution inside System.Threading.Thread.Sleep(6000000) code)

  1. debugger call ICorDebugController::Stop(), that return S_OK;
  2. debugger call ICorDebugEval::Abort(), that return S_OK;
  3. debugger call ICorDebugController::Continue(), that return S_OK (at this point we run all managed threads for sure);
  4. UPD2: related to eval ICorDebugManagedCallback::EvalComplete() callback was called only after Sleep() is fully complited, and return with "result" = 5 (ICorDebugEval::GetResult() is S_OK), this probably mean abort don't work at all for this case. This is really sad.

I am still worry, why Sleep(6000000) and Sleep(10000) works in so different way.

Works fine in 6.0.

jkotas commented 1 year ago

I believe that we have no way of aborting the native call to Sleep which is a native call into the OS,

Thread.Sleep is abortable on CoreCLR. On Windows, the abort is achieved by sending APC to the sleeping thread. On Unix, the PAL has Windows APC emulator to do the same. Relevant code: https://github.com/dotnet/runtime/blob/766d8c184693eea3b1d72f4de672e0f40d2085ea/src/coreclr/vm/threads.cpp#L4177-L4207 .

tommcdon commented 1 year ago

Thanks @jkotas and @viewizard!! It sounds like we should try an experiment where we add/enable APC and debugger logging to see 1) is the call to ICorDebugController::Stop actually successful (I am curious if we are still stuck in a native call to sleep or is the process synchronized), 2) is there an APC cancellation request to ClrSleepEx, 3) if 'yes', is it actually being aborted. @viewizard would you be interested in giving that a try?

viewizard commented 1 year ago

Sorry, have no time for deep CoreCLR 7.0 investigation now, must care about next netcoredbg releases in Tizen SDK (as you probably know, last Tizen use CoreCLR 6.0, we only added 7.0 to our CI for debugger testing and share results and fast analyze with you).

But, probably @gbalykov and @alpencolt have more info about our future plans.

gbalykov commented 1 year ago

@tommcdon we can try to run our debugger with patched runtime with enabled logging if you share such patch.

tommcdon commented 1 year ago

We believe this issue might be fixed by https://github.com/dotnet/runtime/pull/89781. @viewizard Would you mind testing the fix? I will be included with .NET 8 RC1, or can be tested against a build of main containing the fix.

ghost commented 1 year ago

This issue has been marked needs-author-action and may be missing some important information.

ghost commented 1 year ago

This issue has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

ghost commented 1 year ago

This issue will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.

gbalykov commented 1 year ago

We'll retest when .net 8 rc1 is released, please keep this open for now

janvorli commented 1 year ago

I've reopened it as per @gbalykov's request.

gbalykov commented 1 year ago

@janvorli Thank you!

viewizard commented 11 months ago

@tommcdon I tested SDK 8.0.100-rc.1 from https://dotnet.microsoft.com/en-us/download/dotnet/8.0 on Linux amd64, but issue still here.

noahfalk commented 4 months ago

This issue (or something very similar) also appears to reproduce on VS when targetting WSL. It does not repro when the same steps run on Windows. My repro:

  1. Write console app:
namespace Idle
{
    internal class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello, World!");
        }

        static int Sleep()
        {
            Thread.Sleep(100000);
            return 5;
        }
    }
}
  1. Set breakpoint on the opening curly brace of Main()
  2. Run the app under the debugger targetting WSL
  3. When the breakpoint is hit, type "Sleep()" into the watch window and hit enter

Observed results

  1. After 5 seconds a dialog appears warning the operation is taking a long time
  2. After 10 seconds a dialog appears warning the operation timed out:

image

  1. I clicked OK and after the dialog disappeared I clicked 'Step Over'. VS goes back into run mode.
  2. After ~100 seconds the step completes and VS highlights the Sleep() line of code: image
  3. Continuing to step from there worked quickly and stepping past the end of the Sleep() function caused the app to run to completion.

Expected results (Also the observed results when targetting Windows)

After 5 seconds the watch window updates to say "Evaluation timed out" and the yellow arrow again points to the opening curly brace of Main(). Stepping works as expected and immediately advances to the next line.

I did some diagnosis with the stress log and there is pretty good evidence that on Windows the Sleep() function was interrupted as expected but on Linux it was not.

Windows stress log excerpt:

STRESS LOG:
    facilitiesToLog  = 0x82004000
    levelToLog       = 10
    MaxLogSizePerThread = 0x20000 (131072)
    MaxTotalLogSize = 0x2000000 (33554432)
    CurrentTotalLogChunk = 16
    ThreadsWithLogs  = 13
    Clock frequency  = 0.010 GHz
    Start time         15:56:44
    Last message time  15:56:58
    Total elapsed time 13.628 sec

    <lots of messages omitted>

    2db0  13.608285200 : `APPDOMAIN`          in Thread::ReadyForAbort  PreparingAbort

2db0  13.608254200 : `APPDOMAIN`          Thread::HandleThreadAbort throwing abort for 1

2db0  13.608253400 : `EH`                 TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3E00C0 (Idle.Program.Main(System.String[])) ('<n/a>'), offset 23, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)

2db0  13.608250600 : `EH`                 TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3E00D8 (Idle.Program.Sleep()) ('<n/a>'), offset 29, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)

2db0  13.608248600 : `EH`                 TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3B15F0 (System.Threading.Thread.Sleep(Int32)) ('<n/a>'), offset 11, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)

2db0  13.608242700 : `EH`                 TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3B1068 (System.Threading.Thread.SleepInternal(Int32)) ('<n/a>'), offset 0, Frame:000000A1E677E1B8, FrameVtable = 00007FFA3D129E38 (coreclr!HelperMethodFrame::`vftable')

938c  13.593637800 : `EH`                 TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3E00C0 (Idle.Program.Main(System.String[])) ('<n/a>'), offset 23, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)

938c  13.593634600 : `EH`                 TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3E00D8 (Idle.Program.Sleep()) ('<n/a>'), offset 29, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)

938c  13.593632900 : `EH`                 TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3B15F0 (System.Threading.Thread.Sleep(Int32)) ('<n/a>'), offset 11, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)

938c  13.593629700 : `EH`                 TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3B1068 (System.Threading.Thread.SleepInternal(Int32)) ('<n/a>'), offset 0, Frame:000000A1E677E1B8, FrameVtable = 00007FFA3D129E38 (coreclr!HelperMethodFrame::`vftable')

938c  13.593292800 : `APPDOMAIN`          Mark Thread 000002341BD57780 Thread Id = 1 for abort (type 1)

938c  13.593273800 : `SYNC`APPDOMAIN`     UserAbort Thread 000002341BD57780 Thread Id = 1

Linux stress log excerpt

STRESS LOG:
    facilitiesToLog  = 0x82004000
    levelToLog       = 10
    MaxLogSizePerThread = 0x20000 (131072)
    MaxTotalLogSize = 0x2000000 (33554432)
    CurrentTotalLogChunk = 12
    ThreadsWithLogs  = 9
    Clock frequency  = 1.000 GHz
    Start time             Last message time      Total elapsed time 22.962 sec

    THREAD  TIMESTAMP     FACILITY                              MESSAGE
  ID  (sec from start)
--------------------------------------------------------------------------------------
 e04  22.961774767 : `STARTUP`            ******* DllMain THREAD_DETACH called Thread dying *******

 da4  21.754708632 : `APPDOMAIN`          Mark Thread 0x55d9ccff98e0 Thread Id = 1 for abort (type 2)

 da4  21.754708132 : `SYNC`APPDOMAIN`     UserAbort Thread 0x55d9ccff98e0 Thread Id = 1

 e05  18.761860847 : `ALWAYS`             SetupThread  managed Thread 0x7f8ff000a1a0 Thread Id = 8

------------ Last message from thread e05 -----------
 e04  18.754119146 : `ALWAYS`             SetupThread  managed Thread 0x7f8fec009d20 Thread Id = 3

------------ Last message from thread e04 -----------
 da4  15.243784346 : `APPDOMAIN`          Mark Thread 0x55d9ccff98e0 Thread Id = 1 for abort (type 1)

 da4  15.243782046 : `SYNC`APPDOMAIN`     UserAbort Thread 0x55d9ccff98e0 Thread Id = 1

Comparison

On Windows we can see the call to "Mark Thread for Abort" was quickly followed by Thread::HandleAbort() running inside the thread running Sleep() because the sleep was interrupted. On Linux there is no indication that interrupt occured, after an additional 5 seconds the debugger attempted a rude abort (type 2 abort) which also produced no result.

Based on the stepping delay observed experimentally the 100 second Sleep() never stopped on Linux, VS just gave up waiting on it to complete when the dialog was displayed. It certainly looks like we've got a reproable issue where Sleep() was expected to be interrupted on Linux but wasn't. Next step would be more investigation of the interrupt mechanism on Linux to determine why it isn't triggering in this case.