Open viewizard opened 1 year ago
CC @alpencolt
Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.
Author: | viewizard |
---|---|
Assignees: | - |
Labels: | `area-Diagnostics-coreclr`, `untriaged` |
Milestone: | - |
cc @jkotas
@jkotas did you have a chance to take a look?
Looks like a regression, it may be related to #71661. @tommcdon Do you see this problem in VS debugger tests?
This may be related to https://github.com/dotnet/runtime/pull/72705 and if 'yes' it could be a duplicate to https://github.com/dotnet/runtime/issues/72703. I am not seeing any issues in our Visual Studio func-eval abort testing on .NET 7 or .NET 8 running against Linux and macOS.
@viewizard would it be possible to share a reproducible test case of the debuggee function failing to abort?
I just figure out, that in case some simple test all works fine on Linux amd64 with VSCode debugger and our debugger too. For example, simple test like:
using System;
namespace test
{
class Program
{
static int eval_test()
{
//System.Diagnostics.Debugger.NotifyOfCrossThreadDependency();
System.Threading.Thread.Sleep(20000);
return 5;
}
static void Main()
{
int i = 0;
i ++;
}
}
}
with evaluation for eval_test()
at line i++
will properly abort since timed out.
In case of our debugger, it will also abort without any issues in case NotifyOfCrossThreadDependency
will uncomment.
But it fails on our complex eval tests we do at lines (note, Label.Checkpoint
blocks is parts related of control program, that interact with debugger, other code - debuggee program related code):
https://github.com/Samsung/netcoredbg/blob/master/test-suite/MITestVariables/Program.cs#L1197
https://github.com/Samsung/netcoredbg/blob/master/test-suite/MITestEvaluate/Program.cs#L1453
https://github.com/Samsung/netcoredbg/blob/master/test-suite/VSCodeTestVariables/Program.cs#L1315
https://github.com/Samsung/netcoredbg/blob/master/test-suite/VSCodeTestEvaluate/Program.cs#L1577
all the time, Abort()
return S_OK:
https://github.com/Samsung/netcoredbg/blob/fb63fd9a834a3f640bc573827bf50c142ee5039b/src/debugger/evalwaiter.cpp#L241
https://github.com/Samsung/netcoredbg/blob/fb63fd9a834a3f640bc573827bf50c142ee5039b/src/debugger/evalwaiter.cpp#L183
as you could see we do a lot of evals before abort-related test, probably this is related (unfortunately I was not able to reproduce this with some simple code in VSCode). If you would like, I could provide instruction how you could use our debugger on Linux in order to run this tests.
Thanks for taking a look @viewizard! Would you mind explaining what GetAndCheckValue
and CheckErrorAtRequest
does? I'm guessing it performs a func-eval. So if I am reading the code correctly, it seems the order events are:
stGetInt()
stGetInt(333)
getInt()
TestTimeOut()
Is this the TestTimeOut()
function the one that fails to Abort()? It seems it sleeps for 10 seconds then returns. How long does the test wait for the abort to occur?
If we selectively comment out different parts of the test, does the Abort eventually begin to succeed? If 'yes' then I suggest systematically un-commenting test code until we find the previous step that causes the Abort to fail. It may be that something before the func-eval abort is causing the debugger or the runtime to be destabilized and leads to the func-eval abort failure.
Yes, GetAndCheckValue
and CheckErrorAtRequest
send to debugger EvaluateRequest
and wait for reply, for GetAndCheckValue
we compare it with ExpectedResult
and ExpectedType
.
it seems the order events are: func-eval stGetInt() func-eval stGetInt(333) func-eval getInt() func-eval TestTimeOut()
You absolutely right, but I was not able to reproduce abort-related issue with only this evals tests, this looks strange for me, since we do only this tests on current breakpoint stop.
Is this the TestTimeOut() function the one that fails to Abort()? It seems it sleeps for 10 seconds then returns. How long does the test wait for the abort to occur?
debugger will call Abort()
after 5 sec - https://github.com/Samsung/netcoredbg/blob/fb63fd9a834a3f640bc573827bf50c142ee5039b/src/debugger/evalwaiter.cpp#L167
but Abort()
do nothing and eval-related code continue executing, debugger wait for eval ends here:
https://github.com/Samsung/netcoredbg/blob/fb63fd9a834a3f640bc573827bf50c142ee5039b/src/debugger/evalwaiter.cpp#L193
In case VSCodeTestEvaluate, after 10 seconds (System.Threading.Thread.Sleep(10000);
ends) eval will complete (ICorDebugManagedCallback::EvalComplete()
will be called with result) and debugger provide "result" = 5, that should not happens (should ends without result).
If we selectively comment out different parts of the test, does the Abort eventually begin to succeed? If 'yes' then I suggest systematically un-commenting test code until we find the previous step that causes the Abort to fail. It may be that something before the func-eval abort is causing the debugger or the runtime to be destabilized and leads to the func-eval abort failure.
Sure, I will try in monday find some related eval block.
@tommcdon I am sorry, looks like in friday I was not careful and did a mistake in testing and analyze. I used same simple test code again:
using System;
namespace test
{
class Program
{
static int TestTimeOut()
{
System.Threading.Thread.Sleep(10000);
return 5;
}
static void Main()
{
System.Console.WriteLine("Hello World!");
}
}
}
Test sequence:
1) Set and stop at breakpoint in line System.Console.WriteLine("Hello World!");
2) Evaluate TestTimeOut()
;
After 5 sec. of eval execution (at this point execution inside System.Threading.Thread.Sleep(10000)
code)
1) debugger call ICorDebugController::Stop()
, that return S_OK
;
2) debugger call ICorDebugEval::Abort()
, that return S_OK
;
3) debugger call ICorDebugController::Continue()
, that return S_OK
(at this point we run all managed threads for sure);
Now I counted seconds and looks like only Sleep() was aborted, not all evaluation code, next code execution starts immediately, not after 5 seconds
4) ICorDebugManagedCallback::EvalComplete()
callback called;
5) debugger check eval status ICorDebugEval::GetResult()
and get S_OK
;
6) eval result (ICorDebugValue), same as should be in case abort never called;
After that I changed code of TestTimeOut()
method to:
static int TestTimeOut()
{
System.Threading.Thread.Sleep(10000);
System.Threading.Thread.Sleep(10000);
return 5;
}
And abort now works just as should:
After 5 sec. of eval execution (at this point execution inside System.Threading.Thread.Sleep(10000)
code)
1) debugger call ICorDebugController::Stop()
, that return S_OK
;
2) debugger call ICorDebugEval::Abort()
, that return S_OK
;
3) debugger call ICorDebugController::Continue()
, that return S_OK
(at this point we run all managed threads for sure);
4) ICorDebugManagedCallback::EvalComplete()
callback called;
5) debugger check eval status ICorDebugEval::GetResult()
and get CORDBG_S_FUNC_EVAL_ABORTED
.
I also play around TestTimeOut()
code changes and found, that eval abort broken with some simple changes like
static int TestTimeOut()
{
System.Threading.Thread.Sleep(10000);
int i = 5;
return 5;
}
But in the same time abort works proper with simple code moved to separate method:
static void TestMethod()
{
int i = 5;
}
static int TestTimeOut()
{
System.Threading.Thread.Sleep(10000);
TestMethod();
return 5;
}
Hope this helps.
Thanks for the details @viewizard! Adding @hoyosjs. I believe that we have no way of aborting the native call to Sleep which is a native call into the OS, and so my guess is that we trap the return back into the method and then execute the abort when the instruction pointer re-enters the runtime (when the sleep returns). My guess is that in the scenario where there is more code after the first call to sleep, the abort prevents further execution in the method. However, when there is only one call to sleep, the func-eval "successfully" completes because there is nothing to abort. This is just a guess at this point and would need investigation to confirm.
But the point is - previously (with .NET SDK 3.1, 5.0, 6,0) all works just fine, I mean, we have same tests for all .NET SDK releases (3.1, 5.0, 6.0) with our debugger (we use same debugger binary + latest dbgshim for tests with different .NET SDK) and it fail now only for 7.0 due to abort issue. Is Sleep()
code was changed in 7.0?
I also found one more issue with abort in 7.0. In case Sleep()
is huge, for examle:
static int TestTimeOut()
{
System.Threading.Thread.Sleep(6000000);
return 5;
}
After 5 sec. of eval execution (at this point execution inside System.Threading.Thread.Sleep(6000000) code)
ICorDebugManagedCallback::EvalComplete()
callback was called only after Sleep()
is fully complited, and return with "result" = 5 (ICorDebugEval::GetResult()
is S_OK
), this probably mean abort don't work at all for this case. This is really sad.I am still worry, why Sleep(6000000)
and Sleep(10000)
works in so different way.
Works fine in 6.0.
I believe that we have no way of aborting the native call to Sleep which is a native call into the OS,
Thread.Sleep
is abortable on CoreCLR. On Windows, the abort is achieved by sending APC to the sleeping thread. On Unix, the PAL has Windows APC emulator to do the same. Relevant code: https://github.com/dotnet/runtime/blob/766d8c184693eea3b1d72f4de672e0f40d2085ea/src/coreclr/vm/threads.cpp#L4177-L4207 .
Thanks @jkotas and @viewizard!! It sounds like we should try an experiment where we add/enable APC and debugger logging to see 1) is the call to ICorDebugController::Stop
actually successful (I am curious if we are still stuck in a native call to sleep or is the process synchronized), 2) is there an APC cancellation request to ClrSleepEx, 3) if 'yes', is it actually being aborted. @viewizard would you be interested in giving that a try?
Sorry, have no time for deep CoreCLR 7.0 investigation now, must care about next netcoredbg releases in Tizen SDK (as you probably know, last Tizen use CoreCLR 6.0, we only added 7.0 to our CI for debugger testing and share results and fast analyze with you).
But, probably @gbalykov and @alpencolt have more info about our future plans.
@tommcdon we can try to run our debugger with patched runtime with enabled logging if you share such patch.
We believe this issue might be fixed by https://github.com/dotnet/runtime/pull/89781. @viewizard Would you mind testing the fix? I will be included with .NET 8 RC1, or can be tested against a build of main containing the fix.
This issue has been marked needs-author-action
and may be missing some important information.
This issue has been automatically marked no-recent-activity
because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity
.
This issue will now be closed since it had been marked no-recent-activity
but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.
We'll retest when .net 8 rc1 is released, please keep this open for now
I've reopened it as per @gbalykov's request.
@janvorli Thank you!
@tommcdon I tested SDK 8.0.100-rc.1 from https://dotnet.microsoft.com/en-us/download/dotnet/8.0 on Linux amd64, but issue still here.
This issue (or something very similar) also appears to reproduce on VS when targetting WSL. It does not repro when the same steps run on Windows. My repro:
namespace Idle
{
internal class Program
{
static void Main(string[] args)
{
Console.WriteLine("Hello, World!");
}
static int Sleep()
{
Thread.Sleep(100000);
return 5;
}
}
}
Observed results
Expected results (Also the observed results when targetting Windows)
After 5 seconds the watch window updates to say "Evaluation timed out" and the yellow arrow again points to the opening curly brace of Main(). Stepping works as expected and immediately advances to the next line.
I did some diagnosis with the stress log and there is pretty good evidence that on Windows the Sleep() function was interrupted as expected but on Linux it was not.
STRESS LOG:
facilitiesToLog = 0x82004000
levelToLog = 10
MaxLogSizePerThread = 0x20000 (131072)
MaxTotalLogSize = 0x2000000 (33554432)
CurrentTotalLogChunk = 16
ThreadsWithLogs = 13
Clock frequency = 0.010 GHz
Start time 15:56:44
Last message time 15:56:58
Total elapsed time 13.628 sec
<lots of messages omitted>
2db0 13.608285200 : `APPDOMAIN` in Thread::ReadyForAbort PreparingAbort
2db0 13.608254200 : `APPDOMAIN` Thread::HandleThreadAbort throwing abort for 1
2db0 13.608253400 : `EH` TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3E00C0 (Idle.Program.Main(System.String[])) ('<n/a>'), offset 23, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)
2db0 13.608250600 : `EH` TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3E00D8 (Idle.Program.Sleep()) ('<n/a>'), offset 29, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)
2db0 13.608248600 : `EH` TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3B15F0 (System.Threading.Thread.Sleep(Int32)) ('<n/a>'), offset 11, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)
2db0 13.608242700 : `EH` TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3B1068 (System.Threading.Thread.SleepInternal(Int32)) ('<n/a>'), offset 0, Frame:000000A1E677E1B8, FrameVtable = 00007FFA3D129E38 (coreclr!HelperMethodFrame::`vftable')
938c 13.593637800 : `EH` TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3E00C0 (Idle.Program.Main(System.String[])) ('<n/a>'), offset 23, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)
938c 13.593634600 : `EH` TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3E00D8 (Idle.Program.Sleep()) ('<n/a>'), offset 29, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)
938c 13.593632900 : `EH` TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3B15F0 (System.Threading.Thread.Sleep(Int32)) ('<n/a>'), offset 11, Frame:0000000000000000, FrameVtable = 0000000000000000 (Unknown VTable)
938c 13.593629700 : `EH` TAStackCrawlCallBack: STACKCRAWL method:00007FF9DD3B1068 (System.Threading.Thread.SleepInternal(Int32)) ('<n/a>'), offset 0, Frame:000000A1E677E1B8, FrameVtable = 00007FFA3D129E38 (coreclr!HelperMethodFrame::`vftable')
938c 13.593292800 : `APPDOMAIN` Mark Thread 000002341BD57780 Thread Id = 1 for abort (type 1)
938c 13.593273800 : `SYNC`APPDOMAIN` UserAbort Thread 000002341BD57780 Thread Id = 1
STRESS LOG:
facilitiesToLog = 0x82004000
levelToLog = 10
MaxLogSizePerThread = 0x20000 (131072)
MaxTotalLogSize = 0x2000000 (33554432)
CurrentTotalLogChunk = 12
ThreadsWithLogs = 9
Clock frequency = 1.000 GHz
Start time Last message time Total elapsed time 22.962 sec
THREAD TIMESTAMP FACILITY MESSAGE
ID (sec from start)
--------------------------------------------------------------------------------------
e04 22.961774767 : `STARTUP` ******* DllMain THREAD_DETACH called Thread dying *******
da4 21.754708632 : `APPDOMAIN` Mark Thread 0x55d9ccff98e0 Thread Id = 1 for abort (type 2)
da4 21.754708132 : `SYNC`APPDOMAIN` UserAbort Thread 0x55d9ccff98e0 Thread Id = 1
e05 18.761860847 : `ALWAYS` SetupThread managed Thread 0x7f8ff000a1a0 Thread Id = 8
------------ Last message from thread e05 -----------
e04 18.754119146 : `ALWAYS` SetupThread managed Thread 0x7f8fec009d20 Thread Id = 3
------------ Last message from thread e04 -----------
da4 15.243784346 : `APPDOMAIN` Mark Thread 0x55d9ccff98e0 Thread Id = 1 for abort (type 1)
da4 15.243782046 : `SYNC`APPDOMAIN` UserAbort Thread 0x55d9ccff98e0 Thread Id = 1
On Windows we can see the call to "Mark Thread for Abort" was quickly followed by Thread::HandleAbort() running inside the thread running Sleep() because the sleep was interrupted. On Linux there is no indication that interrupt occured, after an additional 5 seconds the debugger attempted a rude abort (type 2 abort) which also produced no result.
Based on the stepping delay observed experimentally the 100 second Sleep() never stopped on Linux, VS just gave up waiting on it to complete when the dialog was displayed. It certainly looks like we've got a reproable issue where Sleep() was expected to be interrupted on Linux but wasn't. Next step would be more investigation of the interrupt mechanism on Linux to determine why it isn't triggering in this case.
Description
During netcoredbg testing with .NET SDK 7.0 (dotnet-sdk-7.0.200-linux-x64.tar.gz). We found, that tests related to evaluation abort (timeout during execution and
System.Diagnostics.Debugger.NotifyOfCrossThreadDependency()
usage in evaluated code) are failed. We found, thatICorDebugEval::Abort
andICorDebugEval2::RudeAbort
returnS_OK
at call, but don't really abort evaluation in 7.0 (we also tested latest .NET SDK 3.1, 5.0, 6.0 and don't found any issues).Reproduction Steps
Call
ICorDebugEval::Abort
in debugger. For example, create property with getter withSystem.Diagnostics.Debugger.NotifyOfCrossThreadDependency()
and evaluate it. In this case runtime will callICorDebugManagedCallback3::CustomNotification
that forced debugger to useICorDebugEval::Abort
(CoreCLR debug API don't really provide any other way you could abort code execution during evaluation).Expected behavior
ICorDebugEval::Abort
orICorDebugEval2::RudeAbort
, it will returnS_OK
.ICorDebugManagedCallback::EvalComplete
orICorDebugManagedCallback::EvalException
called,ICorDebugEval::GetResult()
returnCORDBG_S_FUNC_EVAL_ABORTED
.Actual behavior
ICorDebugEval::Abort
orICorDebugEval2::RudeAbort
, it will returnS_OK
.ICorDebugManagedCallback::EvalComplete
orICorDebugManagedCallback::EvalException
called,ICorDebugEval::GetResult()
returnS_OK
. In case evaluated code never ends (for example, about deadlock),ICorDebugManagedCallback::EvalComplete
orICorDebugManagedCallback::EvalException
never called.Regression?
It works fine on Linux amd64 with .NET SDK 3.1, 5.0 and 6.0.
Known Workarounds
No response
Configuration
Ubuntu 20.04 amd64, .NET SDK 7.0 (dotnet-sdk-7.0.200-linux-x64.tar.gz).
Other information
No response