Open edwardneal opened 2 weeks ago
While not strictly related to this, two comments on the ApiShould.TestSqlCommandCancel
test: this is always going to be a somewhat flaky test.
The flaky part of the test methodology is as follows:
Besides using threads directly, the current code enables Thread A to run immediately, while Thread B sleeps for 0.5s. This introduces a race condition where sometimes the command could have finished executing before the cancellation runs. In such a case, the original Assert.Throws call would fail (because ExecuteNonQuery/ExecuteReader completes without throwing an InvalidOperationException.) There was also an edge case where ExecuteNonQuery/ExecuteReader might throw a SqlException rather than an InvalidOperationException. In both of these cases, the exceptions went unhandled, the thread crashed and the testhost restarted.
I've tried to rework this coordination slightly with an interlock and a tweak to the approach:
This reduces the race condition, but doesn't eliminate it: calling ExecuteNonQuery and ExecuteReader implicitly resets the SqlCommand's cancellation status, so it's possible that thread B might call Cancel, then thread A might reset this. We can never guarantee that a single Cancel call (or X sequential Cancel calls) will be enough to actually cancel the command; this is why Thread B now calls Cancel in a loop.
A second comment on the same TestSqlCommandCancel test: following my changes, there's another couple of errors which look worrying:
I think the managed SNI exception message is probably "The request failed to run because the batch is aborted, this can be caused by abort signal sent from client, or another request is running in the same session, which makes the session busy." If so, it's another instance of #26. The native SNI exception message looks like a bug though.
While this test can fail in two ways, the failure now shows up as a test failure rather than a crash & restart of testhost. I can revert my third commit (the changes to TestSqlCommandCancel) if it's better discussed in an issue.
@edwardneal Thanks again for working on this. I really really appreciate any help with improving our build process.
Once we get another review, I'll be happy to merge this 👍🚀
It seems there are test failures:
Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.5;Database=NORTHWIND;UID=sa"···, value: "ExecuteNonQuery") [1 s]
EXEC : error Message: [/mnt/vss/_work/1/s/build.proj]
System.AggregateException : One or more errors occurred. (Assert.Equal() Failure: Strings differ
↓ (pos 0)
Expected: "Operation cancelled by user."
Actual: "The request failed to run because the bat"···
↑ (pos 0))
---- Assert.Equal() Failure: Strings differ
↓ (pos 0)
Expected: "Operation cancelled by user."
Actual: "The request failed to run because the bat"···
↑ (pos 0)
Stack Trace:
at System.Threading.Tasks.Task.WaitAllCore(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2134
at InvokeStub_ApiShould.TestSqlCommandCancel(Object, Span`1)
at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
...
Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"···, value: "ExecuteReader") [1 s]
EXEC : error Message: [/mnt/vss/_work/1/s/build.proj]
System.AggregateException : One or more errors occurred. (Assert.Equal() Failure: Strings differ
↓ (pos 0)
Expected: "Operation cancelled by user."
Actual: "The request failed to run because the bat"···
↑ (pos 0))
---- Assert.Equal() Failure: Strings differ
↓ (pos 0)
Expected: "Operation cancelled by user."
Actual: "The request failed to run because the bat"···
↑ (pos 0)
Stack Trace:
at System.Threading.Tasks.Task.WaitAllCore(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2080
at InvokeStub_ApiShould.TestSqlCommandCancel(Object, Span`1)
at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
...
It also appears this test failed on Windows too in first round and passed on rerunning, so I'm a little skeptic of this change, though they look promising..
Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"..., value: "ExecuteReader") [976 ms]
##[error]EXEC(0,0): Error Message:
EXEC : error Message: [D:\a\_work\1\s\build.proj]
System.AggregateException : One or more errors occurred.
---- Assert.Equal() Failure
Γåô (pos 0)
Expected: Operation cancelled by user.
Actual: A severe error occurred on the current co┬╖┬╖┬╖
Γåæ (pos 0)
Stack Trace:
at System.Threading.Tasks.Task.WaitAll(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Threading.Tasks.Task.WaitAll(Task[] tasks, Int32 millisecondsTimeout)
at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2027
Thanks @cheenamalhotra, that's fair. The main objective of this PR is to prevent the exceptions thrown by the secondary thread from killing the testhost process. At the moment, a flaky test kills the process and CI restarts it; this adds 10-15 minutes to the build time. With this change, the flaky test fails, the process returns a non-zero exit code and CI restarts it.
I hadn't spotted the second restart criteria earlier, so this PR doesn't reduce the build time as I first thought. It does make some previously-failing tests visible though, and hopefully some future work can start moving progressively more tests into a test run which doesn't have that restart criteria, allowing us to shift the remaining ones towards restarting the individual test rather than the entire testhost.
The exception message which starts with "The request failed to run because the bat" is a simple enough fix, I'm pretty sure it's just another instance of efcore#29861 and will change the test to account for that. I'm more concerned about the exception with the corrupted message string though. That looks a little like the native SNI's SNIGetLastError
method is overwriting or freeing the memory behind the SNI_Error.errorMessage
member which it returns to TdsParser.
Edit: looks like there are a few additional possible exception messages which can report cancellation. That's a little frustrating - I'd hoped running it until it failed in VS would catch them all. I'll keep looking...
This is an attempt to stabilise the automated tests slightly. They sometimes crash and need to be restarted; I've recently been able to reproduce the same behaviour in my local testing.
Several tests instantiate a number of threads, and these threads don't handle exceptions (such as failed test assertions.) If these assertions fail, the testhost exits.
I've converted these threads to tasks, specifying TaskCreationOptions.LongRunning. This will force the task scheduler to give the task its own thread, and it'll mean that the exceptions are propagated back to the caller. The test will thus fail, but the testhost will stay running.
This should hopefully mean that ManualTests consistently takes about 15-20 minutes to run, rather than occasionally taking 30-40 minutes.