Reduce automated test crashes

edwardneal commented 2 weeks ago

This is an attempt to stabilise the automated tests slightly. They sometimes crash and need to be restarted; I've recently been able to reproduce the same behaviour in my local testing.

Several tests instantiate a number of threads, and these threads don't handle exceptions (such as failed test assertions.) If these assertions fail, the testhost exits.

I've converted these threads to tasks, specifying TaskCreationOptions.LongRunning. This will force the task scheduler to give the task its own thread, and it'll mean that the exceptions are propagated back to the caller. The test will thus fail, but the testhost will stay running.

This should hopefully mean that ManualTests consistently takes about 15-20 minutes to run, rather than occasionally taking 30-40 minutes.

edwardneal commented 2 weeks ago

While not strictly related to this, two comments on the ApiShould.TestSqlCommandCancel test: this is always going to be a somewhat flaky test.

The flaky part of the test methodology is as follows:

Create a SqlCommand
Start Thread A, which calls SqlCommand.ExecuteNonQuery/.ExecuteReader on it. Expect it to return an exception indicating that it's been cancelled
Start Thread B, which calls SqlCommand.Cancel

Besides using threads directly, the current code enables Thread A to run immediately, while Thread B sleeps for 0.5s. This introduces a race condition where sometimes the command could have finished executing before the cancellation runs. In such a case, the original Assert.Throws call would fail (because ExecuteNonQuery/ExecuteReader completes without throwing an InvalidOperationException.) There was also an edge case where ExecuteNonQuery/ExecuteReader might throw a SqlException rather than an InvalidOperationException. In both of these cases, the exceptions went unhandled, the thread crashed and the testhost restarted.

I've tried to rework this coordination slightly with an interlock and a tweak to the approach:

Thread A now waits for a ManualResetEvent controlled by Thread B. Thread B will only set this MRE directly before calling SqlCommand.Cancel
Thread B now hammers calls to SqlCommand.Cancel until it receives a signal on an MRE controlled by Thread A. Thread A sets this signal when the ExecuteNonQuery/ExecuteReader call has finished

This reduces the race condition, but doesn't eliminate it: calling ExecuteNonQuery and ExecuteReader implicitly resets the SqlCommand's cancellation status, so it's possible that thread B might call Cancel, then thread A might reset this. We can never guarantee that a single Cancel call (or X sequential Cancel calls) will be enough to actually cancel the command; this is why Thread B now calls Cancel in a loop.

A second comment on the same TestSqlCommandCancel test: following my changes, there's another couple of errors which look worrying:

Managed SNI, on Linux

``` [xUnit.net 00:01:29.38] Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"···, value: "ExecuteNonQuery") [FAIL] [xUnit.net 00:01:29.39] System.AggregateException : One or more errors occurred. (Assert.Equal() Failure: Strings differ [xUnit.net 00:01:29.39] ↓ (pos 0) [xUnit.net 00:01:29.39] Expected: "Operation cancelled by user." [xUnit.net 00:01:29.39] Actual: "The request failed to run because the bat"··· [xUnit.net 00:01:29.39] ↑ (pos 0)) [xUnit.net 00:01:29.39] ---- Assert.Equal() Failure: Strings differ [xUnit.net 00:01:29.39] ↓ (pos 0) [xUnit.net 00:01:29.39] Expected: "Operation cancelled by user." [xUnit.net 00:01:29.39] Actual: "The request failed to run because the bat"··· [xUnit.net 00:01:29.39] ↑ (pos 0) [xUnit.net 00:01:29.39] Stack Trace: [xUnit.net 00:01:29.39] at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions) [xUnit.net 00:01:29.39] at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken) [xUnit.net 00:01:29.39] at System.Threading.Tasks.Task.Wait() [xUnit.net 00:01:29.39] /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(2012,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) [xUnit.net 00:01:29.39] at InvokeStub_ApiShould.TestSqlCommandCancel(Object, Span`1) [xUnit.net 00:01:29.39] at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture) [xUnit.net 00:01:29.39] ----- Inner Stack Trace ----- [xUnit.net 00:01:29.39] /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(3160,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.Thread_ExecuteNonQuery(Object state) [xUnit.net 00:01:29.39] at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) [xUnit.net 00:01:29.39] --- End of stack trace from previous location --- [xUnit.net 00:01:29.39] at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) [xUnit.net 00:01:29.39] at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread) ```

Native SNI, on Windows

``` [xUnit.net 00:01:29.12] Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"..., value: "ExecuteReader") [FAIL] [xUnit.net 00:01:29.12] System.AggregateException : One or more errors occurred. [xUnit.net 00:01:29.12] ---- Assert.Equal() Failure [xUnit.net 00:01:29.12] Γåô (pos 0) [xUnit.net 00:01:29.12] Expected: Operation cancelled by user. [xUnit.net 00:01:29.12] Actual: A severe error occurred on the current co┬╖┬╖┬╖ [xUnit.net 00:01:29.12] Γåæ (pos 0) [xUnit.net 00:01:29.13] Stack Trace: [xUnit.net 00:01:29.13] at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions) [xUnit.net 00:01:29.13] at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken) [xUnit.net 00:01:29.13] at System.Threading.Tasks.Task.Wait() [xUnit.net 00:01:29.13] /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(2013,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) [xUnit.net 00:01:29.13] ----- Inner Stack Trace ----- [xUnit.net 00:01:29.13] /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(3138,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.Thread_ExecuteReader(Object state) [xUnit.net 00:01:29.13] at System.Threading.Tasks.Task.Execute() ```

I think the managed SNI exception message is probably "The request failed to run because the batch is aborted, this can be caused by abort signal sent from client, or another request is running in the same session, which makes the session busy." If so, it's another instance of #26. The native SNI exception message looks like a bug though.

While this test can fail in two ways, the failure now shows up as a test failure rather than a crash & restart of testhost. I can revert my third commit (the changes to TestSqlCommandCancel) if it's better discussed in an issue.

benrr101 commented 1 week ago

@edwardneal Thanks again for working on this. I really really appreciate any help with improving our build process.

Once we get another review, I'll be happy to merge this 👍🚀

cheenamalhotra commented 1 week ago

It seems there are test failures:

Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.5;Database=NORTHWIND;UID=sa"···, value: "ExecuteNonQuery") [1 s]
EXEC : error Message:  [/mnt/vss/_work/1/s/build.proj]
     System.AggregateException : One or more errors occurred. (Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0))
  ---- Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0)
    Stack Trace:
       at System.Threading.Tasks.Task.WaitAllCore(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
     at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2134
     at InvokeStub_ApiShould.TestSqlCommandCancel(Object, Span`1)
     at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
...

Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"···, value: "ExecuteReader") [1 s]
EXEC : error Message:  [/mnt/vss/_work/1/s/build.proj]
     System.AggregateException : One or more errors occurred. (Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0))
  ---- Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0)
    Stack Trace:
       at System.Threading.Tasks.Task.WaitAllCore(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
     at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2080
     at InvokeStub_ApiShould.TestSqlCommandCancel(Object, Span`1)
     at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
...

cheenamalhotra commented 1 week ago

It also appears this test failed on Windows too in first round and passed on rerunning, so I'm a little skeptic of this change, though they look promising..

Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"..., value: "ExecuteReader") [976 ms]
##[error]EXEC(0,0): Error Message: 
EXEC : error Message:  [D:\a\_work\1\s\build.proj]
     System.AggregateException : One or more errors occurred.
  ---- Assert.Equal() Failure
            Γåô (pos 0)
  Expected: Operation cancelled by user.
  Actual:   A severe error occurred on the current co┬╖┬╖┬╖
            Γåæ (pos 0)
    Stack Trace:
       at System.Threading.Tasks.Task.WaitAll(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks, Int32 millisecondsTimeout)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
     at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2027

edwardneal commented 6 days ago

Thanks @cheenamalhotra, that's fair. The main objective of this PR is to prevent the exceptions thrown by the secondary thread from killing the testhost process. At the moment, a flaky test kills the process and CI restarts it; this adds 10-15 minutes to the build time. With this change, the flaky test fails, the process returns a non-zero exit code and CI restarts it.

I hadn't spotted the second restart criteria earlier, so this PR doesn't reduce the build time as I first thought. It does make some previously-failing tests visible though, and hopefully some future work can start moving progressively more tests into a test run which doesn't have that restart criteria, allowing us to shift the remaining ones towards restarting the individual test rather than the entire testhost.

The exception message which starts with "The request failed to run because the bat" is a simple enough fix, I'm pretty sure it's just another instance of efcore#29861 and will change the test to account for that. I'm more concerned about the exception with the corrupted message string though. That looks a little like the native SNI's SNIGetLastError method is overwriting or freeing the memory behind the SNI_Error.errorMessage member which it returns to TdsParser.

Edit: looks like there are a few additional possible exception messages which can report cancellation. That's a little frustrating - I'd hoped running it until it failed in VS would catch them all. I'll keep looking...

dotnet / SqlClient

Reduce automated test crashes #2968