Particular / NServiceBus.SqlServer

SQL Server Transport for NServiceBus
https://docs.particular.net/nservicebus/sqlserver/

The message pump failed to stop within the time allowed (30s) #595

Closed davidvedvick closed 3 years ago

davidvedvick commented 4 years ago

When our SQL Servers are patched, the NServiceBus services attempt to shut down, but they leave this message and remain running:

The message pump failed to stop within the time allowed (30s)

SQL Server version: 14.0.3257.3
NServiceBus version: 7.1.10
NServiceBus.SqlServer version: 4.3.0
DotNetCore version: 2.2.8

This is running inside a Linux docker container.

Any thoughts on what we can do to resolve this issue?

SzymonPobiega commented 4 years ago

Can you share your endpoint configuration and the log files via our support channels? The message pump is probably stopping because the circuit breaker triggered endpoint shutdown. In order to stop, the pump waits for all the processing threads to finish their message processing. If those threads are waiting on their TCP connections to the (now unavailable) SQL Server, that might take longer than 30 seconds.

davidvedvick commented 4 years ago

Apologies for my delayed response. Our configuration looks like this:

endpointConfiguration.UseTransport<SqlServerTransport>()
    .ConnectionString(() => "connection string")
    .DefaultSchema("SchemaName")
    .Transactions(TransportTransactionMode.SendsAtomicWithReceive)
    .UseSchemaForEndpoint(
        "QueueName",
        "SchemaName")
    .TimeToWaitBeforeTriggeringCircuitBreaker(TimeSpan.FromHours(3))
    .UseNativeDelayedDelivery()
    .DisableTimeoutManagerCompatibility();

endpointConfiguration.Recoverability()
    .Delayed(settings => settings.NumberOfRetries(3)
        .TimeIncrease(TimeSpan.FromSeconds(10)))
    .Immediate(settings => settings.NumberOfRetries(5));

Keeping any private information out of the connection string, the database connection string looks like this:

Server=server_name;User id=user_name;Password=password;database=dbname;Enlist=False;

SzymonPobiega commented 4 years ago

OK. Then we need the log files from the time when it happened. Can you open a support case and attach them? Looking at your config it seems you don't have a custom critical error handler, so the endpoint should just continue running when the SQL Server is down. We need the log files to first figure out why it is shutting down, and then we can pinpoint why the shutdown takes so long.

davidvedvick commented 4 years ago

Sorry, I didn't think to include the critical error handler. Here is what ours looks like:

private async Task OnCriticalError(ICriticalErrorContext context)
{
    // TODO: decide if stopping the endpoint and exiting the process is the best response to a critical error
    // https://docs.particular.net/nservicebus/hosting/critical-errors
    // and consider setting up service recovery
    // https://docs.particular.net/nservicebus/hosting/windows-service#installation-restart-recovery
    try
    {
        await this.EnsureShutdown();
    }
    finally
    {
        FailFast($"Critical error, shutting down: {context.Error}", context.Exception);
    }
}

// This method was added in to work around the message pump failure.
private async Task EnsureShutdown()
{
    var stopTask = this.StopEndpoint();

    var timeoutSpan = TimeSpan.FromMinutes(1);
    var timeoutTask = Task.Delay(timeoutSpan);
    var winningTask = await Task.WhenAny(stopTask, timeoutTask);

    if (winningTask == timeoutTask)
    {
        throw new TimeoutException($"Timed out after waiting {timeoutSpan} for the endpoint to stop");
    }
}

private async Task StopEndpoint()
{
    try
    {
        await (this.sqlEndpoint?.Stop() ?? Task.CompletedTask);
    }
    catch (Exception ex)
    {
        FailFast("Failed to stop correctly.", ex);
    }
}
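
For reference, a handler like this would presumably be wired up during endpoint configuration; a minimal sketch, assuming the standard NServiceBus 7 DefineCriticalErrorAction API and the endpointConfiguration instance from the configuration above:

// Sketch: register the custom critical error handler shown above.
// In NServiceBus 7, DefineCriticalErrorAction takes a Func<ICriticalErrorContext, Task>.
endpointConfiguration.DefineCriticalErrorAction(this.OnCriticalError);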

Here are the log entries before the log entry above:

{"@t":"2020-04-19T14:29:24.3229949Z","@m":"Initiating shutdown.","@i":"30c71628","SourceContext":"NServiceBus.RunningEndpointInstance","ThreadId":80}
{"@t":"2020-04-19T14:29:24.3105531Z","@m":"Failed to peek SchemaName@[nservicebus]@[DBName]","@i":"9879b297","@l":"Fatal","@x":"System.Data.SqlClient.SqlException (0x80131904): SHUTDOWN is in progress.\nLogin failed for user 'user'.\nCannot continue the execution because the session is in the kill state.\nA severe error occurred on the current command.  The results, if any, should be discarded.\n   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)\n   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)\n   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)\n   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)\n   at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)\n   at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)\n   at System.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)\n   at System.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)\n   at NServiceBus.Transport.SQLServer.QueuePeeker.Peek(TableBasedQueue inputQueue, RepeatedFailuresOverTimeCircuitBreaker circuitBreaker, CancellationToken cancellationToken)\nClientConnectionId:4abcbd09-bb13-4fb9-adae-8badb1e3f682\nError Number:6005,State:1,Class:14","SourceContext":"NServiceBus","ThreadId":80}
{"@t":"2020-04-19T14:29:24.3079375Z","@m":"The circuit breaker for \"SqlPeek\" will now be triggered","@i":"bf1bfe33","@l":"Warning","0":"SqlPeek","SourceContext":"NServiceBus.Transport.SQLServer.RepeatedFailuresOverTimeCircuitBreaker","ThreadId":80}

Again, apologies for the poor formatting. I'll open a support case with the same information.

SzymonPobiega commented 4 years ago

@danrien the issue is caused by the behavior of the ADO.NET library, where some API calls don't return an error immediately if the remote server appears to be down. When the 30-second timeout passes, the message pump stops. The message processing tasks that were spawned before the pump stopped will likely fail with an exception. This should not leave your system in an incorrect state, as all message processing is transactional.

In version 7 of NServiceBus you can safely remove your critical error handler, unless you have resources other than the SQL database that should trigger endpoint shutdown. The SQL Server transport for NServiceBus 7 is designed to handle SQL Server unavailability gracefully and to reconnect as soon as the database is up again.

davidvedvick commented 4 years ago

Sorry @SzymonPobiega, is the suggestion then that the critical error handler is causing the issue?

SzymonPobiega commented 4 years ago

@danrien well, the underlying issue is caused by the inability to interrupt the ongoing SQL commands. Here's what is happening:

These runaway tasks are going to fail anyway, because their TCP connections are not going to recover. Messages won't be lost, as the transactions they were using were rolled back the moment you took the SQL Server down for maintenance. They are just zombies.

So if you don't expect any specific critical errors for which you need to shut down your endpoint to prevent damage, you can safely remove the critical error handler. In the meantime we will attempt to improve the shutdown process and look for ways in which we can interrupt these message processing tasks to make the behavior a bit cleaner.

davidvedvick commented 4 years ago

Ok, thanks for the feedback, we'll take that and simplify our NServiceBus host... Our issue, however, is the NServiceBus shutdown hanging on SQL Server restart (we have since validated that it hangs on any restart, not just when the SQL Server is being patched): the process hosting the service never exits, with the last message being the one in this issue's subject. Using the EnsureShutdown method does help us force a shutdown of the process, but it seems to be working around a deeper issue. I'm happy to continue using EnsureShutdown if that is the best path forward for now.

We've also noticed that in our production environment, where we have an AG, when the SQL Servers go through their patching process the service arms the circuit breaker but recovers and closes it once the AG is restored. So SQL Server errors don't seem to have this hanging effect in some SQL Server configurations, and luckily the most important one for us.

SzymonPobiega commented 4 years ago

the NServiceBus shutdown hanging on SQL Server restart

So the issue is that when you restart SQL Server, the NServiceBus message pump stops and does not resume processing once SQL Server is up again?

davidvedvick commented 4 years ago

Yes, that seems to be the case. It seems to hang when the circuit breaker is armed, perhaps at the bolded point?

If the existing connections remain open, is it possible they won't recover until the service fails and restarts?

It is likely there is an issue in our error handling as well, as I don't think we're properly ending the process when we shut down the endpoint at this stage:

The critical error handler tells the endpoint to shut down

But, fixing that, I think we would still have the issue where it won't recover because the existing connections remain open?

SzymonPobiega commented 4 years ago

@danrien I tried to reproduce this behavior with Azure SQL but was not able to. There is no option to stop an Azure SQL instance, so I used scale-up/scale-down to force the instance down. In all cases the endpoint lost connectivity and armed the circuit breaker, but was able to resume work as soon as the Azure SQL instance was back again.

Can you send us the logs from one such event to the support e-mail address? We will try to investigate what could be wrong. I suspect it has something to do with connection pool handling.

DavidBoike commented 3 years ago

@danrien @SzymonPobiega I believe this will be fixed in SQL Transport 7.x (for NServiceBus 8.x) because we are implementing cancellation tokens throughout NServiceBus and all related components. Calling endpoint.Stop() will start the graceful shutdown attempt, but the 30s default will be gone and you'll be able to pass a CancellationToken with any timeout you want, say 5 seconds. After those 5 seconds, the cancellation token will flip and SQL Server should almost immediately abort all active commands. SQL sometimes returns SqlException instead of OperationCanceledException, so there may be some error log output, but then the endpoint will legitimately shut down rather quickly.
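
A rough sketch of what that shutdown call could look like, assuming the NServiceBus 8 Stop(CancellationToken) overload described above and a placeholder endpoint variable:

// Sketch of the NServiceBus 8 shutdown model described above.
// After 5 seconds the token is cancelled and in-flight SQL commands should be aborted.
using var shutdownTimeout = new CancellationTokenSource(TimeSpan.FromSeconds(5));
await endpoint.Stop(shutdownTimeout.Token);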

Unfortunately it required breaking changes in NServiceBus 8 and SQL Transport to make that happen, so there's really no way to backport this to current versions, and NServiceBus 8 isn't ready to ship yet either, but at least a fix is coming.

Until then, Environment.FailFast() can be used to kill the process.
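
For example, a rough sketch of that workaround (the endpoint variable and the one-minute timeout are placeholders, not a prescribed API):

// Sketch: force the process down if a graceful Stop() does not finish in time.
var stopTask = endpoint.Stop();
if (await Task.WhenAny(stopTask, Task.Delay(TimeSpan.FromMinutes(1))) != stopTask)
{
    Environment.FailFast("Endpoint did not stop within the allowed time.");
}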

Is that enough to close this issue?

DavidBoike commented 3 years ago

Seeing no objection after 24 hours, I'll close the issue. Feel free to reopen if warranted.

davidvedvick commented 3 years ago

@DavidBoike yes I think it is. We've managed to work around it since we opened this issue, but cancellation tokens throughout will make the solution much cleaner. Thanks!