Particular / NServiceBus.SqlServer

SQL Server Transport for NServiceBus
https://docs.particular.net/nservicebus/sqlserver/
Other
42 stars 36 forks source link

Messages may be lost in cases of database connectivity problems #954

Closed MarcWils closed 2 years ago

MarcWils commented 2 years ago

Symptoms

Messages are removed from the queue but message handlers are not executed for them and these messages are not audited -- they are lost leaving no trace.

Who's affected

Affected are possibly all users of SQL Server transport but the issue has only been confirmed in one environment that uses Azure SQL.

Original description

We're in the middle of investigating a performance issue. The Azure SQL database containing all of our queue tables hits a 100% CPU spike for a long period of time. This results is many timeouts, transactions errors and so. That is to be expected and something we're trying to figure out. However, it seems that some messages were "lost" during that time. There's no sign these messages were put aside for delayed delivery or moved to the error queue.

We use a recent version of NServiceBus v7 and NServiceBus.SqlServer v6.3 (System.Data.SqlClient). Endpoints are running in transaction mode TransactionScope and use native delayed delivery. Exceptions bubble up to the NServiceBus infrastructure triggering immediate and possibly delayed retries.

Message loss is something we've never seen before and only seem to occasionally happen during the Azure SQL CPU spike. As I said, there are many timeout exceptions, transaction in doubt exceptions and so on. But I can't relate them to the messages we've lost.

When digging through the logs and code, I've noticed this issue #848. Our logs contains multiple warning "peekCommand returned a null value." during the time of database problems. I assume SQL actually returns a result for the peek query, but the result is lost. I was wondering if the same issue might occur when receiving a message from the input queue. A SqlDataReader is used to read the message: https://github.com/Particular/NServiceBus.SqlServer/blob/ca40c86287618aefa76bce8c45df93aa5cd6bea9/src/NServiceBus.Transport.SqlServer/Queuing/TableBasedQueue.cs#L83-L88

That's basically the same construction the ExecuteScalar method uses: https://github.com/microsoft/referencesource/blob/5697c29004a34d80acdaf5742d7e699022c64ecd/System.Data/System/Data/SqlClient/SqlCommand.cs#L1162-L1171

What if there's an actual scenario where the .ReadAsync() method returns false, but SQL did delete and return a message from the queue. The transaction will committed and no warning/error is logged: https://github.com/Particular/NServiceBus.SqlServer/blob/ca40c86287618aefa76bce8c45df93aa5cd6bea9/src/NServiceBus.Transport.SqlServer/Receiving/ProcessWithTransactionScope.cs#L28-L35

Wouldn't it be safer to not commit the transaction if no record was returned from SQL?

tmasternak commented 2 years ago

Hi @MarcWils let me start by trying to answer:

What if there's an actual scenario where the .ReadAsync() method returns false, but SQL did delete and return a message from the queue

the exact logic inside System.Data is quite complex but here is a simplified overview. Please note that this is incomplete, and based on my understanding of how the package works.

ExecuteReaderAsync gets a connection and an instance of TdsParser to create SqlDataReader. The command first executed Tds RPC and later passes the response stream to TdsParser. The parser is responsible for parsing bytes received from an SQL instance according to the TDS protocol specification. Each piece of data (TDS Packet) that the parser operates on starts with the header information that indicates the type of the packet and its length. The length part here is crucial, as it enables the parser to know if there are some outstanding data that it has not yet received.

Even when ReadAsync() returns false there is still TDS Packet coming back from the server that acknowledges command execution and indicates no data has been produced when the command was executed. In a scenario when the connection would fail exactly when the command was getting back data from the server the parser (based on the length fields) would make sure that no partial results are returned.

I hope the above puts some light on the behavior.

Could you share a bit more on the message loss scenario? How do you detect that messages are being lost?

MarcWils commented 2 years ago

Hi @tmasternak

Thanks for the response. I agree, the logic in System.Data is hard to follow. Especially if the CommandBehavior.SequentialAccess is used which again changes the behavior of the SqlDataReader.

We have a system where messages flow from one endpoint to another. All endpoints use a recent version of NServiceBus v7 and NServiceBus.SqlServer v6.3 (System.Data.SqlClient). They run with transaction level Transaction Scope and use native delayed delivery. The outbox feature is not used. All of our transport queue tables are stored in a single Azure SQL database. Some endpoints access business data in another Azure SQL database. Transactions are escalated to elastic transactions.

We've had a production problem with our transport database hitting a 100% CPU spike. This seems to be resolved by rebuilding the queue tables and/or updating statistics. During the spike we got a lot of SQL timeouts and transaction errors. That's to be expected. However, we also noticed message flows were randomly stopped. This happened in endpoints using only the transport database as well in endpoints which also access a business database. It's as if some messages were randomly dropped from the queue. The message loss seem to be unrelated to the logged exceptions.

One explanation would be that the receival of the incoming messages (DELETE) was committed, but the INSERT of outgoing messages failed.

Another explanation is related to #848. I would love to hear in which conditions the ExecuteScalarAsync returns null. That would help trying to understand if the same issue might occur when receiving a message. The "peekCommand returned a null value." message is namely one of the many exceptions that occurred during the CPU spike. If you take a look at Dapper, they always consume the SqlDataReader properly to ensure any exceptions flow to .NET (https://github.com/DapperLib/Dapper/issues/1210).

So if there's a scenario in which the SqlDataReader would return false (partial read, connection error, ...), this would silently fail. I can't prove there's an issue, but I can't rule it out.

Anyway, in the context of defensive programming, it would be safer to only commit the transaction if a message was received from the queue.

SzymonPobiega commented 2 years ago

@MarcWils I took liberty to edit the issue description and added some general information on top and slightly reworded the title since the issue is now confirmed.