Eventuous / eventuous

Event Sourcing library for .NET
https://eventuous.dev
Apache License 2.0

Transient error drops the subscription. (Connection reset by peer) #307

Open PehrGit opened 7 months ago

PehrGit commented 7 months ago

Describe the bug
We noticed that a subscription had stopped processing. The cause was a SqlException whose innermost exception was a SocketException, with the message:

"A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 35 - An internal exception was caught) Unable to read data from the transport connection: Connection reset by peer. Connection reset by peer".

I believe this should be recognized as a transient error and retried, like the other error numbers listed in https://github.com/Eventuous/eventuous/blob/b7352bb3b6565dd974b74a35655d782cea08dc08/src/SqlServer/src/Eventuous.SqlServer/Subscriptions/SqlServerSubscriptionBase.cs#L61

Expected behavior
The error is recognized as transient and the message is retried.

Screenshots
N/A

Additional context
There is no additional logging because this didn't occur during the processing of a message. It happened in the middle of the night, when nobody was using the system, so we assume it was just a hiccup on the Azure side.

Full stack trace:

Microsoft.Data.SqlClient.SqlException:
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderAsync (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Eventuous.SqlServer.Subscriptions.SqlServerSubscriptionBase`1+<PollingQuery>d__15.MoveNext (Eventuous.SqlServer, Version=0.15.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Eventuous.SqlServer.Subscriptions.SqlServerSubscriptionBase`1+<PollingQuery>d__15.MoveNext (Eventuous.SqlServer, Version=0.15.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Eventuous.SqlServer.Subscriptions.SqlServerSubscriptionBase`1+<PollingQuery>d__15.MoveNext (Eventuous.SqlServer, Version=0.15.0.0, Culture=neutral, PublicKeyToken=null)
Inner exception System.IO.IOException handled at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderAsync:
   at System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.ThrowException (System.Net.Sockets, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult (System.Net.Sockets, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at Microsoft.Data.SqlClient.SNI.SNINetworkStream+<ReadAsync>d__1.MoveNext (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Data.SqlClient.SNI.SslOverTdsStream+<ReadAsync>d__5.MoveNext (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Net.Security.SslStream+<EnsureFullTlsFrameAsync>d__186`1.MoveNext (System.Net.Security, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Net.Security.SslStream+<ReadAsyncInternal>d__188`1.MoveNext (System.Net.Security, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Data.SqlClient.SNI.SNISslStream+<ReadAsync>d__1.MoveNext (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
Inner exception System.Net.Sockets.SocketException handled at System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.ThrowException:

alexeyzimarev commented 7 months ago

I will accept a PR that adds this error to the list you mentioned. I am not sure that the SQL error number will be 35, though.

alexeyzimarev commented 7 months ago

Look also here https://github.com/dotnet/SqlClient/issues/2103#issuecomment-1764206103, it seems that on Windows it will produce 10053, but on Linux it's impossible to figure out.
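
Since the SQL error number may not be reliable across platforms, one alternative is to inspect the inner exception chain instead. The sketch below is only an illustration of that idea: `TransientNetworkErrors` and `IsConnectionReset` are hypothetical names, not part of Eventuous or SqlClient. It assumes the SocketException survives as an inner exception, as it does in the stack trace above, and that its `SocketErrorCode` is populated, which has not been verified against SqlClient on Linux.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical helper for illustration; not part of Eventuous or SqlClient.
public static class TransientNetworkErrors {
    // Walks the InnerException chain and reports whether any link is a
    // SocketException signalling a reset (10054) or aborted (10053) connection.
    public static bool IsConnectionReset(Exception exception) {
        for (var e = exception; e != null; e = e.InnerException) {
            if (e is SocketException socketException
                && (socketException.SocketErrorCode == SocketError.ConnectionReset
                    || socketException.SocketErrorCode == SocketError.ConnectionAborted)) {
                return true;
            }
        }

        return false;
    }
}
```

A check like this could sit next to the existing list of transient SQL error numbers, so that a match on either one leads to a retry.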

alexeyzimarev commented 7 months ago

Strangely enough, the only thing that should have happened is that the subscription would drop and resubscribe. Can you confirm that the subscription just silently died? Do you have any health checks set up using the provided diagnostics methods? https://eventuous.dev/docs/subscriptions/subs-diagnostics/#health-checks

PehrGit commented 7 months ago

> Strangely enough, the only thing that should have happened is that the subscription would drop and resubscribe. Can you confirm that the subscription just silently died? Do you have any health checks set up using the provided diagnostics methods? https://eventuous.dev/docs/subscriptions/subs-diagnostics/#health-checks

I've tested again by running the app locally and stopping the SQL Server instance. I see the "Dropped" message in the logs, but the subscription doesn't resubscribe, and the health check keeps reporting "Healthy".

It makes sense, as the Resubscribe() method is only called from Dropped(), which is not called when the polling connection fails. https://github.com/Eventuous/eventuous/blob/b7352bb3b6565dd974b74a35655d782cea08dc08/src/SqlServer/src/Eventuous.SqlServer/Subscriptions/SqlServerSubscriptionBase.cs#L63-L96

Dropped() itself is only called from HandleInternal, and only in the case of an OperationCanceledException: https://github.com/Eventuous/eventuous/blob/b7352bb3b6565dd974b74a35655d782cea08dc08/src/Core/src/Eventuous.Subscriptions/EventSubscriptionWithCheckpoint.cs#L41-L54

PehrGit commented 7 months ago

> Look also here dotnet/SqlClient#2103 (comment), it seems that on Windows it will produce 10053, but on Linux it's impossible to figure out.

Ah that's too bad. Thanks for looking into that!

I suppose we should focus on getting the SQL subscription to resubscribe when the connection drops. That should also fix this issue, right?

PehrGit commented 7 months ago

Looking at the ESDB AllStreamSubscription, I see that EventSubscription.Dropped() is called when the subscription drops.

Could it be that we just need to replace IsDropped = true; with a call to .Dropped() in this method?

Edit: it looks like this is already fixed in dev, where Dropped(DropReason.ServerError, e); is called in the catch: https://github.com/Eventuous/eventuous/blob/0be16566922589befc985e61f750ef88c071641c/src/Relational/src/Eventuous.Sql.Base/Subscriptions/SqlSubscriptionBase.cs#L51-L83
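
To illustrate the difference between setting a flag and notifying a drop handler, here is a minimal, self-contained sketch of a polling loop. It is not the Eventuous implementation: `PollingSubscriptionSketch`, its `poll` delegate and its `onDropped` callback are hypothetical stand-ins for the polling query and for `Dropped(...)`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a polling subscription; all names are illustrative.
public sealed class PollingSubscriptionSketch {
    readonly Func<CancellationToken, Task> _poll;
    readonly Func<Exception, Task> _onDropped;

    public PollingSubscriptionSketch(Func<CancellationToken, Task> poll, Func<Exception, Task> onDropped) {
        _poll = poll;
        _onDropped = onDropped;
    }

    public bool IsDropped { get; private set; }

    public async Task Run(CancellationToken ct) {
        try {
            while (!ct.IsCancellationRequested) {
                await _poll(ct);
            }
        } catch (OperationCanceledException) when (ct.IsCancellationRequested) {
            // Normal shutdown requested by the caller: nothing to report.
        } catch (Exception e) {
            // Setting the flag alone leaves nobody to restart the loop.
            // Notifying the drop handler gives it the chance to resubscribe.
            IsDropped = true;
            await _onDropped(e);
        }
    }
}
```

In this sketch the callback is where a resubscribe would be triggered. If the catch block only set `IsDropped`, the loop would end quietly after the first connection failure.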