StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Timeout awaiting response - SignalR Backplane #2398

Open LastTribunal opened 1 year ago

LastTribunal commented 1 year ago

My org is using the StackExchange.Redis .NET library (v2.2.62.27853) as a SignalR backplane with Azure Redis Service in a multi-node VM cluster.
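
For context, here is a minimal sketch of the kind of backplane wiring involved, assuming classic ASP.NET SignalR 2.x with the Microsoft.AspNet.SignalR.Redis package; the endpoint, password, and event key below are placeholders, not the actual values:

```csharp
using Microsoft.AspNet.SignalR;
using Owin;

public class Startup
{
    public void Configuration(IAppBuilder app)
    {
        // Placeholder connection string; the real one targets the Azure Redis SSL endpoint (port 6380).
        var redisConfig = new RedisScaleoutConfiguration(
            "contoso.redis.cache.windows.net:6380,password=...,ssl=True,abortConnect=False",
            "SignalRBackplane"); // event key / channel prefix (placeholder)

        // Register the Redis backplane before mapping SignalR.
        GlobalHost.DependencyResolver.UseRedis(redisConfig);

        app.MapSignalR();
    }
}
```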

We have observed that whenever there is an outage in Azure Redis Service, the Redis client on one or more of our cluster nodes (no particular node) becomes corrupted and logs the error below.

The error persists after Azure Redis Service has recovered. It continues even under minimal server load, and is resolved only by rebooting the affected cluster node.

Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5563ms elapsed, timeout is 5000ms), command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 4, aw: False, rs: ReadAsync, ws: Idle, in: 1552, serverEndpoint: redisprod.redis.cache.windows.net:6380, mc: 1/1/0, mgr: 10 of 10 available, clientName: Nodes0004, IOCP: (Busy=29,Free=971,Min=8,Max=1000), WORKER: (Busy=2,Free=32765,Min=8,Max=32767), v: 2.2.62.27853 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
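
One way to see whether the multiplexer ever notices the failure and the subsequent recovery is to log its connection events. A minimal sketch, assuming we can reach the ConnectionMultiplexer the backplane uses (Console output here is just a stand-in for a real logger):

```csharp
using System;
using StackExchange.Redis;

public static class RedisDiagnostics
{
    public static ConnectionMultiplexer ConnectWithLogging(string configuration)
    {
        var muxer = ConnectionMultiplexer.Connect(configuration);

        // Raised when a physical connection drops; FailureType/Exception explain why.
        muxer.ConnectionFailed += (_, e) =>
            Console.WriteLine($"[Redis] ConnectionFailed {e.EndPoint} {e.ConnectionType} {e.FailureType}: {e.Exception?.Message}");

        // Raised when the connection comes back. If this never fires after the
        // Azure outage ends, the client is stuck on a dead socket until restart.
        muxer.ConnectionRestored += (_, e) =>
            Console.WriteLine($"[Redis] ConnectionRestored {e.EndPoint} {e.ConnectionType}");

        // Server-side error messages (e.g. LOADING during failover) surface here.
        muxer.ErrorMessage += (_, e) =>
            Console.WriteLine($"[Redis] ErrorMessage {e.EndPoint}: {e.Message}");

        // Library-internal errors that would otherwise be swallowed.
        muxer.InternalError += (_, e) =>
            Console.WriteLine($"[Redis] InternalError {e.Origin}: {e.Exception?.Message}");

        return muxer;
    }
}
```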

NickCraver commented 1 year ago

Can you please try the latest version and see if you can reproduce? There have been a few related fixes here in later versions that hopefully cover your cause as well - we handle some issues a bit better for these scenarios.

LastTribunal commented 1 year ago

Thank you sir. We are doing exactly that, will post updates

NickCraver commented 1 year ago

@LastTribunal any luck with version changes?

LastTribunal commented 1 year ago

So far we haven't seen any issues, but there also have not been any Azure Redis outages, which were the trigger for previous faults. Will keep you posted.

LastTribunal commented 1 year ago

> Can you please try the latest version and see if you can reproduce? There have been a few related fixes here in later versions that hopefully cover your cause as well - we handle some issues a bit better for these scenarios.

After a couple of months we finally saw an issue. However, unlike in the past, the new Redis client version we installed self-healed without us having to restart the VM. So it seems the issue still exists, but is being managed internally.

We saw two nodes out of five experience this for 10 minutes each within a few hours.

NickCraver commented 11 months ago

@LastTribunal Did you have any exception from that? Could this be the ~6-weekly patching that happens across Azure nodes?

IfOnTuesdayAndNotRainingThenDoThis commented 9 months ago

Hello @NickCraver, I am a colleague of @LastTribunal and this issue still persists. We are experiencing this on a bi-daily basis, with 10-20 minute outages in our production environment. Each episode causes massive disconnects and heavy load on our servers while all the clients try to reconnect. We engaged the Azure networking and Redis teams and they were not able to find any issue on their end. They asked us to tweak the IOCP minimum thread count, but that feels like a rabbit hole we don't want to go down. We also tried this in a console app, and tweaking it does not help.
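
For reference, the tweak being suggested is usually a ThreadPool.SetMinThreads call early in startup; a minimal sketch with purely illustrative numbers (not a recommendation):

```csharp
using System;
using System.Threading;

static class ThreadPoolTuning
{
    // Call once at application startup (e.g. Application_Start in Global.asax).
    public static void Apply()
    {
        ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
        Console.WriteLine($"Current minimums: worker={minWorker}, IOCP={minIocp}");

        // The original error shows IOCP Busy=29 against Min=8: once Busy exceeds Min,
        // the pool injects new threads slowly, which can delay async completions.
        // The values below are placeholders; the right numbers depend on core count and load.
        ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: 200);
    }
}
```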

Any assistance will be helpful and we are running out of ideas.

Thanks,

NickCraver commented 9 months ago

@IfOnTuesdayAndNotRainingThenDoThis Do you have some more recent error samples to look at?

IfOnTuesdayAndNotRainingThenDoThis commented 9 months ago

@NickCraver

Here is the exception:

StackExchange.Redis.RedisTimeoutException at StackExchange.Redis.RedisDatabase+d__223.MoveNext

Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5437ms elapsed, timeout is 5000ms), command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 0, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 5256, last-in: 0, cur-in: 0, sync-ops: 4, async-ops: 481539, serverEndpoint: er-prod-redis.redis.cache.windows.net:6380, conn-sec: 384449.3, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: NodesPrd2000003(SE.Redis-v2.6.104.40210), IOCP: (Busy=72,Free=928,Min=8,Max=1000), WORKER: (Busy=5,Free=32762,Min=8,Max=32767), v: 2.6.104.40210 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

It's a surge of 100+ exceptions, and it happens on all our VM nodes at the same time.

Here are the versions we are using (RedisVerison screenshot attached).

IfOnTuesdayAndNotRainingThenDoThis commented 8 months ago

@NickCraver

Update:

We were able to capture the exception we got on the clients that were subscribed:

SignalR connection error: "System.Net.WebSockets.WebSocketException (0x80004005): The remote party closed the WebSocket connection without completing the close handshake.
   at System.Net.WebSockets.WebSocketBase.WebSocketOperation.d19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Net.WebSockets.WebSocketBase.d45.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNet.SignalR.WebSockets.WebSocketMessageReader.d3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNet.SignalR.WebSockets.WebSocketHandler.d25.MoveNext()"

Right afterwards we also saw a SignalR connect spike, which caused the Redis timeout. We can't seem to find the cause of these disconnects.
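
If the root cause of the disconnect storm stays elusive, one mitigation worth considering is spreading out the client reconnects. A rough sketch, assuming the classic Microsoft.AspNet.SignalR.Client .NET client (the URL and hub name are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNet.SignalR.Client;

static class ReconnectingClient
{
    static readonly Random Jitter = new Random();

    public static async Task RunAsync()
    {
        // Placeholder hub URL and hub name.
        var connection = new HubConnection("https://example.contoso.com/signalr");
        var hub = connection.CreateHubProxy("NotificationsHub");
        hub.On<string>("notify", msg => Console.WriteLine(msg));

        connection.Closed += async () =>
        {
            // Randomized backoff so thousands of clients do not all reconnect in
            // the same instant and produce the connect spike described above.
            while (connection.State == ConnectionState.Disconnected)
            {
                await Task.Delay(TimeSpan.FromSeconds(5 + Jitter.Next(0, 30)));
                try { await connection.Start(); }
                catch { /* still down; back off and try again */ }
            }
        };

        await connection.Start();
    }
}
```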