StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Persistent timeouts all of a sudden #2567

Open JLRishe opened 10 months ago

JLRishe commented 10 months ago

I've been using Redis for several years via the RedisSessionStateProvider from an Azure web app.

From time to time, I've encountered a situation where the session state provider would start failing with timeout errors while executing Redis operations. Until now, I've been able to resolve these by restarting the web app, which, I suppose, clears the queue of Redis requests. Last February, I posted an issue about that and was given a suggestion to upgrade the relevant NuGet packages, which I did, and I hadn't encountered the issue again until today.

Currently installed packages:

Today, the app started experiencing a flurry of Redis timeouts. This time, restarting the web app did not resolve the issue; the timeouts came back almost immediately. I also tried failing over to another, usually dormant, web app instance in a different Azure region, and the timeouts immediately started showing up there as well. Rebooting the Redis cache from the Reboot panel in the Azure Portal didn't help either.

Here are some examples of the errors I was receiving:

Timeout performing EVAL (1000ms), next: EVAL, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 0, cur-in: 0, sync-ops: 24839, async-ops: 1, serverEndpoint: *****.redis.cache.windows.net:6380, conn-sec: 3632.63, mc: 1/1/0, mgr: 10 of 10 available, clientName: dw0sdwk0001C6(SE.Redis-v2.6.90.64945), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=9,Free=8182,Min=1,Max=8191), v: 2.6.90.64945

Timeout performing EVAL (1000ms), next: EVAL, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 1082, last-in: 0, cur-in: 0, sync-ops: 24839, async-ops: 1, serverEndpoint: *****.redis.cache.windows.net:6380, conn-sec: 3632.7, mc: 1/1/0, mgr: 10 of 10 available, clientName: dw0sdwk0001C6(SE.Redis-v2.6.90.64945), IOCP: (Busy=1,Free=999,Min=1,Max=1000), WORKER: (Busy=8,Free=8183,Min=1,Max=8191), v: 2.6.90.64945

Timeout performing EVAL (1000ms), next: EVAL, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 1082, last-in: 0, cur-in: 0, sync-ops: 24839, async-ops: 1, serverEndpoint: *****.redis.cache.windows.net:6380, conn-sec: 3632.71, mc: 1/1/0, mgr: 10 of 10 available, clientName: dw0sdwk0001C6(SE.Redis-v2.6.90.64945), IOCP: (Busy=1,Free=999,Min=1,Max=1000), WORKER: (Busy=7,Free=8184,Min=1,Max=8191), v: 2.6.90.64945

Until now, I have been using a C0 size Redis cache, but today it occurred to me that its shared infrastructure might be the cause of the issue, so I scaled up to C1. At the same time, I scaled out my web app from 2 instances to 4 in case the bottleneck was within the app.

After performing these two actions, the timeout errors drastically decreased, but did not go away entirely, occurring a few times every few minutes. Naturally, I'm not sure which of these two changes actually had the effect, or whether it was a combination of both.

Monitoring in Azure shows the traffic to my web app to be consistent with other weekdays from the past month, but I'm wondering if this could simply be a matter of the app accessing session state faster than the Redis cache can keep up with?
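
For what it's worth, one way I'm planning to sanity-check that from the app side is a quick ping loop using StackExchange.Redis directly, which appears to be the same client the session state provider uses under the hood (the clientName in the errors shows SE.Redis). This is only a rough sketch; the endpoint and access key below are placeholders for my real values:

    using System;
    using System.Linq;
    using StackExchange.Redis;

    // Placeholder endpoint/key -- same values the provider config uses.
    var muxer = ConnectionMultiplexer.Connect(
        "<my-cache>.redis.cache.windows.net:6380,ssl=true,password=<access-key>");
    var db = muxer.GetDatabase();

    // 50 sequential PINGs; each call returns the round-trip time as seen by this client.
    var samples = Enumerable.Range(0, 50)
        .Select(_ => db.Ping().TotalMilliseconds)
        .OrderBy(ms => ms)
        .ToArray();

    Console.WriteLine($"p50={samples[24]:F1}ms  p95={samples[47]:F1}ms  max={samples[49]:F1}ms");

My thinking is that if these round trips stay fast while the provider calls keep timing out, that would point more toward the app/thread side than raw cache throughput.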

I would greatly appreciate any advice as to how to troubleshoot or improve this issue.

Current settings:

        <add name="RedisCacheStore" type="Microsoft.Web.Redis.RedisSessionStateProvider"
             host="#{RedisHost}#"
             port="6380"
             accessKey="#{RedisKey}#"
             ssl="True"
             connectionTimeoutInMilliseconds="15000" 
             operationTimeoutInMilliseconds="1000"
             retryTimeoutInMilliseconds="3000"
        />
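
For reference, my (possibly off-base) understanding is that the two timeout attributes above map roughly onto StackExchange.Redis's ConnectTimeout and SyncTimeout when connecting directly, which would explain the 1000ms figure in the timeout messages. A minimal sketch with placeholder host/key, not the provider's actual internal wiring:

    using StackExchange.Redis;

    // Placeholder host/key; the numeric values mirror the provider attributes above.
    var options = new ConfigurationOptions
    {
        EndPoints = { { "<my-cache>.redis.cache.windows.net", 6380 } },
        Password = "<access-key>",
        Ssl = true,
        ConnectTimeout = 15000,    // ~ connectionTimeoutInMilliseconds
        SyncTimeout = 1000,        // ~ operationTimeoutInMilliseconds (the 1000ms in the errors)
        AbortOnConnectFail = false
    };

    var muxer = ConnectionMultiplexer.Connect(options);

One thing I may try is raising operationTimeoutInMilliseconds above 1000 to see whether the operations are merely slow rather than stuck.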

Some findings from the metrics in Azure:

It looks like the cache's server load has been steadily increasing since Sept. 13, before shooting up to 100% yesterday. Before that, there were spikes followed by a return to baseline:

[chart: cache Server Load]

Cache latency shows a similar trend:

[chart: cache latency]

By comparison, traffic to my web app has followed a consistent pattern for the past month, with peaks during US business hours on weekdays, and the usual amount of traffic today:

[chart: web app traffic]

The cache's Total Operations and Cache Hits follow a similar trend to the web app, though they seem to have doubled today:

[charts: Total Operations and Cache Hits]

Max CPU and Connections Created, which have been fairly flat all month, suddenly shot way up today.

[charts: Max CPU and Connections Created]

Are there any insights I can draw from these stats as to what the root cause could be?
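
In case it helps with comparing against the portal charts, I can also pull similar numbers straight from the cache itself with an INFO call through the same client (endpoint and key are placeholders again, and the exact INFO field names vary by Redis version):

    using System;
    using StackExchange.Redis;

    // Placeholder endpoint/key.
    var muxer = ConnectionMultiplexer.Connect(
        "<my-cache>.redis.cache.windows.net:6380,ssl=true,password=<access-key>");
    var server = muxer.GetServer("<my-cache>.redis.cache.windows.net", 6380);

    // Raw INFO sections that roughly correspond to the charts above.
    Console.WriteLine(server.InfoRaw("cpu"));      // used_cpu_sys / used_cpu_user
    Console.WriteLine(server.InfoRaw("stats"));    // instantaneous_ops_per_sec, total_connections_received
    Console.WriteLine(server.InfoRaw("clients"));  // connected_clients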

juliankay commented 10 months ago

I'm seeing exactly the same issue with Redis on Azure. Looks like it started on or around the 27th of September. Might be a coincidence, but Microsoft also launched some new SKUs for Redis around the same time. 👀

Edit - also seeing very similar graphs to you.

JLRishe commented 10 months ago

@juliankay Thank you for weighing in on this. Can you tell me where I can find the information you're referring to regarding new SKUs?

NickCraver commented 10 months ago

cc @philon-msft

juliankay commented 9 months ago

@JLRishe here's an announcement of some of the new SKUs. May be a coincidence, but it suggests to me that some changes are going on inside of their infrastructure. https://azure.microsoft.com/en-gb/updates/public-preview-additional-cache-sizes-for-azure-cache-for-redis-enterprise/

philon-msft commented 9 months ago

Most likely, these are occurrences of a known issue impacting smaller C0 and C1 size caches in Azure. A recent change in a security monitoring process running on cache VMs is causing the process to gradually consume more and more CPU cycles as the system event log files grow. We have a fix that will roll out over the next few weeks. Meanwhile, you can open an Azure support request to have the Azure Cache team apply a mitigation to any cache being impacted.

JLRishe commented 9 months ago

@philon-msft Thank you very much. I will file a support ticket as you suggest.

I would very much appreciate it if someone could report back to let us know when the actual fix is ultimately rolled out.