StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Intermittent connection errors after MemoryDB cluster version upgrade #2650

Closed schickosky closed 2 months ago

schickosky commented 4 months ago

I have an application using AWS MemoryDB for storage which recently underwent a version upgrade from 6.2 to 7.1, and ever since have been seeing fairly frequent connection drops with the following message:

StackExchange.Redis.RedisConnectionException: InternalFailure on insights-silo-memorydb-cluster-0001-002.insights-silo-memorydb-cluster.g748ly.memorydb.us-west-2.amazonaws.com:6379/Interactive, Idle/Faulted, last: SSCAN, origin: ReadFromPipe, outstanding: 17, last-read: 0s ago, last-write: 0s ago, unanswered-write: 0s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.7.17.27058
 ---> System.InvalidOperationException: Unexpected response prefix: {
    at StackExchange.Redis.PhysicalConnection.TryParseResult(ResultFlags flags, Arena`1 arena, ReadOnlySequence`1& buffer, BufferReader& reader, Boolean includeDetilInExceptions, ServerEndPoint server, Boolean allowInlineProtocol) in /_/src/StackExchange.Redis/PhysicalConnection.cs:line 2118
   at StackExchange.Redis.PhysicalConnection.ProcessBuffer(ReadOnlySequence`1& buffer) in /_/src/StackExchange.Redis/PhysicalConnection.cs:line 1859
   at StackExchange.Redis.PhysicalConnection.ReadFromPipe() in /_/src/StackExchange.Redis/PhysicalConnection.cs:line 1803
   --- End of inner exception stack trace ---

The error only seems to occur during set scans, and appears to cause the connection to drop briefly for the client. We also have sorted sets being read from and written to even more frequently, but I do not see the error from those commands. The data being retrieved is large JSON blobs stored as strings, and our workload is very write-heavy (we are continually popping elements off the sets to work on in memory and then re-persist, even when there is nothing to be done). I suspected the data may have become corrupted by being written/deleted during the upgrade (which was eventful and took 2 days), but I can successfully scan the entire set from redis-cli. I have not been able to reproduce the error outside of our production environments, however.
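
Roughly, the shape of that loop is as follows (a heavily simplified sketch; the key name and the DoWorkInMemory helper are made up for illustration):

var db = muxer.GetDatabase(); // muxer: our shared ConnectionMultiplexer
await foreach (var member in db.SetScanAsync("work:set", pageSize: 250)) // SSCAN pages of JSON blobs
{
    var updated = DoWorkInMemory(member.ToString()); // deserialize, mutate in memory (hypothetical helper)
    await db.SetRemoveAsync("work:set", member);     // SREM the old blob
    await db.SetAddAsync("work:set", updated);       // SADD the re-persisted blob
    await db.KeyExpireAsync("work:set", TimeSpan.FromDays(1));
}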

We are currently using 2.7.20, and I've tried rolling back a few versions with no change. Changing the protocol to Resp3 has also had no effect. The cluster has 6 shards, and I am not observing multiple clients' connections failing concurrently, even when there should be multiple clients working on the same shard. In our most active environment, we are seeing this error ~600 times per hour on average.
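
For reference, the Resp3 attempt was just the standard client configuration setting - a minimal sketch with a placeholder endpoint rather than our real config:

using StackExchange.Redis;

var options = ConfigurationOptions.Parse("example-cluster:6379,ssl=true"); // placeholder endpoint
options.Protocol = RedisProtocol.Resp3; // tried the default (Resp2) and Resp3 - no difference
var muxer = await ConnectionMultiplexer.ConnectAsync(options);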

I also have a ticket open with MemoryDB, as it isn't clear to me where the issue is actually coming from. Hoping there may be a way to get more detail on the root cause from the library?

mgravell commented 4 months ago

The mention of { suggests that it is actually looking into the middle of one of the JSON payloads when it expects to be at a break between messages. So I think there are two possible explanations here:

  1. there's a bug in the client causing it to get the length math wrong for individual redis frames, such that when it reads a message, it calculates the end at the wrong place
  2. there's a bug in some proxy or the server, causing it to send an invalid payload

SCAN doesn't actually fetch values, so I'm guessing you're actually performing SCAN plus a huge number of GETs - is your redis-cli usage also doing this? If you're comparing with redis-cli --scan, that only does the SCAN - no GETs - so it won't actually be doing the same thing.
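
To spell out the shape I'm imagining (purely a sketch with made-up names - your actual code will differ):

// hypothetical scan-and-fetch loop: IServer.KeysAsync pages via SCAN under the hood,
// and each key then costs an extra GET; redis-cli --scan only does the SCAN part
var server = muxer.GetServer(muxer.GetEndPoints()[0]); // muxer: the connected ConnectionMultiplexer
var db = muxer.GetDatabase();
await foreach (var key in server.KeysAsync(pattern: "some:prefix:*", pageSize: 250))
{
    var json = await db.StringGetAsync(key);
    // ... deserialize / process ...
}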

Unfortunately the data on the socket is very transient and notoriously hard to debug after the fact - and logging everything is prohibitively expensive. It genuinely wouldn't surprise me if the answer here is "2" - we have seen this before with other "not actually redis" servers (and in particular: proxies - especially when a proxy tries to auto-reconnect a broken socket mid-stream).

The "easiest" repro is for me to be able to debug directly against it, but I stress: I absolutely do not want to go anywhere near anyone's production systems (including my own) for debugging purposes.

I don't have an immediate "do this" suggestion, honestly.

It seems particularly relevant that this started during a server upgrade. I can't say anything with confidence, but it feels like this is most likely to be a server or proxy issue. I hope that doesn't seem like buck-passing.

schickosky commented 4 months ago

Yes, sorry, the operation is SSCAN to retrieve values from the set, not SCAN.

I do have a production environment with this error present that is currently seeing zero usage for other reasons; this is the environment I checked with redis-cli, as I otherwise very much agree about not debugging in prod. If there are any potentially unsafe things that could be useful, I can try them there without much risk.

An issue with the MemoryDB setup does seem more likely, but I'm having trouble explaining redis-cli not exhibiting the same behavior. I may need to rig it up to run repeatedly for some hours to better mimic the actual workload.

mgravell commented 4 months ago

My apologies - you did say "set scan" - I simply misread.

So to confirm: in this set, the actual JSON payloads are the set contents? In that case, yes, SSCAN will be returning the payloads. In terms of differences: note that redis-cli is only going to be issuing the SSCAN operations - it won't be doing anything else. By contrast, SE.Redis may be interleaving other operations into the mix - admin operations, concurrent load, etc. It is also aggressively "pipelined", i.e. we will happily send a bunch of commands one after the other down the pipe before we've seen the first reply. That's meant to be fine and is normal usage of redis, but we have seen problems in "redis-like" servers that fail to handle that kind of concurrency correctly. I can't say whether that is the actual cause, but it is at least a difference.
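
To illustrate the pipelining point with a minimal sketch (made-up keys and variables, not your code) - all of these hit the socket back-to-back before any reply has been read, which is exactly what redis-cli doesn't do:

var db = muxer.GetDatabase();                                     // muxer: the connected ConnectionMultiplexer
var add    = db.SetAddAsync("some:set", payloadJson);             // hypothetical key and payload
var expire = db.KeyExpireAsync("some:set", TimeSpan.FromDays(1));
var incr   = db.StringIncrementAsync("some:counter");
await Task.WhenAll(add, expire, incr);                            // replies are only observed here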

mgravell commented 4 months ago

Right; I have an idea, but it isn't a trivial amount of work; I think I can write a Tunnel implementation (an advanced configuration API) that writes all of the inbound data to a local file on the client. Let me have a look at that tomorrow. If we can do that, and if you have a reasonably reliable repro, we should have a snapshot of the entire inbound traffic. From there, if we can get a snapshot, we can write a parser that checks that file for validity. I can't remember (without checking) whether the Tunnel API is "before" or "after" the TLS layer, so: if your data is using TLS to the server we might need to disable that in the client and move it into the Tunnel.

schickosky commented 4 months ago

Yep, we're getting JSON from the set contents, and that distinction makes sense. It drove me to check - the error does list a last operation, and summing those for the last 12h indicates this isn't related solely to SSCAN at all, but appears to be distributed across all our commands. I think I had the impression it was solely an SSCAN issue because the stack traces from our code are horrific. Instead, the data is as follows:


@last     count()
(blank)   1936
SSCAN     4117
SREM      374
SADD      4542
GET       118
EXPIRE    298
ZADD      386
DECR      94
ZPOPMIN   161
INCR      120

I would absolutely appreciate anything you can do, but that sounds like a large undertaking. Hopefully the MemoryDB team will respond today and be able to shed some additional light. It does seem like the issue is more on their side than not - the service is strictly TLS, as well.

mgravell commented 4 months ago

I think it would be important to emphasize that the client is "pipelined" (they should know what this means; if they don't: well, that's a huge red flag) - if this is a server/proxy concurrency issue, that might help trigger a light bulb.

schickosky commented 4 months ago

Following up - as far as can be observed the cluster is behaving nominally, and its logs are of no additional use. Suggestions from the service team are similar - add additional logging, or try with another client to narrow down the source of the issue.

The custom Tunnel implementation sounds like an extremely large lift, so I want to sanity check: I've been poking through the repo to see if there's any simpler way to get additional logging. Am I wrong in thinking that compiling the client with LOGOUTPUT would accomplish the goal, or does LoggingPipe do less than I'm thinking?

mgravell commented 4 months ago

Honestly, I can't remember the last time I used LOGOUTPUT - no idea if it even compiles as a feature flag currently (it was for a specific contextual purpose) - and I'd ideally prefer to get something we can drop in without needing recompiles. I haven't had a chance to look yet, but I might have a moment tomorrow.

mgravell commented 4 months ago

See above PR; usage: https://github.com/StackExchange/StackExchange.Redis/blob/4b450ce557d52a05af42fbecf3fcc169f0e048ce/docs/RespLogging.md

Does this seem viable to run in your scenario? TLS is supported, etc.
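
In outline it's something like the following - treat this as a sketch (names are per the linked doc, from memory; the connection string and path are placeholders):

// capture: everything read from / written to the server is mirrored into .in/.out files
var options = ConfigurationOptions.Parse("yourserver:6379,ssl=true"); // placeholder
LoggingTunnel.LogToDirectory(options, "/var/log/resp");               // placeholder directory
var muxer = await ConnectionMultiplexer.ConnectAsync(options);
// ... run the workload until the error reproduces, then shut down and collect the files ...

// offline: check that the captured inbound stream is well-formed RESP
await LoggingTunnel.ValidateAsync("/var/log/resp");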

schickosky commented 4 months ago

Amazing!

I can't see any reason it wouldn't work - my only unknowns are disk space and exfiltrating the log files. The latter is, I think, trivial, and if I gate activation of the logging tunnel to the 'safer' environment, writing the ~20GiB it would take shouldn't be an issue.

mgravell commented 4 months ago

If you have (or can simulate) some workload that can help force this outcome (as hinted in the docs md), my hope is that ^^^ provides a mechanism for us to isolate whether this issue originates in the library vs the server.

schickosky commented 3 months ago

Apologies for the month gap, but the issue did reproduce in a dev environment while I was working on other priorities.

Note for any future googlers: it seems that the LoggingTunnel never closes the files it opens (even when the ConnectionMultiplexer is disposed), so the only way I was able to get the logs off my container was to add additional commands after the application exits.

I can confirm that ValidateAsync fails with

Unhandled exception. System.InvalidOperationException: Invalid fragment starting at 72354 (fragment 8632) ---> System.InvalidOperationException: Unexpected response prefix: {

and ReplayAsync terminates with

> GET ...
< -Index was outside the bounds of the array.

I'd prefer to share the full log files privately if possible, but the tails of the files that fail to validate are as follows (number of lines chosen arbitrarily):

.out

SADD
$99
PerformanceTest:machines:DimensionValue=J_GameRegion=us-west-2_GameStage=prod_GameVersion=unknown:1
$2132
{ ~ length json ~ }
*3
$6
EXPIRE
$99
PerformanceTest:machines:DimensionValue=J_GameRegion=us-west-2_GameStage=prod_GameVersion=unknown:1
$5
86400
*2
$6
UNLINK
$97
PerformanceTest:machines:DimensionValue=K_GameRegion=us-west-2_GameStage=prod_GameVersion=unknown
*5
$5
SSCAN
$99
PerformanceTest:machines:DimensionValue=X_GameRegion=us-west-2_GameStage=prod_GameVersion=unknown:0
$1
0
$5
COUNT
$3
250
*5
$5
SSCAN
$99
PerformanceTest:machines:DimensionValue=L_GameRegion=us-west-2_GameStage=prod_GameVersion=unknown:0
$1
0
$5
COUNT
$3
250
*5
$5
SSCAN
$99
PerformanceTest:machines:DimensionValue=W_GameRegion=us-west-2_GameStage=prod_GameVersion=unknown:0
$1
0
$5
COUNT
$3
250
*5
$5
SSCAN
$99
PerformanceTest:machines:DimensionValue=U_GameRegion=us-west-2_GameStage=prod_GameVersion=unknown:1
$1
0
$5
COUNT
$3
250

.in

_
+OK
:0
+OK
+QUEUED
+QUEUED
*2
+OK
:1
:1
:1
:1
*2
$1
0
*0
*2
$1
0
*0
*2
$1
0
*0
{ ~lengthy but incomplete json~

If I'm interpreting this correctly, the server is responding to an SSCAN with a zero-length array reply, but then sending an element anyway?

slorello89 commented 3 months ago

@schickosky - can you provide the length advertised for the JSON string as well as the actual length of the JSON you expect? E.g. you'll see a $ before each bulk string in the result array of the SSCAN command:

SADD set foo bar baz
:3
SSCAN set 0
*2
$1
0
*3
$3
foo
$3
bar
$3
baz

The $3 preceding foo, bar, and baz indicates that each bulk string has a length of 3 before it terminates in a CRLF.

schickosky commented 3 months ago

Maybe I haven't had enough caffeine yet, but I don't see how I'd get that? I believe their absence is part of the issue - we're not expecting any elements at all, yet JSON is being returned without any of the preceding length metadata.

Seems like strong evidence the issue is on the server, if anything.

slorello89 commented 3 months ago

Ahh, I see, so basically the output just cuts off and dumps a bunch of JSON out.

Your interpretation seems to be 100% right: it's saying there's nothing to respond with, but then shoveling a massive string back anyway. Perhaps when there's a single large element it tries to respond with a bulk string instead of an array but misformats it?

Given the errors emerged when you upgraded from MemoryDB 6.2 to 7.1 (and assuming you cannot reproduce this on a Redis 7.0/7.2 instance?), it sounds like this is most likely on MemoryDB's end. It's a closed-source fork of Redis, so we have no real insight into what it's doing; you probably need to go through AWS support.

schickosky commented 3 months ago

Seems likely. I have a ticket open with them that I've been updating in parallel. I'll (try to carve out the time to) see if I can reproduce it with bare Redis as well, since I expect I may be able to do that sooner than I get a response.

schickosky commented 2 months ago

Issue was resolved on the MemoryDB side. Thanks for the help investigating!