StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/
Other
5.87k stars 1.51k forks source link

Pipelining causes NOAUTH errors with External Authentication on Envoy #2785

Closed dbarbosapn closed 4 days ago

dbarbosapn commented 1 week ago

Hello,

We have a situation where our Redis client is returning NOAUTH during the connection flow, but if we manually connect to it through redis-cli and send an AUTH <password> command, from that exact same machine, we get OK and can send other commands after.

There is no place where we can see the actual result of the AUTH command sent by the SDK since, from what we've seen in code, it's a fire and forget message. Any ideas how to debug this?

Important Note We have multiple redis clusters that we're connecting to. If we put this cluster first, it works. If we put it last, it doesn't!

For example, this doesn't work:

ConnectionMultiplexer.Connect(options1);
...
ConnectionMultiplexer.Connect(options2);
...
ConnectionMultiplexer.Connect(options3);
...
ConnectionMultiplexer.Connect(options4);

But this does:

ConnectionMultiplexer.Connect(options4);
...
ConnectionMultiplexer.Connect(options2);
...
ConnectionMultiplexer.Connect(options3);
...
ConnectionMultiplexer.Connect(options1);

We're wondering if somehow the multiplexer might be using a wrong connection from a completely different instance. If there is some kind of static or thread-local state.

Some logs:

Connecting (sync) on .NET 8.0.8 (StackExchange.Redis: v2.8.0.27420)
<redacted>.internal.office.com:6380,syncTimeout=5000,connectTimeout=17000,abortConnect=False
<redacted>.internal.office.com:6380/Interactive: Connecting...
<redacted>.internal.office.com:6380: BeginConnectAsync
1 unique nodes specified (with tiebreaker)
<redacted>.internal.office.com:6380: OnConnectedAsync init (State=Connecting)
Allowing 1 endpoint(s) 00:00:17 to respond...
Awaiting 1 available task completion(s) for 17000ms, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=0,Free=32767,Min=4,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=404,Timers=28)
Configuring TLS
TLS connection established successfully using protocol: Tls13
<redacted>.internal.office.com:6380/Interactive: Connected
<redacted>.internal.office.com:6380: Server handshake
<redacted>.internal.office.com:6380: Authenticating (password)
<redacted>.internal.office.com:6380: Sending critical tracer (handshake): PING
<redacted>.internal.office.com:6380: Flushing outbound buffer
<redacted>.internal.office.com:6380: OnEstablishingAsync complete
<redacted>.internal.office.com:6380: Starting read
Response from <redacted>.internal.office.com:6380/Interactive / PING: Error: NOAUTH Authentication required.
<redacted>.internal.office.com:6380: OnConnectedAsync completed (Disconnected)
All 1 available tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=2,Free=32765,Min=4,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=422,Timers=27)
Endpoint summary:
<redacted>.internal.office.com:6380: Endpoint is (Interactive: Connecting, Subscription: Disconnected)
Task summary:
<redacted>.internal.office.com:6380: Returned, but incorrectly
Endpoint Summary:
<redacted>.internal.office.com:6380: Envoyproxy v6.0.0, primary; 1 databases; keep-alive: 00:01:00; int: Connecting; sub: n/a; not in use: DidNotRespond
<redacted>.internal.office.com:6380: int ops=2, qu=0, qs=0, qc=0, wr=0, socks=2; sub ops=0, qu=0, qs=0, qc=0, wr=0
<redacted>.internal.office.com:6380: Circular op-count snapshot; int: 0+2=2 (0.20 ops/s; spans 10s); sub:
Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
Resetting failing connections to retry...
AuthenticationFailure on <redacted>.internal.office.com:6380/Interactive, Flushed/ComputeResult, last: PING, origin: SetResult, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: ConnectedEstablishing, mgr: 9 of 10 available, last-heartbeat: never, last-mbeat: 0s ago, global: 0s ago, v: 2.8.0.27420
slorello89 commented 1 week ago

Are you putting your password in the ConfigurationOptions? because it should call the AUTH automatically.

dbarbosapn commented 1 week ago

Yes, we are. You can even see on the logs it shows "Authenticating (password)".

This is really weird

slorello89 commented 1 week ago

Maybe try giving each of your ConfigurationOptions a unique SocketManager? That is a shared resource unless you overwrite it (though I wouldn't think it would matter because presumably all of your clusters are at different endpoints).

dbarbosapn commented 1 week ago

Yes all clusters are different endpoints... Also maybe important information is that we see the AUTH call incoming on server-side, and we know it is authenticated successfully. Makes no sense how it would show NOAUTH after. But we cannot see the response the client is seeing, since it's not logged by the StackExchange.Redis SDK 😢

dbarbosapn commented 1 week ago

Using separate socket manager instance didn't fix the issue. Same exact symptoms (logs look the same as well).

mgravell commented 1 week ago

Ok, this is very weird! To confirm, there should be no ambient state in play here, so I'm very confused. I'll ping you on Teams (myself and Nick are also MSFT) - it may be easier to get to the bottom of this directly rather than on GitHub (and obviously any fixes/knowledge-gained can still work their way back here).

mgravell commented 1 week ago

Just updating the public logs; envoy proxy is in play here; by adding RESP logging, it looks like we're seeing something very odd;

outbound:

AUTH whatever
PING

(where the PING is a critical tracer to check that we think the server is behaving properly)

inbound:

+OK
-NOAUTH

But with the weird timing complication; the other servers in the timing matrix are not behind envoy. Investigations continue.

dbarbosapn commented 1 week ago

Updated the title and updating here with what we think is the root cause so far. (Thanks a lot @mgravell and @NickCraver for the help building a repro!)

Even though Envoy connections are single-threaded by nature, when external authentication (new feature, more info here) is enabled, the authentication is done asynchronously.

This means that if the AUTH command is pipelined with the PING command, we will have exactly the repro we've seen.

mgravell commented 1 week ago

Emphasis: fundamentally this is an envoy bug. They must not process commands out of order (even if they buffer the response until appropriate) if the connection state is relevant, and here: the connection state is relevant. Is this logged with envoy? If not, I'm happy to log it.

dbarbosapn commented 1 week ago

FYI: https://github.com/envoyproxy/envoy/issues/35983

mgravell commented 4 days ago

Envoy fix merged; closing - thanks for the report; between us we got to the bottom of it!