StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/
Other
5.85k stars 1.5k forks source link

Feature Request: DB selection support for proxy-bridged cluster #2642

Closed baronwangr closed 4 months ago

baronwangr commented 5 months ago

Description: Some of the Proxy-bridged cluster can support multiple db selection. In order to give full support of advanced features, the proxy-bridged cluster worked as a single node cluster that:

in that mode, it can give the compatibility at best effort and user could write same code for standalone mode or cluster mode, especially when user scales standalone mode to cluster, with db selection data inside.

An unhandled timeout exception was encountered when doing db selection in cluster mode in some early version of the library, and we tried the latest package 2.7.17 but also failed.

Unhandled exception. StackExchange.Redis.RedisTimeoutException: Timeout performing HSET (10000ms), inst: 0, qu: 0, qs: 0, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: 172.123.35.71:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: 47e4280ce570, PerfCounterHelperkeyHashSlot: 11933, IOCP: (Busy=0,Free=1000,Min=32,Max=1000), WORKER: (Busy=0,Free=32767,Min=32,Max=32767), v: 2.2.79.4591

the case can be reproduced by using code below,

        // singleton connection
        ConnectionMultiplexer cm = RedisConnSingleton.getRedisConn();
        var db = cm.GetDatabase();

        for (int i = 1; i <= 100; i++)
        {
            string key = i.ToString();
            string value = i.ToString();

            // after switch db selection, the GetDatabase returned a unexpect handle so failed to send
            db.GetDatabase(i); 

            db.StringSet(key, value);
            String ret = db.StringGet(key);
            Console.WriteLine("get key: " + ret);

            Thread.Sleep(1000);
        }    

Thanks, Baron

mgravell commented 5 months ago

Some of the Proxy-bridged cluster can support multiple db selection.

Which, exactly? You say something can be repro'd, but: we can't repro anything without knowing what server configuration is required. Is this redislabs/redis-cluster-proxy (which seems to be pretty idle in terms of GitHub)? Or is this some other setup that spoofs the redis-cluster feature on top of a different shard model? How does this cluster bridge identify itself? This is an important question because "normal" cluster doesn't support multiple databases, so we disable the feature in the client.

Perhaps alternatively; if everything is going to be routed via the proxy, maybe we don't even need to know.that it is in cluster mode - deferring all of that to the proxy, in which case we need to know how it identified itself simply so we can ignore the cluster aspect.

However, the most interesting thing for me here is that you say you're getting a timeout when changing database, but: I would expect GetDatabase to fail if we detected cluster mode - so this seems to be a different failure mode, and not one that the library is forcing. Alternatively, if we didn't identify a cluster, I would expect everything to work already.

But: I can do precisely zero to investigate any of this without knowing exactly what server/proxy setup you have here.

baronwangr commented 5 months ago

Hi, thanks for the quick response! It could be late at night... Alibaba cloud Redis supports proxy mode, which begins from 2.8 in static sharding mode (earlier) and now a bridge for native cluster, it can support multi db selection well.

Most of the Redis SDK doesn't support standalone-cluster auto-nego well so cloud vendors are spending times on it. It looks that AWS is supporting unsharded-cluster mode to support user scale their Redis recently, meanwhile alibaba uses proxy to scale.

The issue happened when user scale a standalone mode to proxy-bridged cluster. It worked like:

  1. The cloud vendor, provide master-replica pair of Redis for standalone mode
  2. The Redis pair has a Virtual IP (allocated from user's virtual switch) so user could access the VIP:6379(for example) to access the standalone Redis. (Redis master's service IP:Port is using NAT to bind to VIP:6379)
  3. when user are submit a scale, the managing system is producing a new cluster, and begin to transfer data to the cluster (with proxies).
  4. when data transfer is done, trying to set RO mode or pause the write client to wait for the in-flight write to finish. When it is done, it rebinds the VIP:6379 to proxies(NAT with Load-balance mode). so the scale is done.

For user land, when VIP is switched, all links to VIP will be killed, so user app with SDK will encounter a link bounce.

We are still investigating the timeout issue when switch is done, current debug shows when switch to a cluster mode, some exceptions could be handled inside silently so user get a timeout. It is still unknown that if the client is re-configured into a cluster mode because in wireshark we saw "SELECT" went out and "+OK" returned.

Yes I totally agree that the rejection of SELECT in cluster mode is also a good option with fast failure. In test we found that SELECT can be sent but not well handled. @yangbodong22011(Jedis Reviewer) is working on it now, we will provide test environment.

It is also hoped that the db selection in cluster should not be a vendor specific SDK feature, we just try to find a way to enable it gracefully in cluster mode. Proxy can be fixed to adapt to auto-negotiation way, which we can propose to the Redis community. @soloestoy

We will keep you updated about the debugging, hope you get us well.

Thanks again.

lyq2333 commented 5 months ago

@mgravell Hi, sorry for the bother. I think maybe I find the reason. The variable Source is in InternalFailure state but the function TryPushMessageToBridgeSync returns WriteResult.Success in ExecuteSyncImpl. Finally, source will wait until syncTimeout and return a timeout error.

image

The reason for this is that we catch the exception but return WriteResult.Success in WriteMessageToServerInsideWriteLock. image

I'm new here. Is this by design or a bug?

mgravell commented 5 months ago

If result was Success but we failed internally, then that is odd and probably a bug; however, I'm not sure we actually wrote the select, because the inner exception message shown comes from the code that would have generated that command, however: without a test environment to test against, it is very hard to investigate directly. For this to work, we would need some way to identify that this is a server that supports this usage.

If someone can provide a server endpoint to test against, we can say more.

yangbodong22011 commented 5 months ago

@mgravell I sent you the test environment address via slack, please check it.

mgravell commented 5 months ago

Thanks for providing a suitable validation environment; this isn't a "normal" redis configuration, and a reasonable argument could be made that this is a bug in their proxy (it shouldn't positively respond to cluster nodes, IMO), but: it now works!

mgravell commented 4 months ago

fix merged; should have new release soon; preview is on myget, 2.7.20 or higher from nuget feed https://www.myget.org/F/stackoverflow/api/v3/index.json

baronwangr commented 4 months ago

It passed our test and work well now. Thank @mgravell for the support!