nodeNext strange behaviour after master failover.

dembekadam commented 1 year ago

We use nodeNext to monotor all the master in redis cluster . Every several seconds execute code

nodeIterator ni; redisClusterNode *node; initNodeIterator(&ni, this->_redisClusterAsyncContext->cc); while ((node = nodeNext(&ni)) != NULL ) { redisClusterAsyncCommandToNode(this->_redisClusterAsyncContext, node, onClusterHeartBeat, &pingCallbackRespTab[i], "PING"); }

normally it work fine and sends PING to master in each shard and when there is master failover it starts sending PING to new master.

But on larger cluster with 7 shards and large number of updates. We see that nodeNext does not return all nodes or returns slave nodes instead of master. I would expect this to happen once or twice when routing table is updated is but after master fail over this strange behavior of nodeNext continue until we will not refresh _redisClusterAsyncContext. We call initNodeIterator before each try so its should have correct number of nodes but somehow data returned by nodeNext does not match the list of node returned by CLUSTER NODES command in redis-cli

dembekadam commented 1 year ago

I tried to understand the behavior and added logs dumping :

redisClusterContext struct dict nodes; / Known redisClusterNode's */
redisClusterContext redisClusterNode *table; / redisClusterNode lookup table */
Data returned by nodeNext

Stared with 10.254.2.71:6379@16379 master 10.254.69.109:6379@16379

and rebooted the master so 10.254.69.109 became new master

before reboot logs shows only 10.254.2.71 nodes table [ 0 ] key 10.254.2.71:6379 redisClusterNode table [ 5461 ] name d67adf50be847591c7676fb1a41cfce1da30da8c addr 10.254.2.71:6379 role 1 fail cnt 0 The nodeNext showed IP of 3 redis master

after failover the redisClusterNode *table; / redisClusterNode lookup table */ was updatd and started to show new IP redisClusterNode [ 5461 ] name 7d4ae46bf5ab2390b1ab12dc8367a750b478ccf2 addr 10.254.69.109:6379 role 1 fail cnt 0

but struct dict *nodes; was still showing slave IP nodes table [ 0 ] key 10.254.2.71:6379

And nodeNext now was showing 3 master + 1 slave IP

What is the purpose of both nodes and table in redisClusterContext and is this ok that nodes contains still slave IP ?

zuiderkwast commented 1 year ago

Thanks for the report!

I don't know why we have both a list and a table. Maybe it is not needed actually. Hopefully we can investigate this in the next week.

dembekadam commented 1 year ago

Thanks We are debugging some problem with redis node failover. With small number of updates per second and 3 shards hiredis-cluster always automatically connects to new master. But on cluster with 7 shards and big traffic we see that hiredis-cluster detected that connection to old master was lost and sends CLSUTER NODES command that has reply with new master for given slot range but when using tcpdump we dont see any attempts for client to try to connect to newly selected master . ( no TCP SYNC or AUTH command sent to new master). Thats why I suspect some problem with parsing CLUSTER NODES and updating hiredis-cluster internal nodes tables for cluster with more shards.

If you will have any other suggestions what to check if hiredsi-cluster is not sending AUTH to new master after CLUSTER NODES response it would be very useful.

zuiderkwast commented 1 year ago

It's good that you are able to reproduce the problem.

Some ideas:

Adding some printf or logging in hiredis-cluster is always useful for debugging. If you suspect it's an error parsing CLUSTER NODES, I would put some printf in parse_cluster_nodes() and parse_cluster_nodes().
Run CLUSTER NODES in a terminal using redis-cli and look the output to see if there is anything special there.
To check if it is about the number of shards or the amount of traffic per second, you can try with 3 shards and big traffic or with 7 shards and small traffic.

zuiderkwast commented 1 year ago

Also note that connection to a new master doesn't happen directly when CLUSTER NODES is handled. Connection happens only when there there is a command for that node (lazy connect).

SzymonWojtkiewicz commented 1 year ago

I'm encounter the same issue. In addition I have noticed that after master frailer new TCP connection is open to its slave (new master of a shard) but that is the only communication between them. No DB requests are being send.

dembekadam commented 1 year ago

Yes I noticed that connection to newly selected master is only during command.

We keep calling redisClusterAsyncFormattedCommand with keys that should go to this new master but the tcpdump does not show any attempt to really send them and there is no callback called so the process memory keeps growing.

Im thinking of workaroung to count how many request we sent without response and when it starts growing call redisClusterReset

dembekadam commented 1 year ago

We suspect the problem could be in actx_get_by_node it calls redisAsyncConnectWithOptions and immediately calls redisAsyncCommand(ac, NULL, NULL, "AUTH %s %s") not waiting for redisAsyncConnectWithOptions callback.
Should the AUTH be called only in callback passed to redisAsyncConnectWithOptions ? in tcpdump we se TCP sync sent to new master but AUTH not send.

We added test sleep 10ms between redisAsyncConnectWithOptions and redisAsyncCommand(AUTH) and it seams more clients was able to reconnect but still not all.

dembekadam commented 1 year ago

Looks like the problem was in libae that was used by hiredis asynchronous functions.

https://github.com/redis/hiredis/blob/master/adapters/ae.h

The AE must be initialized with specific size aeCreateEventLoop(1024); and each redis asynchronous function opens temporary fd to send command. When application starts number of fd is low and they dont exceed AE max size. But when application opens many other connections and then redis master fail over happen it could get assigned fd > max size and command is silently doped

int aeCreateFileEvent(aeEventLoop eventLoop, int fd, int mask, aeFileProc proc, void *clientData) { if (fd >= eventLoop->setsize) { errno = ERANGE; return AE_ERR; }

In https://github.com/redis/hiredis/blob/master/adapters/ae.h return code from aeCreateFileEvent is ignored and then command is silently dropped.

Nordix / hiredis-cluster

nodeNext strange behaviour after master failover. #175