Recovery in case of discovered slots from redis cluster is partial

rahulKrishnaM commented 11 months ago

During one of resiliency tests at redis server side, recently came across an issue where redisClusterAsyncCommandArgv calls kept failing continuously with REDIS_ERRreturn code after some of the server pods were restarted.

Looking at the server-side data, it seems like for a brief period of time some of the slots were missing at the server side and the overall slot count was less than what it had to be (10088 instead of 16384). The cluster slots command seems to have been triggered during this period in time and might have fetched this partial data from server.

It is a hypothesis (from code reading) that the failure in redisClusterAsyncCommandArgv could be happening here if the targeted slot for the command is falling in the missing range.

https://github.com/Nordix/hiredis-cluster/blob/0a4deb6ea78cb0b8008288346868a99d29d404d8/hircluster.c#L4136

the node pointer would be NULL at this point, and the api returns out with REDIS_ERR.

We don't ever recover from this scenario if we hit this, since we don't go again for rediscovering the slots and hold on to the already discovered partial slots. The query I have is how to handle/recover out of this scenario. Should this be handled in the library to maybe schedule a rediscovery if we find that the slot information is partial.

cc: @bjosv

bjosv commented 11 months ago

It sounds reasonable that the library should handle a re-discovery when the slotmap is partial. Currently it just gives the REDIS_ERR as you state, but maybe it should call throttledUpdateSlotMapAsync(acc, NULL); as well. I believe there is a need for a testcase for this scenario and some fix.

zuiderkwast commented 11 months ago

I agree, throttled update is a good idea for the async API.

For the sync API we don't have throttling, but I think we can try at every command.

Nordix / hiredis-cluster

Recovery in case of discovered slots from redis cluster is partial #191