Nordix / hiredis-cluster

C client library for Valkey/Redis Cluster. This project is used and sponsored by Ericsson. It is a fork of the now unmaintained hiredis-vip.
BSD 3-Clause "New" or "Revised" License

Time taken to return to application from hiredis library seems to be too long (>60sec) #142

Closed rahulKrishnaM closed 1 year ago

rahulKrishnaM commented 1 year ago

Hi,

Encountered a case wherein processing seems to be stuck in the hiredis-cluster library and does not return to the application for more than 60 seconds. We have a mechanism in place that checks whether a thread is taking too long to report a heartbeat; in this case the heartbeat is missed, which results in an exception being thrown.

This is the backtrace from the flow:

<heartbeat missing for a long time, resulting in a signal/crash in the frames above>
#7  0x00007eff1914d90f in poll () from /lib64/libc.so.6
#8  0x00007eff1ab5a2ba in redisContextWaitReady (c=c@entry=0x26bac60, msec=msec@entry=2000) at net.c:250
#9  0x00007eff1ab5a991 in _redisContextConnectTcp (c=c@entry=0x26bac60, addr=<optimized out>, port=<optimized out>, timeout=<optimized out>,  source_addr=<optimized out>) at net.c:504
#10 0x00007eff1ab5aaa5 in redisContextConnectBindTcp (c=c@entry=0x26bac60, addr=<optimized out>, port=<optimized out>, timeout=<optimized out>, source_addr=<optimized out>) at net.c:544
#11 0x00007eff1ab592d7 in redisConnectWithOptions (options=options@entry=0x7ffc201f96e0) at hiredis.c:818
#12 0x00007eff1ab406b4 in ctx_get_by_node (cc=cc@entry=0x2859000, node=node@entry=0x2b085f0) at hircluster.c:1993
#13 0x00007eff1ab409b1 in node_get_which_connected (cc=cc@entry=0x2859000) at hircluster.c:2052
#14 0x00007eff1ab4990a in cluster_config_get (config_name=0x7eff1ab4a7ad "cluster-node-timeout", config_value_len=<synthetic pointer>, cc=0x2859000) at hircluster.c:2072
#15 redisClusterAsyncRetryCallback (ac=0x2821dc0, r=0x0, privdata=0x2b1e270) at hircluster.c:3926
#16 0x00007eff1ab55e09 in __redisRunCallback (reply=0x0, cb=0x7ffc201f9820, ac=0x2821dc0) at async.c:288
#17 __redisAsyncFree (ac=0x2821dc0) at async.c:310
#18 0x00007eff1ab566b9 in __redisAsyncDisconnect (ac=<optimized out>) at async.c:402
#19 0x00007eff1ab56765 in __redisAsyncHandleConnectFailure (ac=0x2821dc0) at async.c:604
#20 __redisAsyncHandleConnect (ac=ac@entry=0x2821dc0) at async.c:617
#21 0x00007eff1ab56e5e in redisAsyncHandleRead (ac=0x2821dc0) at hiredis/async.c:655
#22 0x00007eff1960f243 in AdapterEvents::redisLibmyeventHandler (this=0x26bb8c0, fd=20, events=29, info=...) at hiredis_libmyevent.h:89
<application trying to do a socket read in the frames below>

From what I could gather from the backtrace and frames, this is the flow of events:

  1. The application attempts a read as part of an epoll event seen on the socket.
  2. As part of the redisAsyncHandleRead() operation, hiredis tries to connect() to a specific Redis server node via __redisAsyncHandleConnect().
  3. connect() fails with the error "Connection timeout", which triggers a __redisAsyncFree() operation that tries to clean up all outstanding commands (for which no response has been received) to that particular Redis server node.
  4. There seem to be around 128 outstanding commands in the obuf of this particular Redis server node at this point.
  5. hiredis runs a while loop over all these outstanding commands and invokes their callbacks with an empty reply towards hiredis-cluster/the application.
  6. hiredis-cluster processes these callbacks in redisClusterAsyncRetryCallback() and keeps track of failure responses; once it sees 5 failing responses, it decides to relearn the topology after a timeout ("cluster-node-timeout") configured at the server.
  7. To fetch the "cluster-node-timeout" value from the Redis server, hiredis-cluster frames a new command and tries to connect to one of the previously discovered nodes (decoded from the frames and listed below): 192.168.65.161, 192.168.121.79, 192.168.235.38, 192.168.221.225, 192.168.2.202, 192.168.165.198, 192.168.121.12, 192.168.165.182
  8. hiredis-cluster attempts connections to the Redis server IPs in the above order, and in doing so redisConnectWithOptions() performs a blocking connect() call to each server IP.
  9. At the time the heartbeat timeout occurred, hiredis-cluster was attempting to connect to the 6th IP (192.168.165.198). So the previous 5 connection attempts had failed, which could account for a long delay since the calls are all blocking (see the sketch after this list).
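
To make steps 8-9 concrete, below is a purely illustrative sketch (not hiredis-cluster code) of how the blocking attempts can add up across the node list above, using our 2 second connect timeout for illustration:

```c
/* Illustrative only: models the worst case of steps 8-9, where each
 * unreachable node holds the caller for up to the connect timeout
 * before the next node is tried. Not hiredis-cluster code. */
#include <stdio.h>

int main(void) {
    const char *nodes[] = {
        "192.168.65.161", "192.168.121.79", "192.168.235.38",
        "192.168.221.225", "192.168.2.202", "192.168.165.198",
        "192.168.121.12", "192.168.165.182",
    };
    const double connect_timeout_s = 2.0; /* our configured connect timeout */
    double worst_case_s = 0.0;

    for (size_t i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++) {
        /* A blocking connect to an unreachable node can consume the full
         * connect timeout before ctx_get_by_node() moves on. */
        worst_case_s += connect_timeout_s;
        printf("after %-15s : up to %.0f s spent\n", nodes[i], worst_case_s);
    }
    return 0;
}
```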

I assume the analysis above holds. If so, is there a way or suggestion to improve this? Do you see any other way of doing this so that processing doesn't get stuck in hiredis-cluster?

@bjosv

bjosv commented 1 year ago

This relates to #19 and the conditions haven't changed since then. I guess you already have a pretty low connect timeout?

rahulKrishnaM commented 1 year ago

Yes, connect-timeout is currently set to 2 seconds.
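
For reference, this is roughly how the timeout is configured on our side; a minimal sketch assuming the redisClusterSetOptionConnectTimeout / redisClusterSetOptionAddNodes setters from hircluster.h, with a placeholder node address:

```c
/* Minimal sketch: configure a 2 s connect timeout on a cluster context.
 * Assumes the setters declared in hircluster.h; the address is a placeholder. */
#include <stdio.h>
#include <sys/time.h>
#include "hircluster.h"

int main(void) {
    redisClusterContext *cc = redisClusterContextInit();
    struct timeval connect_timeout = {2, 0}; /* 2 seconds */

    redisClusterSetOptionAddNodes(cc, "127.0.0.1:7000"); /* placeholder node */
    redisClusterSetOptionConnectTimeout(cc, connect_timeout);

    if (redisClusterConnect2(cc) != REDIS_OK) {
        fprintf(stderr, "connect error: %s\n", cc->errstr);
    }
    redisClusterFree(cc);
    return 0;
}
```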

rahulKrishnaM commented 1 year ago

So, a blocking connect() call will effectively see a max wait of the connect_timeout value, right? And if a poll() follows the connect(), I guess there too we will see a max wait of connect_timeout. So, per Redis server IP there could be a 2*connect_timeout wait time in an error case.

bjosv commented 1 year ago

I'm not following you regarding the poll() that follows? The poll() in hiredis is the one that handles the timeout, I thought.

I guess the number of outstanding callbacks adds on time, depending on max-retry-count. If max-retry-count is higher, then fewer callbacks trigger a config-get, so that might be a remedy.
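
Roughly like the sketch below; I'm assuming the setter in your hircluster.h is named redisClusterSetOptionMaxRetry (older versions used redisClusterSetOptionMaxRedirect), so please check the header:

```c
/* Sketch only: raise the retry count so that fewer failing callbacks end up
 * triggering a config-get/topology refresh. The setter name is an assumption
 * (redisClusterSetOptionMaxRetry); verify it against your hircluster.h. */
#include "hircluster.h"

void configure_retries(redisClusterContext *cc) {
    /* A higher value means more failing replies are needed before
     * hiredis-cluster decides to relearn the topology. */
    redisClusterSetOptionMaxRetry(cc, 10);
}
```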

rahulKrishnaM commented 1 year ago

I'm not following you regarding the poll() that follows? The poll() in hiredis is the one that handles the timeout, I thought.

What I meant was this piece of code, which follows a failed connect() call: https://github.com/redis/hiredis/blob/master/net.c#L266

Hope I made this clear @bjosv
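
For clarity, here is a generic sketch of that connect()-then-poll() pattern (not hiredis's exact code):

```c
/* Generic sketch of the non-blocking connect pattern around the linked line;
 * not hiredis's exact code. With O_NONBLOCK set, connect() returns
 * immediately with EINPROGRESS and the poll() carries the actual wait,
 * up to the connect timeout. */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>

int connect_with_timeout(int fd, const struct sockaddr *addr, socklen_t len,
                         int timeout_ms) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    if (connect(fd, addr, len) == 0)
        return 0;                       /* connected immediately */
    if (errno != EINPROGRESS)
        return -1;                      /* hard failure, no wait */

    struct pollfd pfd = {.fd = fd, .events = POLLOUT};
    int rc = poll(&pfd, 1, timeout_ms); /* this is where the time is spent */
    if (rc <= 0)
        return -1;                      /* timeout (rc == 0) or error */

    int err = 0;
    socklen_t errlen = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen);
    return err == 0 ? 0 : -1;
}
```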

bjosv commented 1 year ago

Update: I'm currently trying out a change that uses async reconnection attempts. Hopefully this will fix the problem, but I need to try it out a bit more.
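
Illustration only (not the actual change): the idea is to lean on hiredis's async connect path, which returns immediately and reports the outcome through a connect callback instead of blocking in connect()/poll(). A minimal sketch, assuming the usual hiredis include path and a placeholder address:

```c
/* Sketch: an async connect returns right away; the result is delivered via a
 * connect callback once an event loop drives the context. */
#include <stdio.h>
#include <hiredis/async.h>

static void connect_cb(const redisAsyncContext *ac, int status) {
    if (status != REDIS_OK)
        fprintf(stderr, "connect failed: %s\n", ac->errstr);
}

int main(void) {
    redisAsyncContext *ac = redisAsyncConnect("127.0.0.1", 6379); /* placeholder */
    if (ac == NULL || ac->err) {
        fprintf(stderr, "allocation/connect error\n");
        return 1;
    }
    redisAsyncSetConnectCallback(ac, connect_cb);
    /* In a real application the context is attached to an event loop
     * (e.g. via a libevent/libev adapter) which later fires the callback. */
    return 0;
}
```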

rahulKrishnaM commented 1 year ago

Thanks for the update @bjosv. Making the reconnection attempts async would really help in reducing this wait period. I'll wait for the changes and then retry the scenario.

bjosv commented 1 year ago

A change covering this issue has now been delivered; hope it fixes the problems in your setup as well.

rahulKrishnaM commented 1 year ago

Nice! Thanks, bjosv, shall try pulling this and see how it goes.