Nordix / hiredis-cluster

C client library for Valkey/Redis Cluster. This project is used and sponsored by Ericsson. It is a fork of the now unmaintained hiredis-vip.

Crash in actx_get_by_node() #113

Closed SS-TruMinds closed 1 year ago

SS-TruMinds commented 1 year ago

Hello,

We are using hiredis-cluster release 0.8.1, and this issue concerns a crash we see in our performance testing.

We use the async hiredis APIs and invoke them both from the application and from hiredis callbacks. Our application is multi-threaded, and the hiredis context object and hiredis API calls are mutex-protected.

We noticed the following crash a few hours into our test:

Crash reason: SIGSEGV /0x00000080
Crash address: 0x0
Process uptime: not available

Thread 13 (crashed)
0 libhiredis_cluster.so.0.8!actx_get_by_node [hircluster.c : 3630 + 0x0]
1 libhiredis_cluster.so.0.8!redisClusterAsyncFormattedCommand [hircluster.c : 4117 + 0x8]
2 libhiredis_cluster.so.0.8!redisClustervAsyncCommand [hircluster.c : 4244 + 0x16]
3 libhiredis_cluster.so.0.8!redisClusterAsyncCommand [hircluster.c : 4258 + 0x5]

Kindly help resolve this crash.

Thank you.

bjosv commented 1 year ago

Interesting. Do you have any idea what events led up to this? Any timeouts, topology changes, or other logs that could give a lead?

SS-TruMinds commented 1 year ago

Unfortunately, no. We have been having issues with adapter timeouts on the client side and also with server failures, but the exact trigger for this crash is not known.

rahulKrishnaM commented 1 year ago

Hi @bjosv. We have some updates from the core analysis; posting them here. (All data is picked from either frame 0 or 1.)

It looks like both the table and slots pointers hold corrupted data at the time of the crash. In the hiredis-cluster code, the node lookup is done as table[slot_number]; the slot_number seen in the core file is 8535.

So,

(gdb) p acc.cc.table[8535]
$223 = (cluster_node *) 0x1c00550

which is the same as the corrupted cluster_node pointer seen at the time of the crash. So the corrupted node is pulled from the corrupted table data.
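For context, a minimal sketch (ours, not the library's exact code) of the table[slot_number] lookup described above; cluster_node comes from hircluster.h, table stands in for cc->table, and 16384 is the Redis Cluster slot count:

#include <stddef.h>
#include "hircluster.h" /* cluster_node */

/* Hedged sketch of the slot-to-node lookup; the real code is in
 * hircluster.c (see actx_get_by_node() in the backtrace above). */
static cluster_node *node_for_slot(cluster_node **table, unsigned int slot) {
    if (table == NULL || slot >= 16384) {
        return NULL;
    }
    /* If table[] holds a stale or freed pointer (as in this core dump),
     * the lookup itself succeeds; the SIGSEGV comes when the returned
     * node is dereferenced, e.g. to reach node->acon. */
    return table[slot];
}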

However, the nodes dictionary object (whose table holds struct dictEntry chains) seems to be holding valid data.

(gdb) p *acc.cc.nodes
$54 = {table = 0x1c53ed0, type = 0x7fe3b53c53a0, size = 8, sizemask = 7, used = 7, privdata = 0x0}

(gdb) ptype acc->cc->nodes->table
type = struct dictEntry {
    void *key;
    void *val;
    struct dictEntry *next;
} **

(gdb) p (char *)cc.nodes.table[0].key
$231 = 0x1bce9b1 "185.185.185.185:1111@16379"

(gdb) p *acc.cc.nodes.table[0].next
$43 = {key = 0x1c16551, val = 0x1fd79b0, next = 0x0}
(gdb) p (char *)cc.nodes.table[0].next.key
$232 = 0x1c16551 "240.240.240.240:1111@16379"

(gdb) p (char *)cc.nodes.table[1].key
$225 = 0x1cecee1 "213.213.213.213:1111@16379"

(gdb) p cc.nodes.table[2]
$12 = (dictEntry *) 0x0

(gdb) p (char *)cc.nodes.table[3].key
$226 = 0x1d183b1 "17.17.17.17:1111@16379"

(gdb) p cc.nodes.table[4]
$13 = (dictEntry *) 0x0

(gdb) p cc.nodes.table[5]
$14 = (dictEntry *) 0x0

(gdb) p cc.nodes.table[6]
$15 = (dictEntry *) 0x0

(gdb) p (char *)cc.nodes.table[7].key
$227 = 0x1c62ff1 "123.123.123.123:1111@16379"

(gdb) p (char *)cc.nodes.table[7].next.key
$228 = 0x1f413c1 "247.247.247.247:1111@16379"

(gdb) p acc.cc.nodes.table[7].next.next
$50 = (struct dictEntry *) 0x20da370
(gdb) p *acc.cc.nodes.table[7].next.next
$51 = {key = 0x1d06911, val = 0x1c552b0, next = 0x0}
(gdb) p (char *)cc.nodes.table[7].next.next.key
$229 = 0x1d06911 "26.26.26.26:1111@16379"

So there are a total of 7 nodes in the nodes dictionary at the time of the crash.
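(As a side note, the bucket-and-chain walk above amounts to the counting sketch below; the dictEntry layout is copied from the ptype output, while the helper itself is ours, for illustration only.)

#include <stddef.h>

/* Layout as shown by the ptype output above. */
typedef struct dictEntry {
    void *key;
    void *val;
    struct dictEntry *next;
} dictEntry;

/* Illustrative helper: count entries across all buckets and chains.
 * For the dict above (size = 8, two entries chained in bucket 0 and
 * three in bucket 7), this returns 7, matching 'used = 7'. */
static unsigned long dict_count(dictEntry **table, unsigned long size) {
    unsigned long n = 0;
    for (unsigned long i = 0; i < size; i++) {
        for (const dictEntry *e = table[i]; e != NULL; e = e->next) {
            n++;
        }
    }
    return n;
}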

All 7 (valid) cluster_node pointers are decoded below.

(gdb) p (cluster_node *)acc.cc.nodes.table[0].val
$122 = (cluster_node *) 0x1f9d540
(gdb) p (cluster_node *)acc.cc.nodes.table[0].next.val
$123 = (cluster_node *) 0x1fd79b0
(gdb) p (cluster_node *)acc.cc.nodes.table[1].val
$124 = (cluster_node *) 0x1c55740
(gdb) p (cluster_node *)acc.cc.nodes.table[3].val
$125 = (cluster_node *) 0x1f73cc0
(gdb) p (cluster_node *)acc.cc.nodes.table[7].val
$126 = (cluster_node *) 0x207a1d0
(gdb) p (cluster_node *)acc.cc.nodes.table[7].next.val
$127 = (cluster_node *) 0x1c162b0
(gdb) p (cluster_node *)acc.cc.nodes.table[7].next.next.val
$128 = (cluster_node *) 0x1c552b0

(gdb) p *$122
$129 = {name = 0x1f37a03 "5d51fc9a44b9c4dc166aa607aa3ab96a79c88c57", addr = 0x1bb9b81 "185.185.185.185:1111@16379", host = 0x1beacb1 "185.185.185.185", port = 1111, role = 1 '\001', pad = 0 '\000', failure_count = 0, con = 0x0, acon = 0x0, slots = 0x1c41ba0, slaves = 0x0, migrating = 0x0, importing = 0x0}
(gdb) p *$123
$130 = {name = 0x1c82833 "b4906a5eb2a9c8e8398c5b87202d2d8b360ec1f5", addr = 0x1f233d1 "240.240.240.240:1111@16379", host = 0x1dcdfb1 "240.240.240.240", port = 1111, role = 1 '\001', pad = 0 '\000', failure_count = 0, con = 0x0, acon = 0x0, slots = 0x1c55e10, slaves = 0x0, migrating = 0x0, importing = 0x0}
(gdb) p *$124
$131 = {name = 0x1bb8253 "dc74c7bab5209309369cbd2109bba446863d9231", addr = 0x1bbc1d1 "213.213.213.213:1111@16379", host = 0x1bbc071 "213.213.213.213", port = 1111, role = 1 '\001', pad = 0 '\000', failure_count = 0, con = 0x0, acon = 0x0, slots = 0x22089e0, slaves = 0x0, migrating = 0x0, importing = 0x0}
(gdb) p *$125
$132 = {name = 0x2510b13 "736b141a059a95839f02fa75af6449faf3f9e295", addr = 0x1d369c1 "17.17.17.17:1111@16379", host = 0x1c4edb1 "17.17.17.17", port = 1111, role = 1 '\001', pad = 0 '\000', failure_count = 0, con = 0x0, acon = 0x0, slots = 0x1e98030, slaves = 0x0, migrating = 0x0, importing = 0x0}
(gdb) p *$126
$133 = {name = 0x2af1f03 "21b0aaa3abe526da3adf201c9138a3cef1bdd824", addr = 0x1c01151 "123.123.123.123:1111@16379", host = 0x21892e1 "123.123.123.123", port = 1111, role = 1 '\001', pad = 0 '\000', failure_count = 0, con = 0x0, acon = 0x0, slots = 0x1e181d0, slaves = 0x0, migrating = 0x0, importing = 0x0}
(gdb) p *$127
$134 = {name = 0x1de2443 "19145935ddaa3e4a9d1f984876ac114634fb0622", addr = 0x2047f51 "247.247.247.247:1111@16379", host = 0x1c15311 "247.247.247.247", port = 1111, role = 1 '\001', pad = 0 '\000', failure_count = 0, con = 0x0, acon = 0x7fe3340164a0, slots = 0x21ab1f0, slaves = 0x0, migrating = 0x0, importing = 0x0}
(gdb) p *$128
$135 = {name = 0x1c46013 "64c9ac4a8aa04e3033f4abba019f3ffe3964cae2", addr = 0x1c1b6e1 "26.26.26.26:1111@16379", host = 0x1bb83b1 "26.26.26.26", port = 1111, role = 1 '\001', pad = 0 '\000', failure_count = 0, con = 0x0, acon = 0x7fe3900045b0, slots = 0x260a260, slaves = 0x0, migrating = 0x0, importing = 0x0}

Each cluster_node holds information about the slots it serves. The output below captures the slot information for each of the above 7 nodes.

(gdb) p $122.slots
$211 = (struct hilist *) 0x1c41ba0

(gdb) ptype $122.slots
type = struct hilist {
    listNode *head;
    listNode *tail;
    void *(*dup)(void *);
    void (*free)(void *);
    int (*match)(void *, void *);
    unsigned long len;
} *

(gdb) p *$122.slots
$136 = {head = 0x1dee720, tail = 0x1dee720, dup = 0x0, free = 0x7fe3b53b75e0, match = 0x0, len = 1}
(gdb) p *(cluster_slot *)$122.slots.head.value
$140 = {start = 5462, end = 8192, node = 0x1f9d540}
(gdb) p *$123.slots
$142 = {head = 0x1c47810, tail = 0x1c47810, dup = 0x0, free = 0x7fe3b53b75e0, match = 0x0, len = 1}
(gdb) p *(cluster_slot *)$123.slots.head.value
$143 = {start = 13655, end = 16383, node = 0x1fd79b0}
(gdb) p *$124.slots
$144 = {head = 0x1d77af0, tail = 0x1d77af0, dup = 0x0, free = 0x7fe3b53b75e0, match = 0x0, len = 1}
(gdb) p *(cluster_slot *)$124.slots.head.value
$145 = {start = 2731, end = 5461, node = 0x1c55740}
(gdb) p *$125.slots    ======> ALL empty, so this slots object is not holding any shards
$146 = {head = 0x0, tail = 0x0, dup = 0x0, free = 0x7fe3b53b75e0, match = 0x0, len = 0}
(gdb) p *$126.slots
$147 = {head = 0x1d35290, tail = 0x1d35290, dup = 0x0, free = 0x7fe3b53b75e0, match = 0x0, len = 1}
(gdb) p *(cluster_slot *)$126.slots.head.value
$148 = {start = 10924, end = 13654, node = 0x207a1d0}
(gdb) p *$127.slots
$149 = {head = 0x2149970, tail = 0x2149970, dup = 0x0, free = 0x7fe3b53b75e0, match = 0x0, len = 1}
(gdb) p *(cluster_slot *)$127.slots.head.value
$150 = {start = 8193, end = 10923, node = 0x1c162b0}
(gdb) p *$128.slots
$151 = {head = 0x2c6b6c0, tail = 0x2c6b6c0, dup = 0x0, free = 0x7fe3b53b75e0, match = 0x0, len = 1}
(gdb) p *(cluster_slot *)$128.slots.head.value
$152 = {start = 0, end = 2730, node = 0x1c552b0}
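(The per-node walks above can be summarized as the traversal sketch below; the local type mirrors are copied from the ptype output and are ours, not the library's headers, and print_slots is an illustrative helper.)

#include <stdint.h>
#include <stdio.h>

/* Local mirrors of the list types shown by ptype above (hedged). */
typedef struct listNode {
    struct listNode *prev;
    struct listNode *next;
    void *value;
} listNode;

typedef struct hilist {
    listNode *head;
    listNode *tail;
    void *(*dup)(void *);
    void (*free)(void *);
    int (*match)(void *, void *);
    unsigned long len;
} hilist;

/* Fields as printed in $140..$152; 'node' is kept opaque here. */
typedef struct slot_view {
    uint32_t start;
    uint32_t end;
    void *node;
} slot_view;

/* The same traversal the gdb session does by hand: walk one node's
 * slots list and print each {start, end, node} triple. */
static void print_slots(const hilist *slots) {
    const listNode *ln = slots ? slots->head : NULL;
    for (; ln != NULL; ln = ln->next) {
        const slot_view *s = ln->value;
        printf("%u-%u -> node %p\n", s->start, s->end, s->node);
    }
}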

The slot-to-cluster_node mapping from the above data:

0-2730      node = 0x1c552b0
2731-5461   node = 0x1c55740
5462-8192   node = 0x1f9d540
8193-10923  node = 0x1c162b0
10924-13654 node = 0x207a1d0
13655-16383 node = 0x1fd79b0

So this indicates 6 valid slot ranges, meaning there is 1 extra node ($125 in gdb) that has no slots distributed to it.

One more interesting observation is that contiguous slots in the corrupted table object hold the same cluster_node address (even though that address is invalid), just as they should when ranges of slots are assigned across the cluster_nodes.

(gdb) x/16384gx cc.table
0x2dd68d0: 0x0000000001bb80f0  0x0000000001bb80f0
0x2dd68e0: 0x0000000001bb80f0  0x0000000001bb80f0
0x2dd68f0: 0x0000000001bb80f0  0x0000000001bb80f0
0x2dd6900: 0x0000000001bb80f0  0x0000000001bb80f0
...
0x2df6890: 0x0000000001bcd120  0x0000000001bcd120
0x2df68a0: 0x0000000001bcd120  0x0000000001bcd120
0x2df68b0: 0x0000000001bcd120  0x0000000001bcd120
0x2df68c0: 0x0000000001bcd120  0x0000000001bcd120

6 unique cluster_node addresses came up (listed below) across the entire 16384 slots, indicating there were 6 nodes available at some point in time (versus 7 at the time of the crash).

0x0000000001bb80f0
0x0000000001e82c60
0x0000000001c49130
0x0000000001c00550  --> the cluster_node pointer which crashed
0x0000000001beb2b0
0x0000000001bcd120
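This pattern of long identical runs is what you would expect from rebuilding the table out of slot ranges, roughly as in the hedged sketch below (the types and function name are illustrative; the real rebuild lives in hircluster.c's route-update path):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_SLOTS 16384          /* Redis Cluster slot count */

typedef struct cluster_node cluster_node; /* opaque here */

/* Hedged mirror of a slot range (cf. cluster_slot: {start, end, node}). */
typedef struct slot_range {
    uint32_t start;
    uint32_t end;
    cluster_node *node;
} slot_range;

/* Rebuild the slot-to-node table from per-node ranges.  Long runs of
 * identical pointers, as in the x/16384gx dump above, fall out of the
 * inner loop naturally. */
static int build_slot_table(cluster_node *table[NUM_SLOTS],
                            const slot_range *ranges, size_t n) {
    memset(table, 0, NUM_SLOTS * sizeof(table[0]));
    for (size_t i = 0; i < n; i++) {
        if (ranges[i].start > ranges[i].end || ranges[i].end >= NUM_SLOTS)
            return -1;           /* reject a malformed range */
        for (uint32_t s = ranges[i].start; s <= ranges[i].end; s++)
            table[s] = ranges[i].node;
    }
    return 0;
}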

We are not really sure what path was taken for the corruption to happen. Please share any thoughts you have on this.

bjosv commented 1 year ago

Just as a reference, what cluster_slots do you have in cc->slots? Do they match the ones found in $140 to $152?

bjosv commented 1 year ago

And also, what called redisClusterAsyncCommand (i.e. frame 4 and up)?

Thread 13 (crashed)
0 libhiredis_cluster.so.0.8!actx_get_by_node [hircluster.c : 3630 + 0x0]
1 libhiredis_cluster.so.0.8!redisClusterAsyncFormattedCommand [hircluster.c : 4117 + 0x8]
2 libhiredis_cluster.so.0.8!redisClustervAsyncCommand [hircluster.c : 4244 + 0x16]
3 libhiredis_cluster.so.0.8!redisClusterAsyncCommand [hircluster.c : 4258 + 0x5]

Was it from within a response callback, and in that case, was it a NULL response?

rahulKrishnaM commented 1 year ago

@bjosv No, that data looks all corrupted.

(gdb) p *cc
$208 = {err = 0, errstr = '\000' <repeats 127 times>, flags = 0, connect_timeout = 0x1bce800, command_timeout = 0x1bfe520, max_retry_count = 5, username = 0x0, password = 0x0, nodes = 0x1c49a50, slots = 0x1cc0ca0, route_version = 3, table = 0x2dd68d0, requests = 0x0, retry_count = 0, need_update_route = 0, update_route_time = 0, ssl = 0x0, ssl_init_fn = 0x0}
(gdb) p cc->slots
$209 = (struct hiarray *) 0x1cc0ca0
(gdb) p *cc.slots
$210 = {nelem = 6, elem = 0x213e400, size = 8, nalloc = 6}

(gdb) p (cluster_slot*)(cc.slots->elem)
$212 = (cluster_slot*) 0x213e400
(gdb) p *$212
$213 = {start = 32733440, end = 0, node = 0x1dfc5b0}
(gdb) p (cluster_slot*)(cc.slots->elem + 8)
$214 = (cluster_slot*) 0x213e408
(gdb) p *(cluster_slot*)(cc.slots->elem + 8)
$215 = {start = 31442352, end = 0, node = 0x1bb83d0}
(gdb) p *(cluster_slot*)(cc.slots->elem + 8*2)
$216 = {start = 29066192, end = 0, node = 0x1bef460}
(gdb) p *(cluster_slot*)(cc.slots->elem + 8*3)
$217 = {start = 29291616, end = 0, node = 0x1d57580}
(gdb) p *(cluster_slot*)(cc.slots->elem + 8*4)
$218 = {start = 30766464, end = 0, node = 0x1bead20}
(gdb) p *(cluster_slot*)(cc.slots->elem + 8*5)
$219 = {start = 29273376, end = 0, node = 0x40}

rahulKrishnaM commented 1 year ago

And also, what called redisClusterAsyncCommand (i.e. frame 4 and up)?

Thread 13 (crashed)
0 libhiredis_cluster.so.0.8!actx_get_by_node [hircluster.c : 3630 + 0x0]
1 libhiredis_cluster.so.0.8!redisClusterAsyncFormattedCommand [hircluster.c : 4117 + 0x8]
2 libhiredis_cluster.so.0.8!redisClustervAsyncCommand [hircluster.c : 4244 + 0x16]
3 libhiredis_cluster.so.0.8!redisClusterAsyncCommand [hircluster.c : 4258 + 0x5]

Was it from within a response callback, and in that case, was it a NULL response?

No, this was called directly from the application, which was trying to send out a command to the Redis server.

bjosv commented 1 year ago

You state that you run multi-threaded; the question is, in what way?

When a response is sent from a Redis node, the event system in use will notice that data has been received on the socket, which triggers a call to hiredis' redisProcessCallbacks(). That function parses the reply and calls hiredis-cluster's callbacks. In the case where slots have moved, the MOVED response triggers a fetch of the new topology, and a hiredis-cluster callback updates the nodes dict and the slot-to-node table.

If the event system is able to process replies in multiple threads at the same time (likely replies from different nodes), there is a chance that two replies run the callback that updates the dict and the table concurrently. This might result in an inconsistency between the dict and the table. The dict would probably be OK, since updating it is a change of pointer, but the table is built using a dict that may suddenly change underneath it.
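A rough sketch of that interleaving hazard (all names here are illustrative stand-ins, not hiredis-cluster's real internals):

#include <stdlib.h>

typedef struct cluster_node cluster_node;

struct topology {
    void *nodes_dict;            /* plays the role of cc->nodes */
    cluster_node **table;        /* plays the role of cc->table (16384 slots) */
};

/* Shared state with no synchronization in this sketch. */
static struct topology *g_topology;

/* Hypothetical helper: queries the cluster and builds a fresh topology. */
extern struct topology *fetch_topology(void);

/* Run from a MOVED-reply callback.  If two event-loop threads execute
 * this concurrently, one can free the old topology while the other still
 * reads it (or builds its table from a dict being replaced), leaving
 * dangling cluster_node pointers -- the kind of corruption seen in
 * cc->table in this crash. */
static void on_moved_reply(void) {
    struct topology *fresh = fetch_topology();
    struct topology *old = g_topology;
    g_topology = fresh;          /* unsynchronized pointer swap */
    free(old);                   /* 'old' may still be in use elsewhere */
}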

rahulKrishnaM commented 1 year ago

Hi @bjosv, all access into hiredis is restricted using one common mutex. So any thread trying to call into hiredis first acquires the lock and then makes the hiredis command call. Once the command execution is completed, the lock is released, and another thread can acquire it and invoke its command call.

The event system can only process one reply at a time (it always runs in a single-threaded context). It takes the lock before calling into hiredis for any read/write event, which will then execute redisProcessCallbacks().
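For reference, the access pattern described above might look roughly like this (a hedged sketch; g_hiredis_lock and send_get are ours, while redisClusterAsyncCommand is the library's API):

#include <pthread.h>
#include "hircluster.h"

/* One process-wide lock guarding every call into hiredis-cluster. */
static pthread_mutex_t g_hiredis_lock = PTHREAD_MUTEX_INITIALIZER;

/* Application threads and the event-loop thread all funnel through the
 * same mutex, so only one thread is ever inside the library at a time. */
static int send_get(redisClusterAsyncContext *acc, redisClusterCallbackFn *cb,
                    void *privdata, const char *key) {
    pthread_mutex_lock(&g_hiredis_lock);
    int rc = redisClusterAsyncCommand(acc, cb, privdata, "GET %s", key);
    pthread_mutex_unlock(&g_hiredis_lock);
    return rc;
}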

rahulKrishnaM commented 1 year ago

We were thinking about how the above stack trace could be related to the one in ticket #124.

One theory that came up in our internal discussion is this: imagine a case where,

If the above scenario is a possibility, do you agree that we could end up in a case like the above, and that the two are linked?

bjosv commented 1 year ago

I believe your scenario was a possibility and was likely fixed by #125. Please re-open the issue if not.