Nordix / hiredis-cluster

C client library for Valkey/Redis Cluster. This project is used and sponsored by Ericsson. It is a fork of the now unmaintained hiredis-vip.
BSD 3-Clause "New" or "Revised" License
86 stars 42 forks source link

Crash while performing redisClusterAsyncFree() #222

Closed rahulKrishnaM closed 2 weeks ago

rahulKrishnaM commented 3 months ago

Hi, came across this crash during one of the runs.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007e36c41fe3a2 in selectNode (nodes=<optimized out>)
[Current thread is 1 (Thread 0x7e36b50b3a40 (LWP 21))]
(gdb) bt
#0  0x00007e36c41fe3a2 in selectNode (nodes=<optimized out>)
#1  updateSlotMapAsync (acc=acc@entry=0x58a8403266c0)
#2  0x00007e36c41fe598 in clusterSlotsReplyCallback (ac=<optimized out>, r=0x0, privdata=0x58a8403266c0)
#3  0x00007e36c420ce79 in __redisRunCallback (reply=0x0, cb=0x7fffa203cfc0, ac=0x58a84f4e8780)
#4  __redisAsyncFree (ac=0x58a84f4e8780)
#5  0x00007e36c420d645 in redisAsyncFree (ac=<optimized out>)
#6  0x00007e36c41f9e90 in freeRedisClusterNode (node=0x58a87df71bc0)
#7  0x00007e36c41f7c56 in _dictClear (ht=0x58a85c542180)
#8  dictRelease (ht=0x58a85c542180)
#9  0x00007e36c41fbfc6 in redisClusterFree (cc=0x58a84218b300)
#10 0x00007e36c41ff555 in redisClusterAsyncFree (acc=0x58a8403266c0)
### layers below called from application while shutting down.

Looks like the execution was at this line number when the crash happened

https://github.com/Nordix/hiredis-cluster/blob/a2f1b825dda8f1438bd0df93880185d1e879da99/hircluster.c#L3878

I am suspecting the reason for SEGV could be the below code area:

https://github.com/Nordix/hiredis-cluster/blob/a2f1b825dda8f1438bd0df93880185d1e879da99/dict.c#L179

where, even though we are freeing up the memory pointed by redisClusterNode* or dictEntry, we are not unlinking it from the table. Hence it gets looked up later on(during selectNode()), which would be a dangling pointer.

        if ((he = ht->table[i]) == NULL)
            continue;
        while (he) {
            nextHe = he->next;
            dictFreeEntryKey(ht, he);
            dictFreeEntryVal(ht, he);      // memory released, but entry not set to NULL, so ht->table[]->val holds a dangling pointer ?
            hi_free(he);                   // memory released, but entry not set to NULL, so ht->table[] holds a dangling pointer ?
            ht->used--;
            he = nextHe;
        }

And also, other thing that looks odd is the library trying to reattempt discovery when the redisClusterAsyncContext object is getting deleted. Maybe we should prevent that path altogether, in case of redisClusterAsyncFree() where it's about to tear everything down anyways.

cc: @bjosv

rahulKrishnaM commented 3 months ago

I don't see that address being explicitly set to 0 after memory is freed. Do you agree with the analysis?

rahulKrishnaM commented 3 months ago

Posting some more thoughts, selectNode() is trying to do a dictNext(). So, if we don't set the he object to NULL once we free it inside _dictClear(), any later dictNext() access might still think the entry as valid and try to process that (i.e when the execution is still inside _dictClear()).

rahulKrishnaM commented 3 months ago

maybe something in lines of:

    while (he) {
        nextHe = he->next;
   +    ht->tabl[i] = nextHe;
        dictFreeEntryKey(ht, he);
        dictFreeEntryVal(ht, he);
        hi_free(he);
        ht->used--;
        he = nextHe;
    }