Grokzen / redis-py-cluster

Python cluster client for the official redis cluster. Redis 3.0+.
https://redis-py-cluster.readthedocs.io/
MIT License

Best practice for deletion from Redis #513

Closed: iDataist closed this issue 1 year ago

iDataist commented 1 year ago

Deleting from the sorted set is slow and the pipeline tends to give up. What are the best practices for deleting members from millions of keys in the sorted set? Is chunking the best solution so that the pipeline can handle it? Any advice for improving the code would be greatly appreciated.

BTW, the logging setup is there to suppress the verbose Redis client logging at the INFO level. If there are better ways to achieve the same result, please let me know.
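For reference, a minimal sketch of one alternative, assuming only the standard logging module and that the library's child loggers do not set explicit levels of their own: raising the level on the parent rediscluster logger should suppress the INFO output from all of its modules, not just rediscluster.connection.

import logging

# Raise the threshold on the parent logger; children such as
# rediscluster.connection inherit it unless they set an explicit level.
logging.getLogger("rediscluster").setLevel(logging.WARNING)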

Time complexity:

- `zadd`: O(log(N)) for each item added, where N is the number of elements in the sorted set
- `hset`: O(1) for each field/value pair added, so O(N) to add N field/value pairs when the command is called with multiple field/value pairs
- `zrem`: O(M*log(N)), with N being the number of elements in the sorted set and M the number of elements to be removed
- `hdel`: O(N), where N is the number of fields to be removed

Versions: python==3.8.12 redis-py-cluster==2.1.3 redis==3.5.3

Python script:

import logging
from functools import partial, partialmethod
import boto3
logging.TRACE = 31
logging.addLevelName(logging.TRACE, 'TRACE')
logging.Logger.trace = partialmethod(logging.Logger.log, logging.TRACE)
logging.trace = partial(logging.log, logging.TRACE)
logging.getLogger("rediscluster.connection").setLevel(logging.WARNING)
from rediscluster import RedisCluster

# "cluster" holds the cluster endpoint hostname
redis = RedisCluster(startup_nodes=[{"host": cluster, "port": "6379"}],
                     decode_responses=True,
                     skip_full_coverage_check=True,
                     ssl=True,
                     username="******",
                     password="******",)

# Adding
pipe = redis.pipeline()
for user_id, tag in user_tag_dict.items():
    if tag in user_tag:
        ukey = f"{{{item_type}}}|u|{user_id}"
        ikey = f"{{{item_type}}}|i|{item_id}"
        pipe.zadd(
            name=ukey,
            mapping={item_id: timestamp},
        )
        pipe.hset(ikey, int(user_id), "new")
for tag in user_tag:
    key = f"{{{item_type}}}|s|{tag}"
    pipe.zadd(name=key, mapping={item_id: timestamp})
pipe.zadd(
    name="default",
    mapping={item_id: timestamp},
)
pipe.execute()

# Removing
ikey = f"{{{item_type}}}|i|{item_id}"
pipe = redis.pipeline()
# Note: hkeys must be issued on the client, not the pipeline; a pipeline only
# queues the command and returns itself, so iterating over it never yields the
# hash fields.
for user_id in redis.hkeys(ikey):
    pipe.zrem(f"{{{item_type}}}|u|{user_id}", item_id)
    pipe.hdel(ikey, user_id)
for tag in user_tag:
    pipe.zrem(f"{{{item_type}}}|s|{tag}", item_id)
pipe.zrem("default", item_id)
pipe.execute()
Grokzen commented 1 year ago

At those levels of complexity and size, I would rebuild your key and grouping mechanism so that deleting a single key removes all the items, if your data structure allows for it of course. Otherwise you are left with manually chunking the data into smaller batches, one after another. Python can't handle very big data sets anyway, so limit yourself to something like 10,000 or 100,000 items per batch. Also note that with such big deletes you tend to block either a single node, or the entire cluster if you delete keys across multiple slots. I would do very small batches, like 1,000 or 10,000 keys in one go; that gives the cluster time to settle between batches, to process other keys, and to rebalance and sync as well.
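For illustration, here is a rough sketch of that batched approach, reusing the key layout from the script in the question. HSCAN, the batch size of 1,000 and the short sleep between batches are illustrative choices, not exact values prescribed above.

import time

# Batched removal: walk the item hash with HSCAN and delete in small
# pipelined chunks instead of one huge pipeline.
BATCH_SIZE = 1000
ikey = f"{{{item_type}}}|i|{item_id}"

cursor = 0
while True:
    # HSCAN returns a slice of the hash per call; count is only a hint, so the
    # actual batch size may vary a bit.
    cursor, fields = redis.hscan(ikey, cursor, count=BATCH_SIZE)
    if fields:
        pipe = redis.pipeline()
        for user_id in fields:
            pipe.zrem(f"{{{item_type}}}|u|{user_id}", item_id)
            pipe.hdel(ikey, user_id)
        pipe.execute()
        # Give the cluster a moment to serve other clients between batches.
        time.sleep(0.05)
    if cursor == 0:
        break

The few tag sets and the "default" set can then be cleaned up with one small pipeline afterwards, as in the original script.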

Also, in the future please use the Discussions tab when asking general questions. Issues are for actual code issues, and I would not classify this as a code issue per se; general help is handled in the Discussions tab.

iDataist commented 1 year ago

Thank you @Grokzen for your advice. It's very helpful.

I will post general questions in the Discussions tab in the future.