Grokzen / redis-py-cluster

Python cluster client for the official redis cluster. Redis 3.0+.
https://redis-py-cluster.readthedocs.io/
MIT License
1.1k stars 316 forks source link

Add cursors parameter to scan_iter method #460

Open EugeniuZ opened 3 years ago

EugeniuZ commented 3 years ago

Hi,

This is a feature request to be able to resume a previously started scan for the database with a large set of keys removed. With the current implementation, the key scanning starts from the beginning resulting (internally) in the traversal of deleted keys as well. The background for the request: https://github.com/redis/redis/issues/9090 To take advantage of the extra parameter it is necessary to return somehow the cursor values together with the key (e.g. for the client to print the cursor values before interrupting the key scan.

Regards, Eugeniu

Grokzen commented 3 years ago

@EugeniuZ Have they updated the SCAN methods in redis-py that has caused the implementation in here to diverge away from redis-py? As the re-implementations here in this lib is based on what redis-py provides as feature so any new command, option or similar always must start there to be implemented to work with a single redis node and then (if it is needed) the re implementation can be done here in redis-py-cluster if there is further cluster optimizations or fixes that is needed.

Also if i read the original post on redis, i don't really get what you are after as i am not that well versed in the SCAN method and how it is supposed to work, i know there is tons of issues in general with the scan method and how it works in a clustered environment when you have failovers and other things to think about and adding this kind of feature on top of it just makes it more complex to implement properly.

You have to either make a PR to implement this as you think you need to implement this, because i do not really get what needs to be added here to provide you with what you need tbh

Grokzen commented 3 years ago

If i think it is what you are after, it is either one of two things.

One, it would mean to change the return value out from SCAN to perhaps be a tuple object like (last-used-cursor, value) but that is not a way forward as that both diverges away from the API you have with the SCAN method from redis-py so anyone using it would have to change their code to implement this pattern.

Two, add in some kind of argument just to support your need to have it print out, the problem here is that we take a much bigger stance on what should happen on the inside of the method in the case we want it printed. Not everyone wants it printed out to stdout as most code base might not even use this in a cli script so that would never really help them. They might want the option one instead so they can get it back via code and use the value in next request or in a retry scenario in a web server for example where a print would be useless as there are no interactive users working with the code.

Either of these solutions is really something good that i want to implement here in some way so i dont really know how to move forward with what you asked for tbh

EugeniuZ commented 3 years ago

Hi @Grokzen ,

I'll explain the use case (I think you already got it):

The solution here would be to supply the value of the cursor (in the cluster setup - cursors) to start from where I've left off.

In the standard python client it is possible to do so as they return the cursor.

I think scenario 1 is the way to go except don't change the current method, just provide another one:

Hope this clarifies the context. I'd also say this is very specific use case, which not that many users probably encounter. I've finished the data migration btw. Still this might be useful for myself and others using your library in the future.

Regards, Eugeniu

Grokzen commented 3 years ago

@EugeniuZ I dont see in your link where they return the cursor, what they do is yield from data which is not the cursor itself. Technically you do not really need to provide a new separate method for this case, you can just implement a kwarg switch to provide a opt-in solution for this altered return behavior.

mganesh commented 2 years ago

Not sure if redis-py was recently updated. Here is the latest which returns cursor.

https://github.com/redis/redis-py/blob/master/redis/client.py#L378

def parse_scan(response, **options):
    cursor, r = response
    return int(cursor), r
h4shr4t3 commented 2 years ago

Hi @Grokzen ,

I'll explain the use case (I think you already got it):

  • I use scan_iter() to go through the keys of a relatively large redis cluster (>700M keys).
  • While processing the keys (in batches) I also delete them from the redis cluster
  • At some point I realize I need to stop the script (unrelated to the redis client)
  • Next time I start the script same logic is invoked but I notice there is no activity in the client. I've checked the network activity and it was pretty active. In my original ticket the devs confirmed that the range of deleted keys is indeed iterated upon -> which explains the network activity I observe.

The solution here would be to supply the value of the cursor (in the cluster setup - cursors) to start from where I've left off.

In the standard python client it is possible to do so as they return the cursor.

I think scenario 1 is the way to go except don't change the current method, just provide another one:

  • existing users are not affected
  • no overhead (and output clutter) for printing the values is necessary as well.

Hope this clarifies the context. I'd also say this is very specific use case, which not that many users probably encounter. I've finished the data migration btw. Still this might be useful for myself and others using your library in the future.

Regards, Eugeniu

Hi Eugeniu, Can you tell me how you scan_iter through all this keys batching?