Grokzen / redis-py-cluster

Python cluster client for the official redis cluster. Redis 3.0+.
https://redis-py-cluster.readthedocs.io/
MIT License

Redis is loading the dataset in memory #244

Open keith-z opened 6 years ago

keith-z commented 6 years ago

redis version: 3.2.6

problem: With a large dataset, when the master restarts within a short time, Sentinel considers the master available because the master answers "PING" with "LOADING". Since Sentinel detects the restart, no failover is triggered, and the service stays unavailable until the master finishes loading. During that window, redis-py-cluster hangs on "Redis is loading the dataset in memory" for a long time, about 15 minutes.

question: should the client handle the "Redis is loading the dataset in memory" exception?

Grokzen commented 6 years ago

If this problem is common to both non-cluster and cluster setups, I would say it should be dealt with inside redis-py and not here. I would also need a stack trace, or the position in the code where the "stuck" loading happens. Also, what would be your suggestion for how the client should "handle" this case? As far as I can see, either it should block until the cluster is available again, or a timeout should trigger and the user has to decide what to do in that case.

keith-z commented 6 years ago

Thank you for your reply. My environment is a cluster and I'm not sure whether this happens in a non-cluster environment. My client (redis-py-cluster) got the exception "<class 'redis.exceptions.BusyLoadingError'> - Redis is loading the dataset in memory". Have you seen this exception before? Should I catch this exception in the client (redis-py-cluster) and select another available cluster node?

Grokzen commented 6 years ago

I have never seen that kind of error; it was not handled in the reference client, and I have not seen any other client handle it either. I would expect it to be a very rare case. I would suggest that you at least try to handle it in your own stack. It feels like redis-py is working as expected here, but I am not really sure how to handle it on my side right now.

If possible, can you post the entire stack trace of the error for future reference? And can you post the stats of your cluster: master/slave setup and ratio, number of keys, primary data structures, etc.? I would like to understand what size of cluster we are talking about.
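For what it's worth, handling this "in your stack" as suggested above could look roughly like the sketch below. It assumes the 2.x rediscluster package and a hypothetical retry budget; none of this is part of the library itself.

```python
import time

from redis.exceptions import BusyLoadingError
from rediscluster import RedisCluster

rc = RedisCluster(startup_nodes=[{"host": "127.0.0.1", "port": "7000"}],
                  decode_responses=True)

def get_with_loading_retry(key, attempts=5, delay=2.0):
    """Retry a read while the target node is still loading its dataset."""
    for _ in range(attempts):
        try:
            return rc.get(key)
        except BusyLoadingError:
            # The node is replaying its RDB/AOF; back off and try again.
            time.sleep(delay)
    raise BusyLoadingError("node still loading after %d attempts" % attempts)
```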

theanti9 commented 6 years ago

I have seen this issue many times as well. For a little more detail from what I've seen: we have a 14-node cluster (7 masters and 7 slaves), with each node holding about 100 GB of data. Slots are evenly distributed, and most keys are hashes or sorted sets.

The problem arises when a failover happens. If a master node fails (or is blocked for too long) and is demoted to a slave, it dumps its memory and reloads it from the new master. Loading the RDB it downloaded from the master can take 10-15 minutes at that data size. While it is dumping its memory it is unreachable, but while it is loading, the driver starts routing traffic to it again, and any request sent there throws BusyLoadingError. So in our case, whenever a node fails over, we have about 15 minutes during which any read request routed to that node fails.

I've not seen this behavior with non-cluster Redis using Sentinel, because I think Sentinel waits until its own pings stop getting the busy-loading error before letting the node be used again. I have seen it with non-cluster Redis without Sentinel, though, when doing dumb manual routing.

The simplest solution would be to treat a BusyLoadingError almost like a MOVED response: retry the request on another node serving the same slot until it succeeds or you've exhausted the nodes that might hold the data.
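A rough sketch of that idea, written against plain redis-py connections rather than the client's real internals; the `nodes` argument is a hypothetical stand-in for however the client maps a slot to its master and replicas:

```python
from redis import Redis
from redis.exceptions import BusyLoadingError

def read_from_slot(command, key, nodes):
    """Try each node serving the key's slot, skipping nodes that are still
    loading their dataset (e.g. a replica replaying a large RDB)."""
    last_error = None
    for node in nodes:  # e.g. [replica, master] from a slot lookup
        conn = Redis(host=node["host"], port=node["port"])
        try:
            return conn.execute_command(command, key)
        except BusyLoadingError as exc:
            # This node is still loading; fall through to the next
            # candidate, much like following a MOVED redirect.
            last_error = exc
    raise last_error
```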

Fwiw, this isn't much of an issue if individual nodes are small. If each node holds only a few GB, Redis loads the data quickly enough that you'll see few or none of these errors. But large nodes (50-100 GB+) that take a long time to load the dataset will see many of them for a long time. And while I'd love it to be a rare case, it's not entirely uncommon; it happens probably once a month or more for us.

akrylysov commented 5 years ago

Experiencing the same issue after a failover; +1 for treating BusyLoadingError the same way as MovedError.

Grokzen commented 4 years ago

@theanti9 @akrylysov @keith-z Do any of you have ideas on how I can simulate this locally without having to build new clusters with tens of gigabytes of data that I would have to recreate each time?

I also did a bit of thinking about this error. If the node will be down for 10-15 minutes in some large-scale scenarios, how would handling BusyLoadingError like a MovedError really help? In the MovedError code path you reinitialize the cluster, probably try the same node again, and get BusyLoadingError again. Loop that a few times and you hit TTL exhaustion, and then you are back to an exception being raised to the code that uses the client class, just wrapped in a different exception class. Or is the idea that, for a READ-style command, retrying enough times against random nodes will eventually land you on a node that can serve the data?

Or are you suggesting that the client should wait indefinitely, or until a timeout period has been reached, before giving up on BusyLoadingError and raising a failure exception, basically the same as TimeoutError?

Right now I probably won't make this change for the 2.1.0 release, but I am willing to make it happen quickly for the 2.2.0 release if we can get this issue moving again.

theanti9 commented 4 years ago

@Grokzen I don't think I have a good suggestion for the simulation. Theoretically, even with a relatively small amount of data (1 GB?), if you hit the node with enough reads while this happens you could probably trigger it. But I'm not sure.

This specifically happens when a replica is the one syncing from the master. In a 1 master -> 1 replica setup, treating it like a MOVED response would send the request to the master, which would be pretty much guaranteed to succeed. In a 1 master -> N replicas scenario, choosing other nodes at random would be less likely to guarantee success, but I'm not sure exactly, as I've not run into that scenario.

When there are multiple replicas, in theory only one is resyncing, the one that was previously the master, so I think there would still be a decent chance of success, though it isn't guaranteed on the first try, since other things can always fail. In either scenario the master should be able to service the read, so there should be at least one node able to in all cases, I think.

Grokzen commented 4 years ago

@theanti9 Just to be clear so I understand the causality of events: basically I send CLUSTER FAILOVER to a slave node, it starts replicating from the master, and with a lot of data that would cause this error to be thrown? I think I actually saw a BusyLoadingError earlier today while debugging another issue, so I might be able to reproduce it manually with very little data.

Also, I might be a little confused, but isn't it the master that throws the BusyLoadingError? Or is it only the replica that the master is failing over to that causes the problem and throws this error? Are you running this with some read-replica setup that would cause the client to send reads to the replica instead of the master?

But what is your take on the other part, the timeout problem? In most cases, just moving on like a MovedError does and retrying a few times might be enough for very small clusters, but it would not solve your case unless you raise the TTL to some really high number to cover your 10-15 minutes of failure. And since there is no sleep involved in the MovedError path either, the TTL will just blast through the loop, attempting the command up to 16 times and then throwing BusyLoadingError again. Is a configuration option the way to go, with some kind of configurable delay and retry count?
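One possible shape for such a configuration option, purely as a sketch; the parameter names busy_loading_retry_attempts and busy_loading_retry_delay are hypothetical and not part of the library:

```python
import time

from redis.exceptions import BusyLoadingError

class LoadingRetryPolicy:
    """Hypothetical retry policy: sleep between BusyLoadingError retries
    instead of burning through the TTL loop instantly."""

    def __init__(self, busy_loading_retry_attempts=5, busy_loading_retry_delay=3.0):
        self.attempts = busy_loading_retry_attempts
        self.delay = busy_loading_retry_delay

    def run(self, func, *args, **kwargs):
        for attempt in range(self.attempts):
            try:
                return func(*args, **kwargs)
            except BusyLoadingError:
                if attempt == self.attempts - 1:
                    raise
                time.sleep(self.delay)
```

Usage would be something like `LoadingRetryPolicy(3, 5.0).run(client.get, "key")`, with the delay bounding how long a read waits for a loading node instead of immediately exhausting the retries.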

theanti9 commented 4 years ago

Sorry for not adding that initially, but yes, I think a TTL of sorts, either time-based or attempt-based, would be a good idea. Or something like a "max redirects" setting.

I should clarify what I believe the issue is: only reads issued to the replica node that is reloading fail. As far as I'm aware, requests to the master during that time still succeed, though if you are able, I would definitely recommend testing that. Unfortunately, I've moved companies and no longer work directly with Redis and haven't seen this in a while, so my memory may be a bit fuzzy.

Given that, a relatively low redirect limit or timeout is probably reasonable, since you shouldn't have to try nodes more than once; at least one should be able to service the request. You could even potentially use the standard request timeout.

Grokzen commented 4 years ago

Funny thing though: while I was trying to reproduce the BusyLoadingError, I actually found a new bug with readonly mode. If you fail over multiple times between one master and one slave, after the second failover and onward the client stops sending reads to the replica. But that is not really related to this issue, and I will have to sort it out separately.

Grokzen commented 4 years ago

The problem I am facing right now, while tracking down which read commands would fail with BusyLoadingError, is that for most of the commands whose natural code path targets a replica, the error handling and node handling in there is generally poor. I also run into some weird corner cases with a few commands I have tried to improve: for example, the cluster variant of KEYS sends the command to every node in the cluster, masters and replicas, to fetch all keys. In that case, even if a replica is returning LoadingError, retrying against only the masters does not really work with how the main loop logic is set up right now. But for normal commands run under READONLY mode, where reads are sent to replicas, falling back to the master for that command during the loading window can be applied.

I think I need to sit down and sketch out a way to rebuild the overall main loop logic, as it is starting to show cracks as more and more use cases are added.

As a simple stop-gap measure, I do think I can alter the BusyLoadingError handling a little toward what was proposed earlier in this thread. That is doable and might remove a few instances of the problem, but there would most likely still be edge cases left that the bigger overhaul will have to deal with later on.

Grokzen commented 4 years ago

@theanti9 I think I actually found a feature I did not know about. Apparently Redis has a DEBUG subcommand that lets me simulate the response I want from the server instead of sending a real command: I can send DEBUG ERROR with a fake message and the server returns it as an error. So basically just sending r.execute_command('DEBUG', 'ERROR', 'LOADING fake message') simulates the error you are having, and I can work on it much more smoothly :)
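For reference, a small sketch of using that trick, assuming a plain redis-py connection to any node of a local cluster (port 7000 here is just an example):

```python
from redis import Redis
from redis.exceptions import BusyLoadingError

r = Redis(host="127.0.0.1", port=7000)

try:
    # DEBUG ERROR makes the server reply with the given string as an error,
    # so redis-py parses the leading "LOADING" token into a BusyLoadingError
    # without any real dataset having to be loaded.
    r.execute_command("DEBUG", "ERROR", "LOADING fake message")
except BusyLoadingError as exc:
    print("simulated:", exc)
```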