Add in a retry queue for clusterOps

HoustonPutman commented 11 months ago

Resolves #560

Fix the last remaining to-dos for safe lockable cluster operations. (Rolling Upgrade, Scale Up, Scale Down)

Make the locks expire after a certain time (10 minutes), but when they expire they are put in a retryQueue.
Other clusterOps will be run while its in the queue, but when no other clusterOps have to run, it will be pulled back off the queue.
If a newer version of a clusterOp is available, the queued one should be abandoned in favor of the newer command. (scale up 3 ->5, queued, then scale down 5 -> 3, the scale up should not be continued if a scale down is necessary)

TODO:

[x] Retry Queue Implementation
[x] Replace ClusterOp with a newer version of the same op, if one exists
[x] Integration tests
[x] Documentation for user expectations
[x] Change log

HoustonPutman commented 11 months ago

@radu-gheorghe If you are interested!

radu-gheorghe commented 11 months ago

My 2 cents:

good stuff, I think a queue is needed, thanks for implementing it 🙏
I think it's awesome that you added docs about how things work!
Do we need to retry operations? I'm thinking that if we just dropped operations that e.g. timed out or errored out, the reconcile loop will naturally retry, no? Especially in the context of HPA, I bet it re-evaluates whether the current cluster has enough horsepower
Along the same lines, is there a priority that's linked to different kinds of requests? For example, if we had a scaleUp operation and a scaleDown operation came about, I would assume that whoever came last should win. And that maybe a rolling upgrade (if started) should ideally complete before we start scaling, otherwise we might get in trouble?

Either way, I can't wait to try this autoscaling in more complex setups 🤓 Exciting stuff!

HoustonPutman commented 11 months ago

Do we need to retry operations? I'm thinking that if we just dropped operations that e.g. timed out or errored out, the reconcile loop will naturally retry, no? Especially in the context of HPA, I bet it re-evaluates whether the current cluster has enough horsepower

So yes, in a lot of cases, the operator will notice that the state of the cluster is not in the correct place if an operation is stopped. (e.g. if you are doing a scale down, the cluster won't be scaled down. If you are doing a rolling restart, not all pods will be up-to-date). If these clusterOps get deleted, the Solr Operator will re-create the clusterOp because the SolrCloud is not in the correct state.

However there are some operations that don't have their state as readily-available to the operator. For example in a scale-up operation, the first step is to scale up the cluster to the new number of pods. Then the operator will call a BalanceReplicas command. If that fails, and the operation is dropped, the operator does not know that the cluster is imbalanced (because the StatefulSet has the correct # of pods). Therefore it doesn't know that it should try to redo the balanceReplicas command again. This is the perfect use case for a "retry" queue.

Also in the case that we have more cluster operations in the future, a queue lets us know the operations that we have tried and are waiting to do in the future. Example: We have cluster operations A, B and C, and all need to take place. (and they have the ordering preference of A -> B -> C). Both A & B are failing, and need C to occur first before A & B can succeed. If we don't have a queue to know what is waiting to run in the future, then we will always flip between A & B. (A fails -> so skip it and start B. B fails -> pick the next cluster operation which is A. We have no clue that A was failing before, so we have no reason to not choose to run it next.) C will never get a chance to run. If we have the queue, then A will fail, and be added to the queue. Then we will go through the necessary operations, skipping A because it is queued and B will be chosen. When it fails, it will be added to the queue. Then the operator will go through the necessary operations skipping A & B because they are queued. C will be chosen to run, and succeed. Now A will be retried. When it succeeds, B will be retried.

Along the same lines, is there a priority that's linked to different kinds of requests? For example, if we had a scaleUp operation and a scaleDown operation came about, I would assume that whoever came last should win...

Yes, absolutely! So basically if the scaleUp happens first, and we have started the operation, we at least need to wait until its in a stoppable state (which the retryQueue already does). Once it can be stopped, we go through and see if any other operations need to take place while it is stopped. The operator will see that a scaleDown needs to occur, and since scale operations override eachOther, the queued scaleUp will be replaced by the new scaleDown operation. So the scaleDown will ultimately win, it just has to wait because we need to make sure that data isn't in transit while we are switching from the scaleUp to the scaleDown.

... And that maybe a rolling upgrade (if started) should ideally complete before we start scaling, otherwise we might get in trouble

Once again, yes! So the rolling upgrade comes first in the list of things to take care of. So if at the exact same time a user increases the podCount of the SolrCloud and changes something about the podTemplate, then the rolling restart will take precedence over the ScaleUp.

However a rolling restart can take a while, so the "timeout" might happen, which will give the scaleUp a chance to start in the middle. Maybe we only actually "queue" the operation for later if the Solr Operator encounters an error. If it sees no error, then it won't queue the operation.

OR even better idea. there are 2 timeouts:

1 minute for an operation that sees an error
10 minutes for an operation that does not see an error

Once again these are "soft" timeouts, so they will eventually be retried. But the 10 minutes gives us a better guarantee of order-of-operations, as you mentioned, for operations that are running without issue.

I'll look into doing this ^ now. Good call out.

radu-gheorghe commented 11 months ago

Thanks for explaining! It makes sense. I actually saw somewhere in your patch that you only allow one scaleUp/Down.

Down the road, maybe timeouts can be configurable? I'm thinking that especially the one for operations that don't see an error is important (with a long default) because in a large cluster, rebalancing can take hours.

HoustonPutman commented 10 months ago

Down the road, maybe timeouts can be configurable? I'm thinking that especially the one for operations that don't see an error is important (with a long default) because in a large cluster, rebalancing can take hours.

I think that's a good idea, but definitely for the future!

apache / solr-operator

Add in a retry queue for clusterOps #596