HoustonPutman closed this 10 months ago
@radu-gheorghe If you are interested!
My 2 cents:
Either way, I can't wait to try this autoscaling in more complex setups 🤓 Exciting stuff!
- Do we need to retry operations? I'm thinking that if we just dropped operations that e.g. timed out or errored out, the reconcile loop would naturally retry, no? Especially in the context of HPA, I bet it re-evaluates whether the current cluster has enough horsepower.
So yes, in a lot of cases the operator will notice that the cluster is not in the correct state if an operation is stopped (e.g. for a scale-down, the cluster won't be scaled down; for a rolling restart, not all pods will be up-to-date). If these clusterOps get deleted, the Solr Operator will re-create the clusterOp because the SolrCloud is not in the correct state.
However, there are some operations whose state is not as readily available to the operator. For example, in a scale-up operation, the first step is to scale the cluster up to the new number of pods. Then the operator will call a `BalanceReplicas` command. If that fails and the operation is dropped, the operator does not know that the cluster is imbalanced (because the StatefulSet has the correct number of pods), so it doesn't know that it should retry the `BalanceReplicas` command. This is the perfect use case for a "retry" queue.
Also, in the case that we have more cluster operations in the future, a queue lets us know which operations we have tried and are waiting to do in the future. Example: we have cluster operations A, B, and C, and all need to take place (with the ordering preference A -> B -> C). Both A and B are failing, and need C to occur first before they can succeed.

If we don't have a queue to know what is waiting to run in the future, then we will always flip between A and B: A fails, so we skip it and start B; B fails, so we pick the next cluster operation, which is A. We have no clue that A was failing before, so we have no reason not to run it next. C will never get a chance to run.

If we have the queue, then A will fail and be added to the queue. Then we will go through the necessary operations, skipping A because it is queued, and B will be chosen. When it fails, it will be added to the queue. Then the operator will go through the necessary operations skipping A and B because they are queued. C will be chosen to run, and succeed. Now A will be retried, and when it succeeds, B will be retried.
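The A/B/C scenario above can be sketched as a selection function that skips already-queued (i.e. previously failed) operations. This is just an illustration of the idea; `clusterOp` and `pickNextOp` are hypothetical names, not the Solr Operator's actual types.

```go
package main

import "fmt"

// clusterOp is a hypothetical stand-in for a cluster operation type.
type clusterOp string

// pickNextOp chooses the next operation to run in preference order,
// skipping any operation that is already sitting in the retry queue.
func pickNextOp(pending []clusterOp, retryQueue map[clusterOp]bool) (clusterOp, bool) {
	for _, op := range pending {
		if !retryQueue[op] {
			return op, true
		}
	}
	return "", false
}

func main() {
	pending := []clusterOp{"A", "B", "C"} // ordering preference A -> B -> C
	retryQueue := map[clusterOp]bool{}

	// A fails and is queued for retry; B is chosen next instead of
	// flipping back to A.
	retryQueue["A"] = true
	next, _ := pickNextOp(pending, retryQueue)
	fmt.Println(next) // B

	// B also fails and is queued; C finally gets its chance to run.
	retryQueue["B"] = true
	next, _ = pickNextOp(pending, retryQueue)
	fmt.Println(next) // C
}
```

Without the `retryQueue` check, the loop would always return A first and C would starve, exactly as described above.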
- Along the same lines, is there a priority that's linked to different kinds of requests? For example, if we had a scaleUp operation and a scaleDown operation came about, I would assume that whoever came last should win...
Yes, absolutely! So basically if the `scaleUp` happens first, and we have started the operation, we at least need to wait until it's in a stoppable state (which the retryQueue already does). Once it can be stopped, we go through and see if any other operations need to take place while it is stopped. The operator will see that a `scaleDown` needs to occur, and since `scale` operations override each other, the queued `scaleUp` will be replaced by the new `scaleDown` operation. So the `scaleDown` will ultimately win; it just has to wait, because we need to make sure that data isn't in transit while we are switching from the `scaleUp` to the `scaleDown`.
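The "scale operations override each other" behavior could be sketched like this. The types and function names here are illustrative assumptions, not the operator's real API:

```go
package main

import "fmt"

// queuedOp models a queued cluster operation; the fields are
// illustrative, not the Solr Operator's actual types.
type queuedOp struct {
	kind     string // e.g. "scaleUp", "scaleDown", "rollingUpdate"
	podCount int
}

// isScaleOp reports whether an operation is a scale operation;
// scale operations override each other in the queue.
func isScaleOp(kind string) bool {
	return kind == "scaleUp" || kind == "scaleDown"
}

// enqueue adds op to the queue. If op is a scale operation and a scale
// operation is already queued, the newer one replaces it, so the last
// scale request wins.
func enqueue(queue []queuedOp, op queuedOp) []queuedOp {
	if isScaleOp(op.kind) {
		for i, q := range queue {
			if isScaleOp(q.kind) {
				queue[i] = op
				return queue
			}
		}
	}
	return append(queue, op)
}

func main() {
	queue := []queuedOp{{kind: "scaleUp", podCount: 5}}
	queue = enqueue(queue, queuedOp{kind: "scaleDown", podCount: 3})
	fmt.Println(queue) // the scaleDown has replaced the queued scaleUp
}
```

The replacement only happens in the queue; the currently running operation still gets to reach a stoppable state before the switch, as described above.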
... And that maybe a rolling upgrade (if started) should ideally complete before we start scaling, otherwise we might get in trouble
Once again, yes! So the rolling upgrade comes first in the list of things to take care of. So if at the exact same time a user increases the podCount of the SolrCloud and changes something about the podTemplate, then the rolling restart will take precedence over the ScaleUp.
However, a rolling restart can take a while, so the "timeout" might happen, which would give the scaleUp a chance to start in the middle. Maybe we only actually "queue" the operation for later if the Solr Operator encounters an error; if it sees no error, then it won't queue the operation.
OR, even better idea: there are 2 timeouts:

- a short timeout for operations that have encountered an error, so that other queued operations get a chance to run, and
- a longer timeout (e.g. 10 minutes) for operations that are running without issue.
Once again these are "soft" timeouts, so they will eventually be retried. But the 10 minutes gives us a better guarantee of order-of-operations, as you mentioned, for operations that are running without issue.
I'll look into doing this ^ now. Good call out.
Thanks for explaining! It makes sense. I actually saw somewhere in your patch that you only allow one scaleUp/Down.
Down the road, maybe timeouts can be configurable? I'm thinking that especially the one for operations that don't see an error is important (with a long default) because in a large cluster, rebalancing can take hours.
I think that's a good idea, but definitely for the future!
Resolves #560
Fix the last remaining to-dos for safe lockable cluster operations. (Rolling Upgrade, Scale Up, Scale Down)
TODO: