apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0

Create a lock for involved SolrCloud operations #560

Closed HoustonPutman closed 1 year ago

HoustonPutman commented 1 year ago

The Solr Operator has started to perform pretty involved management operations, including rolling upgrades (which include data migration for ephemeral clusters) and scaling (which can include data migration from decommissioned pods and to newly created pods).

These operations will not necessarily behave well together, so it stands to reason that we should allow only one operation to occur at a time.

Another important consideration is that many of these operations require async Collections API calls to Solr, and it's imperative that these operations finish once they begin. Without a locking mechanism that stores the state of these operations outside of the expected state of the SolrCloud, the cloud could easily be reverted to a previous state that no longer requires the operation. Since the Solr Operator is stateless, it would not know that these operations were taking place in the previous reconcile loop, and thus a management operation could be halted halfway through.

Each management command could be locked down individually to ensure that this doesn't happen; however, with a centralized locking mechanism it's easy to ensure that any "lockable" cluster operation must finish once it has been started.

A good place to store this metadata is in the StatefulSet annotations. We can expect that users won't try to update these themselves, and the only time they might go away is if the StatefulSet itself is deleted. In that case, the Solr data might no longer be there, and there is no guarantee as to the state of the cluster once the StatefulSet is re-created, so it's probably best to stop the cluster operation anyway.
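
As a rough illustration of that idea, here is a minimal sketch in controller-runtime-style Go of acquiring and releasing such a lock through StatefulSet annotations. The annotation keys and function names here are hypothetical, not the operator's actual API:

```go
package solrcloud

import (
	"context"
	"errors"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical annotation keys; the operator's real keys may differ.
const (
	clusterOpLockAnnotation     = "solr.apache.org/clusterOpLock"
	clusterOpMetadataAnnotation = "solr.apache.org/clusterOpMetadata"
)

// acquireClusterOpLock marks the StatefulSet as locked for the given operation.
// It fails if a different operation already holds the lock, so only one
// "lockable" cluster operation can run at a time.
func acquireClusterOpLock(ctx context.Context, c client.Client, sts *appsv1.StatefulSet, operation, metadata string) error {
	if sts.Annotations == nil {
		sts.Annotations = map[string]string{}
	}
	if existing, ok := sts.Annotations[clusterOpLockAnnotation]; ok && existing != operation {
		return errors.New("cluster is already locked for operation: " + existing)
	}
	sts.Annotations[clusterOpLockAnnotation] = operation
	sts.Annotations[clusterOpMetadataAnnotation] = metadata
	// Persisting the lock on the StatefulSet keeps it outside the operator's
	// memory, so a restarted (stateless) operator can pick the in-progress
	// operation back up on the next reconcile loop.
	return c.Update(ctx, sts)
}

// releaseClusterOpLock clears the lock once the operation has fully completed.
func releaseClusterOpLock(ctx context.Context, c client.Client, sts *appsv1.StatefulSet) error {
	delete(sts.Annotations, clusterOpLockAnnotation)
	delete(sts.Annotations, clusterOpMetadataAnnotation)
	return c.Update(ctx, sts)
}
```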

Currently, the two "lockable" cluster operations would be:

- Managed rolling updates (rolling upgrades/restarts)
- Scaling up or down (with the accompanying data migration)

HoustonPutman commented 1 year ago

The only thing left to do here is to mitigate failures during the lockable operations.

For example, a user could provide a bad config while a scale-up is happening. In that case, the new pods might never come up, and the scale-up would never unlock (and a rolling restart can't happen, because the cluster is locked). The scale-up should be able to fail, give up its lock, and try again later.
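
A minimal sketch of what that mitigation could look like, reusing the hypothetical annotation keys from the sketch above and adding an assumed start-time annotation and timeout (none of these names come from the operator itself):

```go
package solrcloud

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical annotation keys and timeout; names and values are illustrative only.
const (
	clusterOpLockAnnotation      = "solr.apache.org/clusterOpLock"
	clusterOpMetadataAnnotation  = "solr.apache.org/clusterOpMetadata"
	clusterOpStartTimeAnnotation = "solr.apache.org/clusterOpStartTime"
	clusterOpTimeout             = 30 * time.Minute
)

// maybeReleaseStalledLock gives up a lock whose operation has not finished within
// clusterOpTimeout, so other operations (e.g. a rolling restart that fixes a bad
// config) are no longer blocked. The abandoned operation can be retried on a
// later reconcile loop.
func maybeReleaseStalledLock(ctx context.Context, c client.Client, sts *appsv1.StatefulSet) (released bool, err error) {
	startStr, locked := sts.Annotations[clusterOpStartTimeAnnotation]
	if !locked {
		return false, nil
	}
	start, parseErr := time.Parse(time.RFC3339, startStr)
	if parseErr == nil && time.Since(start) <= clusterOpTimeout {
		// The operation still has time to finish; keep the lock.
		return false, nil
	}
	// Timed out (or the timestamp is unreadable): release the lock so the
	// cluster isn't stuck, and let a later reconcile retry the operation.
	delete(sts.Annotations, clusterOpLockAnnotation)
	delete(sts.Annotations, clusterOpMetadataAnnotation)
	delete(sts.Annotations, clusterOpStartTimeAnnotation)
	return true, c.Update(ctx, sts)
}
```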