apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0

Support managed scale down of SolrClouds #559

Closed HoustonPutman closed 1 year ago

HoustonPutman commented 1 year ago

Parent Ticket: https://github.com/apache/solr-operator/issues/536

Currently, when a SolrCloud is scaled down (the SolrCloud.spec.replicas option decreases), replicas are left on the Solr pods that are being decommissioned. This is problematic because the cluster state will be unhealthy until the SolrCloud is scaled back up and those pods are recreated.

When doing a rolling restart of SolrClouds with ephemeral data, the Solr operator will move data (replicas) off of a Solr pod before that Solr pod is deleted. This same logic can be used to ensure that the clusterState of Solr is healthy as the cluster is scaled down.
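To make the "move replicas first, delete the pod second" ordering concrete, here is a minimal Go sketch. It is not the operator's code: the ReplicaMover interface, decommissionPod, and fakeCluster are hypothetical names invented for illustration, and the in-memory fake only stands in for a real Solr cluster.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaMover is a hypothetical abstraction over whatever mechanism moves
// replicas off a Solr node. It is NOT the operator's real API.
type ReplicaMover interface {
	// ReplicasOn reports how many replicas currently live on the pod.
	ReplicasOn(pod string) int
	// MoveReplicasOff asks Solr to relocate every replica hosted on the pod.
	MoveReplicasOff(pod string) error
}

// ErrStillHostingReplicas signals that the pod is not yet safe to delete.
var ErrStillHostingReplicas = errors.New("pod still hosts replicas")

// decommissionPod evacuates the pod first and calls deletePod only once the
// pod no longer hosts any replicas.
func decommissionPod(m ReplicaMover, pod string, deletePod func(string) error) error {
	if m.ReplicasOn(pod) > 0 {
		if err := m.MoveReplicasOff(pod); err != nil {
			return fmt.Errorf("moving replicas off %s: %w", pod, err)
		}
	}
	if n := m.ReplicasOn(pod); n > 0 {
		return fmt.Errorf("%s has %d replicas: %w", pod, n, ErrStillHostingReplicas)
	}
	return deletePod(pod)
}

// fakeCluster is an in-memory stand-in used only to exercise the sketch.
type fakeCluster struct{ replicas map[string]int }

func (f *fakeCluster) ReplicasOn(pod string) int { return f.replicas[pod] }

func (f *fakeCluster) MoveReplicasOff(pod string) error {
	// Assumption for the demo: every replica lands on the pod named "solr-0".
	f.replicas["solr-0"] += f.replicas[pod]
	f.replicas[pod] = 0
	return nil
}

func main() {
	c := &fakeCluster{replicas: map[string]int{"solr-0": 2, "solr-1": 3}}
	err := decommissionPod(c, "solr-1", func(pod string) error {
		fmt.Println("deleting", pod)
		return nil
	})
	fmt.Println("error:", err, "replicas on solr-0:", c.ReplicasOn("solr-0"))
}
```

The point of the second check in decommissionPod is that deletion is refused whenever replicas remain on the pod, so a failed or partial move cannot leave the cluster state unhealthy.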

Right now, the safest way of ensuring this is to scale down one pod at a time. The current Solr REPLACENODE API does not accept a list of nodes to put the new replicas on. Therefore, if we were trying to remove the last two pods in the cluster at the same time, we couldn't ensure that the replicas of one decommissioned pod don't end up on the other decommissioned pod.
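A rough Go sketch of the one-pod-at-a-time plan follows. It assumes the pods belong to a StatefulSet, so they are named name-ordinal and the highest ordinal is removed first. The names scaleDownSteps, replaceNodeURL, and the sample values (example-solrcloud, the node name, the async ID) are made up for illustration. The REPLACENODE request uses only the sourceNode and async parameters and leaves the target unset, so Solr chooses where the replicas go. The scale-to-zero case is treated as a separate exception, so the sketch plans nothing for it.

```go
package main

import (
	"fmt"
	"net/url"
)

// step describes one iteration of a managed scale down.
type step struct {
	EvacuatePod   string // pod whose replicas must be moved off first
	ReplicasAfter int    // StatefulSet size once that pod has been removed
}

// scaleDownSteps plans a scale down from current to desired pods, one pod at
// a time, highest ordinal first. Scaling to zero is deliberately not planned
// here because there is no remaining pod to receive the replicas.
func scaleDownSteps(name string, current, desired int) []step {
	if desired <= 0 {
		return nil
	}
	var steps []step
	for n := current; n > desired; n-- {
		steps = append(steps, step{
			EvacuatePod:   fmt.Sprintf("%s-%d", name, n-1),
			ReplicasAfter: n - 1,
		})
	}
	return steps
}

// replaceNodeURL builds a Collections API REPLACENODE request for a single
// source node. No target node is given, so Solr picks the destinations.
func replaceNodeURL(baseURL, sourceNode, asyncID string) string {
	q := url.Values{}
	q.Set("action", "REPLACENODE")
	q.Set("sourceNode", sourceNode)
	q.Set("async", asyncID)
	return baseURL + "/solr/admin/collections?" + q.Encode()
}

func main() {
	// Hypothetical cluster name and sizes.
	for _, s := range scaleDownSteps("example-solrcloud", 5, 3) {
		fmt.Printf("evacuate %s, then set replicas to %d\n", s.EvacuatePod, s.ReplicasAfter)
	}
	// Hypothetical base URL, node name, and async request ID.
	fmt.Println(replaceNodeURL("http://localhost:8983", "solr-2:8983_solr", "req-1"))
}
```

Because each step shrinks the StatefulSet by exactly one pod before the next evacuation starts, the pod being emptied is never a candidate destination for replicas from another pod that is also about to be removed.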

However, there is an exception if the cluster is scaling down to 0 pods. There are a couple of things we could do in this case:

I say that in the beginning we just use the second option, as I don't think it will be a popular thing to do anyways, and we can always add in the deletion of all data at another time.

HoustonPutman commented 1 year ago

Note: this ticket will create the first "lockable" cluster operation. https://github.com/apache/solr-operator/issues/560