Support disruption free rolling restart

apache / solr-operator

Official Kubernetes operator for Apache Solr

https://solr.apache.org/operator

Apache License 2.0

Support disruption free rolling restart #529

Closed: janhoy closed this issue 1 year ago

janhoy commented 1 year ago

As discussed in Slack: https://apachesolr.slack.com/archives/C022UMAPZ0V/p1676970790552379

When the operator restarts the cluster, e.g. during a version upgrade, there is no guarantee that a Solr pod is marked as not ready before `solr stop` is called. Thus clients may experience connection errors during the restart.

@HoustonPutman suggests implementing a custom readiness gate (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate) to control this better.
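
To illustrate the readiness-gate idea: Kubernetes only considers a pod ready when its containers are ready and every condition listed under `readinessGates` is `True`, so the operator could flip a gate to `False` before stopping Solr. Below is a minimal Python model of that ordering, not operator code. The `Pod` class, `restart_steps` and the condition type `example.org/solr-not-stopped` are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical condition type; the name a real operator uses may differ.
NOT_STOPPED = "example.org/solr-not-stopped"


@dataclass
class Pod:
    containers_ready: bool
    readiness_gates: List[str] = field(default_factory=list)
    # Reported pod conditions: condition type -> "True" / "False"
    conditions: Dict[str, str] = field(default_factory=dict)


def is_ready(pod: Pod) -> bool:
    """A pod is ready only if its containers are ready and every gate is "True".

    A gate whose condition has not been reported counts as not satisfied.
    """
    return pod.containers_ready and all(
        pod.conditions.get(gate) == "True" for gate in pod.readiness_gates
    )


def restart_steps(pod: Pod) -> List[str]:
    """Record the order of operations for restarting one pod."""
    log = []
    pod.conditions[NOT_STOPPED] = "False"   # 1. flip the gate first
    log.append(f"ready={is_ready(pod)}")    # pod is now not ready
    log.append("solr stop")                 # 2. only then stop Solr
    return log
```

In a real cluster there would also be a wait between the two steps, so that the endpoint change propagates before Solr stops; that part is left out here.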

HoustonPutman commented 1 year ago

@janhoy we should also create a Solr JIRA issue for this, to fix Cloud-aware clients and internal shard requests.

More info: We can fix this for simple use cases, where every collection in the cloud is single-sharded and has a replica on every node. In that case, Solr never needs to forward a request to another node internally. If a collection is multi-sharded, or does not have a replica on every node, then Solr might have to forward requests throughout the cluster. Solr is not aware of the podConditions we are using to solve this in Kubernetes, so we need to think of another solution to fix this inside of Solr.
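
As a rough way to check whether a cloud falls into the "simple" case described above, here is a small Python sketch. It assumes the cluster layout has already been read into a plain dictionary; the `ClusterState` shape and the `serves_locally` name are hypothetical and not part of Solr or the operator.

```python
from typing import Dict, Set

# Hypothetical layout: collection -> shard -> names of nodes hosting a replica
ClusterState = Dict[str, Dict[str, Set[str]]]


def serves_locally(state: ClusterState, nodes: Set[str]) -> bool:
    """True if every collection is single-sharded with a replica on every node.

    Only then can any node answer any query without forwarding it internally.
    """
    for shards in state.values():
        if len(shards) != 1:
            return False  # multi-sharded: requests fan out across nodes
        (replica_nodes,) = shards.values()
        if not nodes <= replica_nodes:
            return False  # some node has no local replica of this collection
    return True
```

If this returns `False`, taking a pod out of the Service endpoints alone does not keep internal shard requests away from it.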

In the meantime #530 is a great start.

janhoy commented 1 year ago

> @janhoy we should also create a Solr JIRA issue for this, to fix Cloud-aware clients and internal shard requests.

Sure, I can create one. Do you have a clear idea of how it would work? Currently, SolrJ combines the collection state with live_nodes to decide which replicas to query. Would we need some new per-node-state znode in ZooKeeper to flag a node as "draining", and then let SolrJ act on that?

HoustonPutman commented 1 year ago

Not a clear idea yet.

> Would we need some new per-node-state znode in ZooKeeper to flag a node as "draining", and then let SolrJ act on that?

That would work, but I'm not sure we'd want to restrict it to just "draining". We might want to send requests elsewhere for other reasons too.
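
A rough Python sketch of what such a per-node flag could mean for replica selection, assuming the client already has the replica list and live_nodes in memory. This is a simplified model, not SolrJ code: `Replica`, `queryable_replicas` and the `avoid` mapping (node name to a free-form reason, so it is not limited to "draining") are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class Replica(NamedTuple):
    name: str
    node: str
    state: str  # e.g. "active", "down", "recovering"


def queryable_replicas(
    replicas: List[Replica],
    live_nodes: Set[str],
    avoid: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Names of replicas a client may query.

    A replica qualifies if it is active, its node is live, and its node is
    not flagged in `avoid` (node -> reason, e.g. "draining").
    """
    avoid = avoid or {}
    return [
        r.name
        for r in replicas
        if r.state == "active" and r.node in live_nodes and r.node not in avoid
    ]
```

With no flags set this reduces to the collection-state plus live_nodes check; flagging a node removes its replicas from the candidates while the node is still live.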

janhoy commented 1 year ago

https://issues.apache.org/jira/browse/SOLR-16722