Closed by janhoy 1 year ago
@janhoy we should also create a Solr JIRA issue for this, to fix Cloud-aware clients and internal shard requests.
More info: We can fix this for the simple case where every collection in the cloud is single-sharded and has a replica on every node. In that case, Solr never needs to forward a request to another node internally. If a collection is multi-sharded, or lacks a replica on some node, then Solr may have to forward requests throughout the cluster. Solr is not aware of the podConditions we are using to solve this in Kubernetes, so we need to think of another solution for this inside of Solr.
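To illustrate the distinction, here is a minimal sketch of the "can this node answer locally?" check described above. The class and method names are illustrative, not Solr APIs: local serving is only guaranteed when the collection has a single shard and that shard has a replica on every node (so the request can land anywhere), and the receiving node hosts one of those replicas.

```java
import java.util.Map;
import java.util.Set;

public class ReplicaRouting {
    /**
     * Hypothetical check, not a Solr API.
     *
     * @param shardToNodes map of shard name -> nodes hosting a replica of that shard
     * @param liveNodes    all live nodes in the cluster
     * @param localNode    the node that received the request
     */
    public static boolean canServeLocally(Map<String, Set<String>> shardToNodes,
                                          Set<String> liveNodes,
                                          String localNode) {
        // A multi-sharded collection always requires a distributed request.
        if (shardToNodes.size() != 1) return false;
        Set<String> replicaNodes = shardToNodes.values().iterator().next();
        // The single shard must have a replica on every live node,
        // and the receiving node must host one of them.
        return replicaNodes.containsAll(liveNodes) && replicaNodes.contains(localNode);
    }
}
```

With two shards, or with a node that hosts no replica, the check fails and Solr would have to forward the request, which is exactly the case the Kubernetes-side fix cannot cover.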
In the meantime #530 is a great start.
> @janhoy we should also create a Solr JIRA issue for this, to fix Cloud-aware clients and internal shard requests.
Sure, I can create one. Do you have a clear idea of how it would work? Today, SolrJ considers collection state combined with `live_nodes` to decide which replicas to query. Would we need some new per-node-state znode in ZooKeeper to flag a node as "draining", and then let SolrJ act on that?
Not a clear idea yet.
> Would we need some new per-node-state znode in Zookeeper to flag a node as "draining", and then let SolrJ act on that?
That would work, but I'm not sure we'd want to restrict it to just "draining". We might want to send requests elsewhere for other reasons too.
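To make the idea concrete, here is a hedged sketch of how a cloud-aware client might combine `live_nodes` with a per-node "draining" flag when choosing replicas. No such znode or API exists in Solr today; the names below are hypothetical, and `drainingNodes` stands in for whatever node-state mechanism would be added.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DrainAwareRouting {
    /**
     * Hypothetical replica selection: keep only replicas whose node is
     * live and not flagged as draining (or otherwise marked "avoid").
     */
    public static List<String> selectReplicas(List<String> replicaNodes,
                                              Set<String> liveNodes,
                                              Set<String> drainingNodes) {
        return replicaNodes.stream()
                .filter(liveNodes::contains)
                .filter(node -> !drainingNodes.contains(node))
                .collect(Collectors.toList());
    }
}
```

A more general design, per the comment above, would replace the boolean "draining" flag with a node-state value so requests can be routed away for other reasons too.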
As discussed in Slack: https://apachesolr.slack.com/archives/C022UMAPZ0V/p1676970790552379
When the operator restarts the cluster, e.g. during a version upgrade, there is no guarantee that a Solr pod is marked as not ready before `solr stop` is called. Thus clients may experience connection errors during the restart. @HoustonPutman suggests we can implement a custom readiness gate (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate) to control this better.