apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0

Use Solr's /admin/info/health for pod readiness checks #629

Closed gerlowskija closed 9 months ago

gerlowskija commented 9 months ago

Now that the operator expects Solr versions >=8.11, we can use Solr's /admin/info/health endpoint as our default readiness probe. This change also means that the operator now defaults to different Solr endpoints for liveness vs. readiness probes, which satisfies what some consider a best practice.

HoustonPutman commented 9 months ago

Note, this is a part of #616

Also, I think it would be good to keep the /system handler in use for the liveness probe (since that says whether Solr is running or not), and use the /health handler for the readiness probe. Some Kubernetes environments like these probes to be different. I also see a problem where Solr is restarted on GC pauses (causing ZK live node issues), when a restart would only make the problem worse. See #504 for some more information.
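As a rough illustration of the split described above, here is a minimal Python sketch that builds the two probe definitions as plain dicts shaped like Kubernetes HTTP probes. The full paths (`/solr/admin/info/system`, `/solr/admin/info/health`), the port `8983`, and all timing values are assumptions for illustration, not the operator's actual defaults, and `http_probe` is a hypothetical helper.

```python
import json

SOLR_PORT = 8983  # assumption: Solr's default HTTP port


def http_probe(path, period_seconds, timeout_seconds, failure_threshold):
    """Build a dict shaped like a Kubernetes HTTP probe (hypothetical helper)."""
    return {
        "httpGet": {"path": path, "port": SOLR_PORT},
        "periodSeconds": period_seconds,
        "timeoutSeconds": timeout_seconds,
        "failureThreshold": failure_threshold,
    }


# Liveness asks "is the Solr process running at all?".
# When it fails, Kubernetes restarts the container.
# Timing values here are illustrative only.
liveness_probe = http_probe("/solr/admin/info/system", 10, 5, 3)

# Readiness asks "should this pod receive traffic right now?".
# When it fails, the pod is only removed from Service endpoints, not restarted.
readiness_probe = http_probe("/solr/admin/info/health", 10, 5, 3)

print(json.dumps(
    {"livenessProbe": liveness_probe, "readinessProbe": readiness_probe},
    indent=2,
))
```

Keeping the stricter check on the readiness side means a node that is temporarily unhealthy stops receiving traffic without being restarted, which is the behavior wanted for the GC-pause case.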

gerlowskija commented 9 months ago

Ah ok, makes sense. I didn't realize there'd been some discussion around differentiating them - will take a second pass for that shortly.

gerlowskija commented 9 months ago

Ah, good call - done!

gerlowskija commented 9 months ago

Thinking on this a bit more - there is at least one downside to using /admin/info/health for readiness that I'm not sure has been raised.

Unlike /system, /health returns an error on nodes with bad ZK connections, which prevents them from receiving traffic if ZK goes down. That makes sense for a lot of traffic, like admin requests or updates: there's no point sending any of those requests to a SolrCloud node that has no ZK. But it'll also block other traffic that would've succeeded, like queries, which Solr can sometimes continue to serve even without ZK.
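To make the trade-off concrete, here is a small Python sketch that models it. It is a deliberately simplified model, not Solr's implementation: it assumes /system succeeds whenever Solr is running and /health additionally requires a ZK connection. `NodeState`, `system_ok`, `health_ok`, and `routing` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    solr_running: bool
    zk_connected: bool


def system_ok(state):
    # Simplified assumption: /system only reflects whether Solr is up.
    return state.solr_running


def health_ok(state):
    # Simplified assumption: /health also fails when the ZK connection is bad.
    return state.solr_running and state.zk_connected


def routing(state, readiness_check):
    """Describe what happens to a pod given the check used for readiness."""
    if readiness_check(state):
        return "receives traffic"
    return "removed from endpoints"


# A node that is up but has lost ZK: the case discussed above.
no_zk = NodeState(solr_running=True, zk_connected=False)

for name, check in (("/system", system_ok), ("/health", health_ok)):
    print(f"readiness via {name}: node without ZK {routing(no_zk, check)}")
```

With /health as the readiness check, the ZK-less node drops out of rotation entirely, so it also stops receiving the queries it might still have been able to serve.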

Overall I think the additional intelligence and fidelity that /health gives us as a readiness probe is worth this slight degradation in how we'd route traffic in response to a total catastrophe. Just wanted to call it out in case others disagree. I'll add a sentence or two documenting this aspect of things and then merge, but I'll leave this out here for a few more days to give folks a chance to chime in.