gerlowskija closed 9 months ago
Note: this is part of #616.
Also, I think it would be good to keep the /system handler in use for the liveness probe (since that says whether Solr is running or not), and use the /health handler for the readiness probe. Some Kubernetes environments like these probes to be different. I also see a problem where Solr is restarted on GC pauses (causing ZK live node issues), when a restart would only make the problem worse. See #504 for some more information.
Ah ok, makes sense. I didn't realize there'd been some discussion around differentiating them - will take a second pass for that shortly.
Ah, good call - done!
Thinking on this a bit more - there is at least one downside to using /admin/info/health for readiness that I'm not sure has been raised.
Unlike /system, /health will return an error on nodes with bad ZK connections, which prevents them from receiving traffic if ZK goes down. That makes sense for lots of traffic, like admin requests or updates - there's no point sending any of those requests to a SolrCloud node that has no ZK. But it'll also block other traffic that would've succeeded - like queries, which Solr can sometimes continue to serve even without ZK.
Overall I think the additional intelligence and fidelity that /health gives us as a readiness probe is worth this slight degradation in how we'd route traffic in response to a total catastrophe. Just wanted to call it out in case others disagree. I'll add a sentence or two documenting this aspect of things and then merge, but I'll leave this out here for a few more days to give folks a chance to chime in.
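To make the tradeoff concrete, here's a rough sketch of how the two probes would play out for a node in different states. It's a toy model, not operator code: the function names are made up, and it assumes /system succeeds whenever the Solr process responds while /health additionally requires a working ZK connection (the real handler may check more than that). The Kubernetes side is the standard behavior: a failed liveness probe restarts the container, a failed readiness probe only pulls the pod out of Service endpoints.

```python
def probe_results(process_responding, zk_connected):
    """Model the two Solr endpoints under the simplifying assumptions above."""
    liveness_ok = process_responding                     # /system: is Solr up at all?
    readiness_ok = process_responding and zk_connected   # /health: up AND ZK reachable
    return liveness_ok, readiness_ok


def kubelet_reaction(process_responding, zk_connected):
    """Translate probe results into what Kubernetes does with the pod."""
    liveness_ok, readiness_ok = probe_results(process_responding, zk_connected)
    return {
        "restart_container": not liveness_ok,  # failed liveness -> restart
        "receives_traffic": readiness_ok,      # failed readiness -> no Service traffic
    }


if __name__ == "__main__":
    # ZK outage: the node is left running (no restart), but it also stops
    # receiving queries it might still have been able to serve.
    print(kubelet_reaction(process_responding=True, zk_connected=False))
```

The ZK-outage case is the one called out above: the node isn't restarted, which is what we want, but it's also cut off from query traffic.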
Now that the operator expects Solr versions >=8.11, we can use Solr's /admin/info/health endpoint as our default readiness probe. This change also means that the operator now defaults to different Solr endpoints for liveness vs readiness probes, which satisfies what some consider a best practice.
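For illustration, here's a minimal sketch of what the split defaults could look like, written as plain Python dicts that mirror the shape of a Kubernetes `httpGet` probe. The helper name, the full `/solr/...` paths, and port 8983 (Solr's default) are assumptions for the example; the operator's actual generated probes may differ.

```python
import json

SOLR_PORT = 8983  # Solr's default port; assumed here, adjust for your pods


def default_probes(port=SOLR_PORT):
    """Return liveness and readiness probes that hit different Solr endpoints."""
    def http_get(path):
        return {"httpGet": {"path": path, "port": port, "scheme": "HTTP"}}

    return {
        # Liveness: only asks whether Solr is running, so a ZK outage
        # does not trigger a container restart.
        "livenessProbe": http_get("/solr/admin/info/system"),
        # Readiness: the health handler also reflects ZK connectivity,
        # so unhealthy nodes stop receiving traffic.
        "readinessProbe": http_get("/solr/admin/info/health"),
    }


if __name__ == "__main__":
    print(json.dumps(default_probes(), indent=2))
```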