I don't know how relevant a readiness probe is at all for the Historical. The Historical will be ready as soon as it registers itself with the brokers and the coordinator, and query requests will be routed to it even if the readiness probe returns not-ok, since service discovery happens through ZooKeeper and not through k8s.
Despite using ZooKeeper, if brokers start routing queries to Historical nodes that have not yet loaded all their segments, I agree that the Historical's readiness probe must either be removed or mimic the behavior of a hypothetically correct liveness probe.
Anyway, I guess this can be discussed in a separate GitHub issue, as this topic deviates a bit from the reported bug.
@layoaster Not sure if it's still relevant, but you can use /druid/historical/v1/readiness in the startupProbe, so until the Historical is ready to serve queries it will not be in the k8s ready state. The liveness check can use /status/health. In my opinion there is no need for a readiness probe, since the broker sends queries to a Historical only after it has loaded all its segments, and this is not related to the k8s Service at all.
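For illustration, a minimal sketch of the probe layout described above, assuming the Historical listens on its default plaintext port 8083; the ports and threshold values are assumptions to adjust to your deployment:

```yaml
# Sketch only: container-level probes for a Druid Historical pod.
# Port 8083 is the default Historical plaintext port; adjust if overridden.
startupProbe:
  httpGet:
    path: /druid/historical/v1/readiness
    port: 8083
  periodSeconds: 10
  failureThreshold: 60   # allows up to ~10 minutes for segment loading
livenessProbe:
  httpGet:
    path: /status/health
    port: 8083
  periodSeconds: 10
  failureThreshold: 3
```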
@pjain1 That seems like a temporary workaround. Still, IMHO the current behavior of the liveness probe /status/health is not correct (or at least unexpected) ...
I'm already using another workaround: adjusting the failure threshold and other settings of the liveness probe to cover the initialization period I'm currently experiencing (sketched after this comment).
This is not a critical issue, but I think it is worth revisiting the current implementation of the /status/health probe ...
As for the readiness probe, I agree that it might be redundant, as per @abhishekagarwal87's note.
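For reference, a rough sketch of the threshold-based workaround described above: stretching the liveness probe so it tolerates the segment-loading window. The numbers are illustrative assumptions, not recommendations:

```yaml
# Sketch of the workaround: relax the liveness probe to cover a >5 min
# segment load on startup. Port and values are assumptions.
livenessProbe:
  httpGet:
    path: /status/health
    port: 8083
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 15   # 60s + 15 * 30s ~ 8.5 minutes before a restart
```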
@layoaster I think there is some confusion here. You don't need to adjust thresholds etc. If you have a startupProbe at /druid/historical/v1/readiness, then k8s only starts using the liveness probe once that startupProbe succeeds. The endpoint returns OK only after all segments are loaded from the local cache and the Historical is ready to serve queries; Historicals also announce themselves to brokers only after this. So you can use a liveness probe at /status/health with normal thresholds. I hope it's clear now.
You just need to adjust the startup probe thresholds to match your Historical start times, which depend on how much data is loaded on the node. The startup probe exists for exactly this purpose; see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
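As a rough guide, the time budget the startup probe grants is approximately failureThreshold * periodSeconds (plus any initialDelaySeconds), so it can be sized to the slowest Historical. A hedged example sized for a worst case of about 30 minutes of segment loading (the numbers are assumptions, not Druid or k8s defaults):

```yaml
# Sketch: size the startup probe to the worst-case segment-load time.
# Budget ~ failureThreshold * periodSeconds; here 120 * 15s = 30 minutes.
startupProbe:
  httpGet:
    path: /druid/historical/v1/readiness
    port: 8083
  periodSeconds: 15
  failureThreshold: 120
```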
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
Affected Version
Druid 27.0.0
Description
I run Druid on a Kubernetes cluster and found out that when restarting a Historical node (rolling upgrades), the liveness probe does not respond until the Historical has fully loaded all the segments in its cache (a k8s Persistent Volume). Loading the segments from the cache (disk) takes more than 5 minutes in my cluster because there are more than 28k segments per Historical.
I believe the liveness probe /status/health should respond with a 200 as soon as the process is up and reachable over the network, regardless of its initialization status. Reporting on whether the node has finished initializing and loading segments from the cache and deep storage is the purpose of the readiness probe /druid/historical/v1/readiness.
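A minimal sketch of the probe split this report argues for, i.e. liveness answering as soon as the process is reachable and readiness gated on segment loading. The port and threshold values are assumptions (8083 being the default Historical plaintext port), and the liveness comment describes the expected behavior reported here, not the current one:

```yaml
# Sketch of the probe split proposed in this report (assumed port/values).
livenessProbe:           # expected to pass as soon as the process is reachable
  httpGet:
    path: /status/health
    port: 8083
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:          # expected to pass only after all segments are loaded
  httpGet:
    path: /druid/historical/v1/readiness
    port: 8083
  periodSeconds: 10
  failureThreshold: 3
```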