apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Historical's liveness probe behaving as a readiness probe #15546

Closed: layoaster closed this issue 3 days ago

layoaster commented 11 months ago

Affected Version

Druid 27.0.0

Description

I run Druid on a Kubernetes cluster and found that when restarting a Historical node (rolling upgrades), the liveness probe does not respond until the Historical has fully loaded all the segments from its segment cache (a k8s Persistent Volume). Loading the segments from the cache (disk) takes more than 5 minutes in my cluster because there are more than 28k segments per Historical.

I believe the liveness probe /status/health should respond with a 200 as soon as the process is up and reachable (network), regardless of its initialization status.

Reporting how long it takes to initialize and load segments from the cache and deep storage is the purpose of the readiness probe /druid/historical/v1/readiness.
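For reference, this is roughly the probe layout in question. It is a sketch, not my exact manifest; port 8083 is assumed to be the Historical's default plaintext port and the thresholds are illustrative:

```yaml
# Historical container probes (illustrative values)
livenessProbe:
  httpGet:
    path: /status/health
    port: 8083
  periodSeconds: 10
  failureThreshold: 3        # ~30s budget, far less than the 5+ min segment-cache load
readinessProbe:
  httpGet:
    path: /druid/historical/v1/readiness
    port: 8083
  periodSeconds: 10
  failureThreshold: 3
```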

abhishekagarwal87 commented 11 months ago

I'm not sure how relevant a readiness probe is for a Historical at all. The Historical will be ready as soon as it registers itself with the Brokers and the Coordinator. Query requests will be routed to the Historical even if the readiness probe returns not-ok, since service discovery happens through ZooKeeper and not through k8s.

layoaster commented 11 months ago

Even with ZooKeeper in the picture, if Brokers start routing queries to Historical nodes that have not yet loaded all their segments, then I agree that the Historical readiness probe should either be removed or mimic the behavior of a hypothetically correct liveness probe.

Anyway, I guess this can be discussed in a separate GitHub issue, as this topic deviates a bit from the reported bug.

pjain1 commented 10 months ago

@layoaster Not sure if it's still relevant, but you can use /druid/historical/v1/readiness in the startupProbe, so the Historical will not be in the k8s ready state until it is ready to serve queries. The liveness check can use /status/health. In my opinion there is no need for a readiness probe, since a Broker sends queries to a Historical only after it has loaded all its segments, and that is not related to the k8s Service at all.

layoaster commented 10 months ago

@pjain1 That seems like a temporary workaround. Still, IMHO the current behavior of the /status/health liveness probe is not correct (or at least unexpected) ...

I'm already using another workaround: adjusting the failure threshold and other settings of the liveness probe to cover the initialization period I'm currently experiencing.
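Roughly, the workaround just stretches the liveness budget past the segment-cache load time. The values below are illustrative, not a recommendation, and assume the default Historical port 8083:

```yaml
# Liveness probe with a widened failure budget (~10 min of grace before a restart)
livenessProbe:
  httpGet:
    path: /status/health
    port: 8083
  initialDelaySeconds: 60
  periodSeconds: 15
  failureThreshold: 40
```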

This is not a critical issue, but I think it is worth revisiting the current implementation of the /status/health probe ...

As for the readiness probe, I agree that it might be redundant, as per @abhishekagarwal87's note.

pjain1 commented 10 months ago

@layoaster I think there is some confusion here. You don't need to adjust thresholds, etc. If you have a startupProbe at /druid/historical/v1/readiness, then k8s will only start using the liveness probe once that startupProbe succeeds. The endpoint returns OK only after all segments are loaded from the local cache and the Historical is ready to serve queries; a Historical also announces itself to the Brokers only after this point. So you can use a liveness probe at /status/health with normal thresholds. I hope it's clear now.

You just need to adjust the startup probe thresholds to match your Historical's start times, which will depend on how much data is loaded on it. The startup probe was created for exactly this purpose; see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
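A sketch of that layout, with illustrative numbers (size the startupProbe window to your own segment-cache load time; port 8083 is assumed as the Historical's plaintext port):

```yaml
# Gate the liveness probe behind a startup probe on the readiness endpoint
startupProbe:
  httpGet:
    path: /druid/historical/v1/readiness
    port: 8083
  periodSeconds: 10
  failureThreshold: 60     # allows up to ~10 min for segment-cache loading
livenessProbe:
  httpGet:
    path: /status/health
    port: 8083
  periodSeconds: 10
  failureThreshold: 3      # normal thresholds; only active once the startupProbe succeeds
```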

github-actions[bot] commented 1 month ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

github-actions[bot] commented 3 days ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.