datastax / pulsar-helm-chart

Apache Pulsar Helm chart
Apache License 2.0

use static page for broker liveness probe #265

Open pgier opened 1 year ago

pgier commented 1 year ago

This brings the broker liveness probe in sync with the community Helm chart, and it should use fewer resources than the health_check script. In Astra Streaming we switched to this liveness probe instead of the metrics endpoint because we were getting a lot of errors in the broker logs (https://github.com/riptano/astra-streaming/pull/513).
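For reference, a static-page liveness probe on the broker container might look roughly like the sketch below. The `/status.html` path and port 8080 are assumptions based on a default broker setup, and the timing values are placeholders rather than the values from this PR.

```yaml
# Sketch only: path, port, and timings are assumed, not taken from this PR.
livenessProbe:
  httpGet:
    path: /status.html   # static status page served by the broker (assumed path)
    port: 8080           # broker HTTP port (assumed default)
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3    # restart only after three consecutive failures
```

An HTTP GET against a static page only shows that the broker's web server is responding, which is why it is cheaper than a check that produces and consumes messages.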

pgier commented 1 year ago

@cdbartholomew PTAL. This is how the Apache community broker is configured and we're doing the same in Astra Streaming.

pgier commented 1 year ago

In Astra we were previously seeing brokers restart regularly because the health check was failing, possibly incorrectly (maybe @zzzming has more info). So we switched to the metrics endpoint, but then we saw brokers stuck in a running-but-not-ready state. That issue seems much better in the 2.10 versions we're currently running. We have since switched from the metrics endpoint to the static page in Astra, and it has been fine for the past couple of weeks.

michaeljmarshall commented 1 year ago

> In Astra we were previously seeing brokers restart regularly because the health check was failing, possibly incorrectly (maybe @zzzming has more info). So we switched to the metrics endpoint, but then we saw brokers stuck in a running-but-not-ready state. That issue seems much better in the 2.10 versions we're currently running. We have since switched from the metrics endpoint to the static page in Astra, and it has been fine for the past couple of weeks.

It'd be really helpful to know why the health check was failing. Another side effect of this change is that a pod could fail its readiness probe without failing the liveness probe, so it would stay not-ready without being restarted. That can lead to problems with DNS lookups when the brokers are deployed as a StatefulSet.
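To illustrate the DNS concern: the per-pod DNS records of a StatefulSet come from its headless Service, which by default publishes only the addresses of ready pods. One possible mitigation, sketched below, is `publishNotReadyAddresses`. The Service name, selector label, and ports are hypothetical and not taken from this chart.

```yaml
# Sketch only: name, label, and ports are hypothetical, not from this chart.
apiVersion: v1
kind: Service
metadata:
  name: pulsar-broker             # hypothetical Service name
spec:
  clusterIP: None                 # headless Service backing the StatefulSet
  publishNotReadyAddresses: true  # keep DNS records for pods that are not ready
  selector:
    component: broker             # hypothetical pod label
  ports:
    - name: http
      port: 8080
    - name: pulsar
      port: 6650
```

Whether this is appropriate depends on whether clients should be able to resolve a broker that is running but not ready.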

pgier commented 1 year ago

Part of the issue was that the healthcheck topics would build up a very large backlog. Maybe the healthcheck was timing out and not acknowledging messages, and this was causing it to fail?

michaeljmarshall commented 1 year ago

Do we have an issue opened in the upstream project? That sounds like a bug.

pgier commented 1 year ago

@michaeljmarshall I think the issue was fixed in 2.10. At least we haven't seen it in the last couple months. Maybe we need a new endpoint specific to the liveness check?

michaeljmarshall commented 1 year ago

A dedicated liveness check could make sense. We'd just need to find the right things to check. I thought about this a few months ago, but I didn't come up with a good solution. Maybe it is worth a discussion on the dev list to ask "when is a broker alive and when is it ready?"