Closed by walterddr 1 year ago
We need this check in our environment. Please do not remove it.
Instead, you can use this config variable (set it to 0) to disable it altogether.
OK, yes, that would change the semantics of a REST API.
Counter proposal:

- Keep the existing `/health` API behavior.
- `/health/instance`: API that strips the table/segment status callback from the server health check.
- `/health/services`: API similar to the PinotServiceManagerHealthCheck REST API, returning a map of serviceStatusCallback values.
@mcvsubbu @mayankshriv @npawar ^
The existing behavior of loading all segments was put in place to allow for rolling restarts and upgrades. With this behavior, we can use healthcheck after an upgrade or restart and rest assured that we don't compromise on availability. @walterddr how do you propose to solve this problem in the container environment?
I think the issue I am trying to solve here is orthogonal to the rolling restart problem, which is an intended restart. I am happy to create another issue to address that for the containerized environment.
This issue mainly discusses the unintentional restarts triggered by the containerized environment. Here I propose creating additional endpoints to retrieve "instance availability" and "data availability" separately. It shouldn't affect how the rolling restart issue is addressed, containerized or not.
We have the same problem where servers take too long to go healthy and the ASG recycles them forever. This applies to intentional restarts as well as unintentional ones. Rewording what you said to see if we agree: there does seem to be an issue here where the `/health` endpoint is doing too much.

I'm +1 for your counter proposal. We can use `/health/instance` for whatever instance/container manager is being used and leave `/health` as is for backwards compatibility.
+1 for new endpoint, yes
I'd suggest adding a new API with more information in the response. It is much more flexible this way, and different clients can choose how to parse the response. E.g. a liveness check can pass when the server is not `BAD`, while a readiness check can pass only when the server is `GOOD`.
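To illustrate the idea of clients parsing one rich response differently, here is a minimal sketch. The JSON shape (`serverStatus` field) is an assumption for illustration, not the actual Pinot API; the status values mirror Pinot's `GOOD`/`STARTING`/`BAD` service statuses:

```python
import json

# Hypothetical rich health response (shape is assumed, not the real API).
response = json.loads('{"serverStatus": "STARTING"}')

def is_live(status: str) -> bool:
    # Liveness: the process is considered alive as long as it is not BAD.
    return status != "BAD"

def is_ready(status: str) -> bool:
    # Readiness: only route queries once everything reports GOOD.
    return status == "GOOD"

status = response["serverStatus"]
print(is_live(status), is_ready(status))  # → True False (STARTING is live, not ready)
```

Each client keeps its own threshold logic, so the server does not need to anticipate every consumer's policy.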
One concern I have with a more complex API is that most load balancers only allow configuring the path and then look at the status code of the response. So having both `/health/instance` and `/health/services` like in the proposal makes the most sense to me.
@jadami10 What if we return different status codes for different statuses? The current API only supports `200` and `503`, which cannot represent all the available statuses.
One problem is that different clients might have different logic for parsing the Pinot service status, and fixing the return type reduces that flexibility. We don't want to add an extra new API for each different status-check logic.
I don't think your typical ELB parses that way. Any non-200 code is considered failing. And in this case we want a server that's up but downloading segments/catching up to be a 200 on one endpoint and non-200 on another.
@walterddr, do you plan to add healthcheck metrics for the server while you're in here,
HEALTHCHECK_OK_CALLS("healthcheck", true),
HEALTHCHECK_BAD_CALLS("healthcheck", true),
similar to what the broker and controller have?
IMO it is best to use different endpoints instead of the same endpoint with varying types of responses:

- `health/instance` for the liveness probe; use the normal `health` for the readiness probe.
- `health/service` to return the entire service status callback map.

Seems like this is causing some confusion, and addressing both issues at the same time also distracts the discussion. So I would say let's focus on the main purpose of the discussion: finding a way to let containerization software know whether a Pinot server is alive vs. ready for serving queries.

If there's a way to easily configure k8s or ELB to differentiate the two server states, I am OK with using just one endpoint.
Checked other systems, and it seems the health check is mostly used for liveness instead of readiness. Since we already use it as the readiness check, we can leave it as is for backward compatibility.
Based on the requirement, I'd suggest making the API more explicit about its purpose:

- `/health/liveness` for liveness
- `/health/readiness` for readiness

I feel using `instance` and `service` is not as clear.
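With explicit endpoints, each Kubernetes probe can point at its own path. A sketch of a pod spec fragment, assuming the server admin port 8097 (the port and timing values are illustrative, not prescribed by this proposal):

```yaml
livenessProbe:
  httpGet:
    path: /health/liveness
    port: 8097
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/readiness
    port: 8097
  # A generous failureThreshold lets a server that is still loading
  # many segments stay alive (not restarted) while it remains
  # unready for queries.
  failureThreshold: 60
  periodSeconds: 10
```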
sounds good. I will create these. ^ @jadami10 any follow up?
nope, those look great
Currently BaseServerStarter registers a status health check callback for all enabled tables on that server. It only returns health OK once all tables/segments are loaded.
This causes problems for servers with a large amount of tables/data to load when running in a container environment.
Proposal

- Keep the existing `/health` API behavior.
- `/health/liveness` for liveness
- `/health/readiness` for readiness

Reference

[OLD PROPOSAL]

- `/health/instance`: API that strips the table/segment status callback from the server health check.
- `/health/services`: API similar to the PinotServiceManagerHealthCheck REST API, returning a map of serviceStatusCallback values.

[OLDER PROPOSAL] Strip the table/segment status callback from the server health check endpoint and only check the server's own health; create a `/health/services/` endpoint that checks server + table resource health and only returns OK if everything is healthy.
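The intended split can be summarized as two status-to-HTTP-code mappings. A sketch of the semantics only (function names and status constants are assumptions, not the actual Pinot implementation):

```python
# Hypothetical mirror of Pinot's service status values.
GOOD, STARTING, BAD = "GOOD", "STARTING", "BAD"

def liveness_code(status: str) -> int:
    # /health/liveness: 200 whenever the process is up,
    # even while table/segment loading is still in progress.
    return 200 if status in (GOOD, STARTING) else 503

def readiness_code(status: str) -> int:
    # /health/readiness: 200 only once all table/segment
    # status callbacks report GOOD.
    return 200 if status == GOOD else 503

assert liveness_code(STARTING) == 200   # container manager keeps the pod
assert readiness_code(STARTING) == 503  # LB keeps queries away until ready
```

This keeps the load-balancer contract simple (path + status code) while still distinguishing "alive" from "ready to serve queries".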