Open codefromthecrypt opened 4 years ago
Currently the health check is implemented in: https://github.com/apache/incubator-pinot/blob/master/pinot-tools/src/main/java/org/apache/pinot/tools/service/api/resources/PinotServiceManagerHealthCheck.java
I think ServiceStatus.getServiceStatus();
will return the composite health check.
ServiceStatus
holds a MapBasedMultipleCallbackServiceStatusCallback
which contains the callback for all the components.
sounds like the right spot. main thing after that is the rather simple logic to scan them before deciding the ultimate HTTP status (200 vs 503) where if any status is not ready == 503 conversely iff they are all ready == 200
My initial thoughts was that user expects all the components to be up to consider the standalone process is healthy.
So I put the hight level /health
uses 200 for (all healthy) or 503(something is wrong).
Do you suggest to have some error code for partial failure or all failure? Also this is still missing the information of which component is in bad state.
Meanwhile users can use /health/services
to get all health checks for all the components.
I don't think we need a code difference on partial fail. still use 503. if someone wants to they can see the partial fail in the json of the response.
Sounds good, then I think current code behavior is already as you described?
Have you observed the situation that /health
returns 200 but /health/services
give some components failure?
no I haven't seen partial failure scenario using this. I will switch to it. meanwhile, it might be work back filling a test that proves PinotServiceManagerHealthCheck gives 503 on partial failure or if already does, link that and close this out.
PS I have seen a partial failure, just not one using this. That said, there is code now that checks exit status at startup, and the partial failure I saw was at bootstrap (error extracting the swagger stuff)
oh? so swagger fail the component? or you saw swagger errors in the log?
it was a jar error ultimately during bootstrap which no longer exists (after extracting all of them). sadly I don't have a copy of the message. the surprise was that the listener still worked heh. I think this error will be less possible after recent commit which checks exception and boolean status strictly.
Per #5850 we should have a /health endpoint for our service state. Composite services like ServiceManager should report their individual components, manifesting an aggregate healthy IFF all are healthy. This helps ensure traffic doesn't pass through when things aren't finished booting up, or if they crash after the fact.
This was discussed in https://apache-pinot.slack.com/archives/CDRCA57FC/p1598498630061400
Make a composite health endpoint similar to this:
and in doing so it similar to discovering and calling each of https://github.com/apache/incubator-pinot/pull/5846
main thing is to represent the composite of its health so you can know if the process should be in service or not.
for example, i noticed one part of process fail in docker due to zip extraction maybe take too long no idea. still passes health check! that's bad as it fails other thing.
cc @daniellavoie @fx19880617