apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Make composite healthcheck for ServiceManager #5950

Open codefromthecrypt opened 4 years ago

codefromthecrypt commented 4 years ago

Per #5850 we should have a /health endpoint for our service state. Composite services like the ServiceManager should report their individual components, with the aggregate reported healthy iff all of them are healthy. This helps ensure traffic doesn't pass through when things aren't finished booting up, or if they crash after the fact.

This was discussed in https://apache-pinot.slack.com/archives/CDRCA57FC/p1598498630061400

Make a composite health endpoint similar to this:

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  @Path("/instances")
  @ApiOperation(value = "Get Pinot Instances Status")
  @ApiResponses(value = {@ApiResponse(code = 200, message = "Instance Status"), @ApiResponse(code = 500, message = "Internal server error")})
  // Collects the status of every instance currently managed by this ServiceManager.
  public Map<String, PinotInstanceStatus> getPinotAllInstancesStatus() {
    Map<String, PinotInstanceStatus> results = new HashMap<>();
    for (String instanceId : _pinotServiceManager.getRunningInstanceIds()) {
      results.put(instanceId, _pinotServiceManager.getInstanceStatus(instanceId));
    }
    return results;
  }

In doing so, it would be similar to discovering and calling each of the endpoints from https://github.com/apache/incubator-pinot/pull/5846.

The main thing is to represent the composite of their health so you can know whether the process should be in service or not, roughly like the sketch below.
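
Something along these lines could sit next to the snippet above. This is a minimal sketch only: it reuses _pinotServiceManager.getRunningInstanceIds() and getInstanceStatus() from the quoted code, but the getServiceStatus() accessor on PinotInstanceStatus, the ServiceStatus.Status.GOOD comparison, and the JAX-RS Response wiring are assumptions for illustration, not the real API.

  // Sketch: aggregate every instance's status into one HTTP code so a load
  // balancer knows whether this process should receive traffic.
  @GET
  @Produces(MediaType.APPLICATION_JSON)
  @Path("/health")
  @ApiOperation(value = "Composite health of all running instances")
  public Response getCompositeHealth() {
    Map<String, PinotInstanceStatus> results = new HashMap<>();
    boolean allGood = true;
    for (String instanceId : _pinotServiceManager.getRunningInstanceIds()) {
      PinotInstanceStatus status = _pinotServiceManager.getInstanceStatus(instanceId);
      results.put(instanceId, status);
      // Assumption: PinotInstanceStatus exposes its ServiceStatus.Status somehow.
      if (status.getServiceStatus() != ServiceStatus.Status.GOOD) {
        allGood = false;
      }
    }
    // 200 iff every component is ready, otherwise 503.
    return Response.status(allGood ? 200 : 503).entity(results).build();
  }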

For example, I noticed one part of the process fail in Docker, possibly because a zip extraction took too long (I'm not sure). It still passed the health check! That's bad, because it makes other things fail.

cc @daniellavoie @fx19880617

xiangfu0 commented 4 years ago

Currently the health check is implemented in: https://github.com/apache/incubator-pinot/blob/master/pinot-tools/src/main/java/org/apache/pinot/tools/service/api/resources/PinotServiceManagerHealthCheck.java

I think ServiceStatus.getServiceStatus() will return the composite health check. ServiceStatus holds a MapBasedMultipleCallbackServiceStatusCallback, which contains the callbacks for all the components.

codefromthecrypt commented 4 years ago

Sounds like the right spot. The main thing after that is the rather simple logic to scan them before deciding the ultimate HTTP status (200 vs 503): if any status is not ready, return 503; conversely, return 200 iff they are all ready.
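
A minimal sketch of that logic inside PinotServiceManagerHealthCheck, assuming ServiceStatus.Status treats GOOD as the only ready state; the Response wiring here is illustrative rather than the existing code.

  @GET
  @Path("/health")
  @Produces(MediaType.TEXT_PLAIN)
  public Response checkCompositeHealth() {
    // ServiceStatus.getServiceStatus() folds all registered component callbacks into one status.
    ServiceStatus.Status status = ServiceStatus.getServiceStatus();
    if (status == ServiceStatus.Status.GOOD) {
      return Response.ok("OK").build();
    }
    // Any component that is not ready makes the whole process report 503.
    return Response.status(503).entity(status.toString()).build();
  }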

xiangfu0 commented 4 years ago

My initial thought was that users expect all the components to be up before considering the standalone process healthy, so I made the high-level /health return 200 (all healthy) or 503 (something is wrong).

Are you suggesting a distinct error code for partial failure versus total failure? Also, this still misses the information about which component is in a bad state.

Meanwhile, users can use /health/services to get the health checks of all the components.

codefromthecrypt commented 4 years ago

I don't think we need a different code for partial failure; still use 503. If someone wants to, they can see the partial failure in the JSON of the response.

xiangfu0 commented 4 years ago

Sounds good. Then I think the current code already behaves as you described? Have you observed a situation where /health returns 200 but /health/services shows some component failing?

codefromthecrypt commented 4 years ago

No, I haven't seen a partial failure scenario using this; I will switch to it. Meanwhile, it might be worth backfilling a test that proves PinotServiceManagerHealthCheck gives 503 on partial failure, or, if it already does, linking that and closing this out. A rough shape for such a test is sketched below.
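
This is a sketch only: rather than wiring up PinotServiceManagerHealthCheck, it checks the 200-vs-503 rule through a hypothetical toHttpCode helper, and it assumes ServiceStatus.Status exposes GOOD, STARTING, and BAD values.

  import static org.junit.Assert.assertEquals;

  import java.util.Arrays;
  import java.util.List;
  import org.apache.pinot.common.utils.ServiceStatus.Status; // adjust to wherever ServiceStatus actually lives
  import org.junit.Test;

  public class CompositeHealthCheckTest {

    @Test
    public void compositeIs503WhenAnyComponentIsNotReady() {
      assertEquals(503, toHttpCode(Arrays.asList(Status.GOOD, Status.BAD)));
      assertEquals(503, toHttpCode(Arrays.asList(Status.GOOD, Status.STARTING)));
      assertEquals(200, toHttpCode(Arrays.asList(Status.GOOD, Status.GOOD)));
    }

    // Hypothetical helper mirroring the endpoint's rule: 200 iff every component is GOOD.
    private static int toHttpCode(List<Status> componentStatuses) {
      return componentStatuses.stream().allMatch(s -> s == Status.GOOD) ? 200 : 503;
    }
  }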

codefromthecrypt commented 4 years ago

PS: I have seen a partial failure, just not one using this. That said, there is code now that checks exit status at startup, and the partial failure I saw was at bootstrap (an error extracting the Swagger resources).

xiangfu0 commented 4 years ago

Oh? So did Swagger fail the component, or did you just see Swagger errors in the log?

codefromthecrypt commented 4 years ago

It was ultimately a jar error during bootstrap, which no longer exists (after extracting all of them). Sadly I don't have a copy of the message. The surprise was that the listener still worked, heh. I think this error will be less likely after the recent commit which checks the exception and boolean status strictly.