eclipse / microprofile-health

microprofile-health
Apache License 2.0
107 stars 61 forks source link

Failed health checks returning 503 instead of 500 #314

Closed nderwin closed 3 years ago

nderwin commented 3 years ago

While investigating https://github.com/smallrye/smallrye-health/issues/161, and comparing to the spec (https://github.com/eclipse/microprofile-health/blob/master/spec/src/main/asciidoc/protocol-wireformat.asciidoc#status-codes), it looks like if a (runtime) exception is thrown from a health check method, eventually the response should be a status of 500 instead of 503. In order to facilitate that, it also looks like a new enum value is needed for the status to represent the "undetermined" state.

There are already TCK tests throwing exceptions out of the health checks, but they are validating a response status of 503.

xstefank commented 3 years ago

@nderwin so sorry I didn't get to smallrye/smallrye-health#161 sooner. I will check it asap. But the specification is correct. 500 is reserved for unexpected processing errors that are not able to be handled in the health implementation (e.g., OOME) -- that is for responses that can't put the JSON response together. I don't think we have TCK test that throws RE. Can you point me to what you mean?

If you agree, can we close this issue and continue in sr-health please as this is specific to the implementation.

nderwin commented 3 years ago

@xstefank Maybe I was confusing the TCK test code between this project and smallrye-health, as I can't find it now in the TCK tests. I did find one in the smallrye test code (https://github.com/smallrye/smallrye-health/blob/main/implementation/src/test/java/io/smallrye/health/SmallRyeHealthReporterTest.java#L46), but the tests there are looking for a 503 response code. If I'm understanding correctly, the above test should verify a 500 response code.

As for this project, does the TCK need to be updated to add tests for this situation? If not, I guess we can close this issue.

xstefank commented 3 years ago

@nderwin yes, that test is specific to SR. Please see my response in the SR issue. I believe we shouldn't kill the application if health checks throw exceptions. But @pgunapal how do you handle it in OL please?

pgunapal commented 3 years ago

@xstefank In OL, whenever we can't construct the correct JSON response to return a 200 or a 503 or unable to process the health checks in the application, for example, an exception is thrown (RuntimeException), we catch the exception and return a 500 response code. I just read the specification and it seems to be confusing. In Appendix B of the protocol-wireformat, it says the Status should be Undetermined for HTTP Status 500, but in the follow on example for Status 500, we see that the status is DOWN for a Response code of 500. The comment also says Request process failed (i.e. error in procedure), which would mean if we can not process the health checks (regardless, of any occurred exceptions (including RuntimeExceptions) or unable to return a correct JSON response with code 200/503), it should return 500, with an Undetermined status. I think the example should be fixed to reflect the Appendix B table to clarify it, WDYT?

xstefank commented 3 years ago

@pgunapal yes, that will be only a typo in example. But I don't think that caught exception was ever meant to return 500. We should be aligned and always return 200/503 if we can construct JSON response. Please check also my response in the linked SR issue which was accepted by the user.

xstefank commented 3 years ago

Yes, so I guess that example was written by the same person that implemented this in SR. So I think we should be returning 503 if we can catch the exception to stay aligned with our 200/503 response. 500 will be reserved for generic errors that cannot be processed by the implementation, e.g. OOME. I believe that in such cases HTTP server generally responds with 500 (internal server error).

nderwin commented 3 years ago

@xstefank Alright, I'll go ahead and close this one, however, the spec might need a little more clarification to note that the documented 500 response is not intended to be produced by an MP implementation, rather it's a consequence of the application being health checked being not available to serve any meaningful response.

xstefank commented 3 years ago

@nderwin in every case, thank you for bringing this up in the spec. This is really unfortunate overlook from our side which needs clarification.

pgunapal commented 3 years ago

@xstefank I agree, the spec would need clarification on the 500.