Why health check an HTTP endpoint rather than the backend service's own TCP endpoint?

drnic commented 9 years ago

At Marco's suggestion, I'm looking at using switchboard for a service backend other than cf-mysql-release.

I see the health check system is hard-coded to test an HTTP endpoint (which I assume is separate from the service's TCP endpoint). Why?

What is wrong with the idea of health checking the backend service's active TCP connection?

Or how do you recommend that arbitrary backend services co-publish additional HTTP endpoints just for health checking? Is there an agent process that you are using that we can borrow?

I don't really want backend services to have to run an additional agent (it makes process monitoring hard inside Docker containers, for example), so I'm a little torn about this requirement for an additional HTTP endpoint for health checking.

cf-gitbot commented 9 years ago

We have created an issue in Pivotal Tracker to manage this. You can view the current status of your issue at: https://www.pivotaltracker.com/story/show/107044728.

menicosia commented 9 years ago

Hi @drnic,

At the time we designed switchboard, I think the idea of a sidecar seemed like a good way to go.

TCP connection checking can be prone to false positives: once an application has called bind(2) and listen(2), the OS will accept and queue incoming connections even if the application itself never calls accept(2). Up to a point, that is: the queue is bounded by the listen backlog.

Ideally you'd write something more advanced that actually talks to the application to determine whether the service is really alert and responsive. There are two places to implement that: as a plug-in to the load balancer (switchboard), or as a sidecar.

Docker containers are a whole different ball of wax. For containers, Diego is much more flexible, allowing TCP checks, HTTP checks, and even embedded health-check commands (see the Lattice docs). That's a very different context from the one switchboard was written for, where colocating a sidecar in a BOSH deployment isn't a big deal.

We could certainly entertain the idea of adding TCP checks to switchboard, again, presuming compelling use cases. :)

Marco Nicosia, Product Manager, Pivotal Software, Inc.

robdimsdale commented 9 years ago

The main reason we check a separate HTTP endpoint rather than the service's own TCP port is to allow for cases where health (and hence routability) is more complicated than whether a process is listening on a port.

Given that this proxy was to be used in front of a MySQL cluster, we had two options: either hard-code MySQL-specific knowledge into this application, or require an external process to provide application-aware health status. We opted for the latter, as it offers more extensibility and flexibility.

It shouldn't be too hard to add a flag and some extra logic to make the health check probe the TCP port instead of an HTTP endpoint.

menicosia commented 9 years ago

Hey @drnic,

I'm going to go ahead and close this issue. As always, feel free to re-open any time.


cloudfoundry/switchboard, issue #7