ava-innersource / Liquid-Application-Framework-1.0-deprecated

Liquid is a framework to speed up the development of microservices
MIT License

Design and implement health check for services #194

Open bruno-brant opened 4 years ago

bruno-brant commented 4 years ago

After removing the old broken health check system (#193) and following the proposal in #60, we need to rethink our health check approach. This is the discussion thread that should define what we want to do.

bruno-brant commented 4 years ago

My main concern, which I have voiced in person a few times, is the cascading failure provoked by health-check approaches: if A depends on B, which depends on C, and C goes out of service, then A and B also go down.

Now take into account that Kubernetes, the target environment for Liquid apps, will back off if it repeatedly fails to start a pod. The back-off is exponential and capped, by default, at five minutes per restart attempt. This means that if C fails for some amount of time, A and B can stay down for several extra minutes after C is back online, waiting out the back-off.
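For reference, the kubelet's crash-loop back-off doubles on each restart, starting at 10 seconds and capped at 5 minutes by default (it resets after the container runs successfully for a while). A small sketch of that schedule (the function name is illustrative, not a Kubernetes API):

```python
def crashloop_delays(restarts: int, base: int = 10, cap: int = 300) -> list:
    """Approximate kubelet crash-loop back-off delays, in seconds.

    Kubernetes starts at 10s, doubles per failed restart, and caps
    the delay at 300s (5 minutes) by default.
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

# After six failed starts the pod is already waiting the full 5 minutes:
# crashloop_delays(6) -> [10, 20, 40, 80, 160, 300]
```

So a dependency that flaps for a few minutes is enough to push every upstream service into the maximum back-off window.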

So, in sum, I don't think health-checking everything is the right approach, unless we also propose configuring Kubernetes to "never" back off (or to back off for only a few minutes, for instance).
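For context, a sketch of how the two Kubernetes probe types are wired into a pod spec (image, paths, and ports are illustrative). Note that the probe cadence and failure threshold are tunable per container, but the crash-loop back-off itself is not; also, only liveness failures trigger a restart, while readiness failures merely take the pod out of service endpoints:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: app
      image: example/app:latest     # placeholder image
      livenessProbe:                # failure => container restart (and back-off)
        httpGet:
          path: /health/live
          port: 8080
        periodSeconds: 10           # probe interval
        failureThreshold: 3         # consecutive failures before restart
      readinessProbe:               # failure => removed from load balancing only
        httpGet:
          path: /health/ready
          port: 8080
        periodSeconds: 10
```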


What if C is down and I call B, you may be asking? Well, you will get some sort of 500 error, which is about the same as what you would get if C being down caused B to go down: a 500 error.

The core problem is that health checking is used to reboot the application. Signaling bad health is the same as saying, "I think you should restart this app." That is great if, for instance, you need to refresh credentials that are only loaded during boot, or if you have a memory leak and your pod is running out of memory. But it doesn't help you at all when a remote system is failing, since it's the remote system that needs rebooting.
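One way to frame this argument in code: keep the restart-triggering check ("liveness") limited to local conditions that a restart can actually fix, and report remote dependencies through a separate check ("readiness") that only pulls the pod out of rotation. A minimal sketch, with hypothetical condition names:

```python
def check_liveness(memory_ok: bool, event_loop_responsive: bool) -> bool:
    """Liveness: only local conditions that a restart can actually fix.

    Deliberately ignores remote dependencies, so a downstream outage
    never puts this pod into a crash-loop back-off.
    """
    return memory_ok and event_loop_responsive


def check_readiness(live: bool, dependency_reachable: bool) -> bool:
    """Readiness: includes remote dependencies.

    Failing readiness removes the pod from load balancing without
    restarting it; traffic resumes as soon as the dependency recovers.
    """
    return live and dependency_reachable
```

Under this split, C going down makes B unready (it stops receiving traffic) but never makes B "unhealthy", so neither B nor A accumulates restart back-off while waiting for C.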