ava-innersource / Liquid-Application-Framework-1.0-deprecated

Liquid is a framework to speed up the development of microservices
MIT License

Design and implement health check for services #194

Open bruno-brant opened 4 years ago

bruno-brant commented 4 years ago

After removing the old broken health check system (#193) and following the proposal in #60, we need to rethink our health check approach. This is the discussion thread that should define what we want to do.

bruno-brant commented 4 years ago

My main concern, which I have voiced in person a few times, is the cascading failure provoked by health-check approaches: if A depends on B, which depends on C, and C goes out of service, then A and B also go down.

Now take into account that Kubernetes, the target environment for Liquid apps, will back off if it repeatedly fails to start a pod. The back-off is exponential and capped, by default, at five minutes per restart attempt. This means that if C fails for some amount of time, A and B can stay down for several extra minutes after C is back online, waiting out the back-off.
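For reference, the kubelet's crash-loop back-off doubles on each restart, starting at 10 seconds and capped at 5 minutes by default (it resets after the container runs successfully for a while). A small sketch of that schedule (the function name is illustrative, not a Kubernetes API):

```python
def crashloop_delays(restarts: int, base: int = 10, cap: int = 300) -> list:
    """Approximate kubelet crash-loop back-off delays, in seconds.

    Kubernetes starts at 10s, doubles per failed restart, and caps
    the delay at 300s (5 minutes) by default.
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

# After six failed starts the pod is already waiting the full 5 minutes:
# crashloop_delays(6) -> [10, 20, 40, 80, 160, 300]
```

So a dependency that flaps for a few minutes is enough to push every upstream service into the maximum back-off window.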

So, in sum, I don't think health-checking everything is the right approach, unless we also propose configuring Kubernetes to "never" back off (or to back off for only a few minutes, for instance).
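For context, a sketch of how the two Kubernetes probe types are wired into a pod spec (image, paths, and ports are illustrative). Note that the probe cadence and failure threshold are tunable per container, but the crash-loop back-off itself is not; also, only liveness failures trigger a restart, while readiness failures merely take the pod out of service endpoints:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: app
      image: example/app:latest     # placeholder image
      livenessProbe:                # failure => container restart (and back-off)
        httpGet:
          path: /health/live
          port: 8080
        periodSeconds: 10           # probe interval
        failureThreshold: 3         # consecutive failures before restart
      readinessProbe:               # failure => removed from load balancing only
        httpGet:
          path: /health/ready
          port: 8080
        periodSeconds: 10
```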


What if C is down and I call B, you may be asking? Well, you will get some sort of 500 error, which is about the same as what you would get if C being down caused B to go down: a 500 error.

The core problem is that health checking is used to reboot the application. Signaling bad health is the same as saying, "I think you should restart this app." That is great if, for instance, you need to refresh credentials that are only loaded during boot, or if you have a memory leak and your pod is running out of memory. But it doesn't help you at all when a remote system is failing, since it's the remote system that needs rebooting.
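One way to frame this argument in code: keep the restart-triggering check ("liveness") limited to local conditions that a restart can actually fix, and report remote dependencies through a separate check ("readiness") that only pulls the pod out of rotation. A minimal sketch, with hypothetical condition names:

```python
def check_liveness(memory_ok: bool, event_loop_responsive: bool) -> bool:
    """Liveness: only local conditions that a restart can actually fix.

    Deliberately ignores remote dependencies, so a downstream outage
    never puts this pod into a crash-loop back-off.
    """
    return memory_ok and event_loop_responsive


def check_readiness(live: bool, dependency_reachable: bool) -> bool:
    """Readiness: includes remote dependencies.

    Failing readiness removes the pod from load balancing without
    restarting it; traffic resumes as soon as the dependency recovers.
    """
    return live and dependency_reachable
```

Under this split, C going down makes B unready (it stops receiving traffic) but never makes B "unhealthy", so neither B nor A accumulates restart back-off while waiting for C.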