Recruitee / existence

Asynchronous dependency health checks library.
Apache License 2.0
21 stars 0 forks source link

Suggestion: Include Predefined Checks for Supervision Tree Readiness #5

Open eric0fw opened 2 months ago

eric0fw commented 2 months ago

Hi,

First, I want to thank you for developing Existence. I have been exploring its features and it seems like an excellent tool for managing health checks in Elixir applications.

I have a suggestion for a predefined check that I believe could enhance the library: a built-in check to ensure that the entire supervision tree has fully started, utilizing a "terminator" process or similar mechanism. This would be particularly useful for readiness probes in Kubernetes deployments.

Proposed Feature:

Benefits:

  1. Reliability: Ensures that all necessary processes are up and running before marking the application as ready.
  2. Ease of Use: Provides a ready-to-use solution for a common requirement, reducing the need for custom implementations.
  3. Kubernetes Integration: Enhances the ability to use Existence with Kubernetes readiness probes, providing a more robust setup for production deployments.
  4. Simplified Readiness Check: This check wouldn't be used on its own but could be useful in determining readiness without the need to implement numerous individual checks.

Considerations:

I appreciate your time and consideration of this suggestion. Looking forward to hearing your thoughts!

Best regards,

up2jj commented 2 months ago

Hey @eric0fw,

Thanks for your kind words :) We are using the library in production and it is not causing any problems.

The additional supervision tree check obviously makes sense for environments running in a k8 cluster. Depending on the complexity of the application, checking just the last process (child) may not be sufficient to consider the application ready:

  1. previous processes may have been started, but their operation may cause repeated restarts. In such a situation, Supervisor.which_children/1 returns a status of :restarting instead of pid for such a process.
  2. previous processes may have been started, but when the callback GenServer.init/1 returns :ignore, initialisation of the supervision tree continues but the process is ignored:

    Returning :ignore will cause start_link/3 to return :ignore and the process will exit normally without entering the loop or calling terminate/2. If used when part of a supervision tree the parent supervisor will not fail to start nor immediately try to restart the GenServer. The remainder of the supervision tree will be started and so the GenServer should not be required by other processes.

source: https://hexdocs.pm/elixir/1.17.2/GenServer.html#c:init/1

In both of these situations I would consider the supervision tree not ready, especially if all processes are required to run the app. Nevertheless, the problem is interesting and we could look into it.

@andrzej-mag I would be happy to hear your thoughts on this topic.

andrzej-mag commented 2 days ago

I am sorry for late reply. I didn't receive any notification and I wasn't aware of this discussion.

Application supervision trees are created from stable and trusted processes usually. If some supervised OTP process would crash infrequently, process would be restarted so quickly that it would be way below any orchestration platform time resolution threshold. If supervised process would crash continuously, application supervisor should terminate itself and application - no need for any additional health-checks. I don't see any practical value in health-checking application supervision tree or single tree processes. Situation is different with health-checking unsupervised external dependencies which can fail due to uncontrolled reasons (eg. network issue) for extended time.

To health-check some important OTP process you can use for example is_pid(Process.whereis(:some_important_process)) in the check callback implementation. If checked OTP process is supervised, supervisor restart intensity and period in reference to orchestration timeouts should be considered probably.

Existence should be started as a last process in the supervision tree as all health-checked dependencies should be started first anyway. Library will act then as a "terminator" process mentioned in the @eric0fw comment. To use Existence as a readiness check with dependencies requiring substantial time to initialize you can set initial_delay in the check configuration.