distribworks / dkron

Dkron - Distributed, fault tolerant job scheduling system https://dkron.io
GNU Lesser General Public License v3.0

Health endpoint improvement #652

Open davidgengenbach opened 4 years ago

davidgengenbach commented 4 years ago

Is your feature request related to a problem? Please describe.

We had a problem with a plugin process that was killed (due to OOM), which left jobs not executing.

Describe the solution you'd like

The health endpoint could check whether all plugin processes are up and running. More health checks in general would be helpful, e.g. cluster health.

The endpoint could return a non-200 status code when the instance is not healthy!

vcastellm commented 4 years ago

Already on the roadmap, will work on this.

vcastellm commented 4 years ago

@davidgengenbach this is not exactly the improvement you mention, but I think it's better to fail fast when a plugin is missing. When Dkron runs as a service, the OS supervisor will take care of restarting the process. This is the case with processor plugins.

davidgengenbach commented 4 years ago

@victorcoder Yes, I agree. A non-functional (= killed) plugin should really result in the main process being killed.

The linked PR only kills the main process once a job execution has failed (e.g. because of a plugin error), which may be much later than the moment the plugin process actually exited. A periodic health check could close this gap.