checkly / public-roadmap

Checkly public roadmap. All planned features, updates and tweaks.
https://checklyhq.com

Flapping detection #186

Open DmitryFrolovTri opened 2 years ago

DmitryFrolovTri commented 2 years ago

Is your feature request related to a problem? Please describe.

If a monitored site is behind a load balancer (as nearly all sites are now), it is possible that only 1 node of N fails the check once in a while. In that case Checkly sends an alert and then clears it, so the support team cannot react because the incident does not stay open for long. Problem statement: Checkly is not able to keep an alert open for a check that frequently succeeds and sometimes fails. It can only have an open alert for a check that is failing at this moment.

Describe the solution you'd like

I want Checkly to alert us when a site has an infrequent failure that repeats after some number of checks.

In the alert setup, where the rules are defined, add a radio button for a different way of detecting a failure. Let's call it "flapping detection logic" (please find a better name for it) :)

The idea of such a check is:

Then, during the check's lifetime, if the number of failed checks within the specified time duration is >= the specified number of checks, the alert is raised. During the next time duration, if there is already an open alert and the number of failed checks is again >= the specified number, the alert is kept open; otherwise it is closed.
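The rule described above can be sketched roughly as follows. This is only an illustration, not anything Checkly implements: the class name `FlappingDetector` and the settings `window_size` and `failure_threshold` are hypothetical, and the time duration is assumed to be expressed as a fixed number of checks (for example, 10 checks at a 1-minute interval).

```python
from dataclasses import dataclass


@dataclass
class FlappingDetector:
    """Window-based alert: stays open while each window has enough failures."""

    window_size: int        # checks per window (hypothetical setting)
    failure_threshold: int  # failed checks needed to raise or keep the alert
    alert_open: bool = False
    _seen: int = 0          # checks recorded in the current window
    _failed: int = 0        # failed checks in the current window

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return the current alert state."""
        self._seen += 1
        if not check_passed:
            self._failed += 1
        # Raise as soon as the threshold is reached inside the window.
        if self._failed >= self.failure_threshold:
            self.alert_open = True
        # At the window boundary, keep or close the alert, then start over.
        if self._seen == self.window_size:
            self.alert_open = self._failed >= self.failure_threshold
            self._seen = 0
            self._failed = 0
        return self.alert_open
```

With this sketch a single failure opens the alert, and the alert only closes once a full window passes with fewer failures than the threshold. A sliding window would be another reasonable choice; the fixed window simply follows the wording of the request.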

Generating alerts this way would allow flapping detection and would also alert on total downtime.

For example, in my case: we have 13 nodes behind a website, and sometimes 1 of them randomly fails and keeps failing. With the normal logic, Checkly raises an alert once a check randomly hits that one node and then closes it again. With the logic above I could set up the following: if I have one or more failures during 10 checks (10 minutes), I want an alert, and it should stay open while each following 10-minute interval continues to have at least that many failures.
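As a rough sanity check on the 13-node example, the snippet below estimates how often a 10-check window would contain at least one failure. It assumes the load balancer picks a node uniformly at random for every check, which the post does not state; the function name is hypothetical.

```python
def p_window_has_failure(nodes: int, bad_nodes: int, checks: int) -> float:
    """Probability that at least one of `checks` requests hits a bad node.

    Assumes every request is routed to a uniformly random node.
    """
    p_ok = (nodes - bad_nodes) / nodes  # chance a single check succeeds
    return 1.0 - p_ok ** checks


p = p_window_has_failure(nodes=13, bad_nodes=1, checks=10)
print(f"{p:.2f}")
```

Under that assumption the result is about 0.55, so with a threshold of one failure per 10 checks the alert would still close in a little under half of the windows. A longer window would keep it open more reliably.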

Describe alternatives you've considered

There is no other way to implement this with the current logic. However, since I am sure the above algorithm is not the only one for flapping detection, any other AI, smart, or self-adjusting mechanism would be good as well.

tnolet commented 2 years ago

@DmitryFrolovTri thanks for the extensive write-up. I think you are essentially describing SLOs, where there is an error budget for a time period. This is something I will keep in mind.
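To make the error-budget framing concrete, here is a minimal sketch, assuming the SLO is expressed as a target success ratio over a fixed number of checks. The function and parameter names are hypothetical and do not refer to any Checkly feature.

```python
def error_budget_remaining(total_checks: int, failed_checks: int, slo_target: float) -> float:
    """Return how many more failures the period can absorb before the SLO is missed.

    slo_target is the required success ratio, e.g. 0.99 for 99%.
    A negative result means the budget is already spent.
    """
    allowed_failures = (1.0 - slo_target) * total_checks
    return allowed_failures - failed_checks
```

In this framing an alert would stay open while the remaining budget is at or below zero, which is close to the threshold-per-period rule proposed in the request.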

alexnoyes commented 2 years ago

If the service behind the LB adds its node ID or GUID to a response header, could you get Checkly to store or act on it to identify the erroneous backend service?
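Whether Checkly can store such a header is not answered in this thread. As a generic sketch of the idea, assuming each check result is available as a pair of response headers and a pass/fail flag, and assuming a hypothetical header name `X-Node-Id`, failures could be tallied per backend node like this:

```python
from collections import Counter
from typing import Iterable, Mapping, Tuple


def failures_by_node(
    results: Iterable[Tuple[Mapping[str, str], bool]],
    header: str = "X-Node-Id",  # hypothetical header set by the backend
) -> Counter:
    """Count failed checks per backend node, keyed by the node ID header."""
    counts: Counter = Counter()
    for headers, passed in results:
        if not passed:
            # Responses without the header are grouped under "unknown".
            counts[headers.get(header, "unknown")] += 1
    return counts
```

A node that accounts for most of the failures would then be the first candidate to inspect.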