checkly / public-roadmap

Checkly public roadmap. All planned features, updates and tweaks.
https://checklyhq.com

Double check on degradations #312

Open Alex-shved opened 1 year ago

Alex-shved commented 1 year ago

Is your feature request related to a problem? We have the degradation threshold set to 2 seconds. Over the past few months, the number of false positives from Checkly has been rising. Quite often, the degradation alert triggers when the request takes longer than 2 seconds because of delays in the DNS and TCP phases. The DNS delays are probably caused by the check request coinciding with a DNS cache reset in AWS. If I'm not mistaken, AWS's DNS is what resolves names for requests from Checkly.

[Image: DNS]

The TCP delays are probably caused by network slowdowns. Sometimes a request from a checker to our service takes much longer than usual (about 2-4 seconds). This points to a problem on the network, not a problem in the service itself.

[Image: TCP]

A third phase, waiting at the start of the connection, has not caused problems for us so far, but like the two phases described above it belongs to "CONNECTION START".

These 3 phases usually have nothing to do with how the tested service itself performs, and they distort the statistics.
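To make the point above concrete, here is a minimal sketch of splitting one request's timing into the "CONNECTION START" part and everything else. The phase names (`dns`, `tcp`, `wait`, `first_byte`, `download`) and the sample numbers are hypothetical and chosen for illustration only; they are not taken from Checkly's API or from real measurements.

```python
from typing import Dict, Tuple

# Hypothetical phase names; "dns", "tcp" and "wait" stand for the three
# phases grouped under "CONNECTION START" in this issue.
CONNECTION_START_PHASES = ("dns", "tcp", "wait")


def split_timing(phases_ms: Dict[str, float]) -> Tuple[float, float]:
    """Return (connection_start_ms, remaining_ms) for one request."""
    connection_start = sum(phases_ms.get(name, 0.0) for name in CONNECTION_START_PHASES)
    total = sum(phases_ms.values())
    return connection_start, total - connection_start


# Made-up sample: a slow DNS lookup pushes the total over a 2000 ms threshold
# even though the phases after connection start stay fast.
sample = {"dns": 1400.0, "tcp": 300.0, "wait": 5.0, "first_byte": 350.0, "download": 20.0}
connection_start_ms, remaining_ms = split_timing(sample)
total_ms = connection_start_ms + remaining_ms
```

In this made-up sample the total exceeds 2 seconds, yet most of it is spent before the request reaches the service.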

Describe the solution you'd like Implement a re-check when a degradation is triggered, similar to the one already implemented for failures.
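A minimal sketch of the requested behavior, assuming a single immediate re-check. The names `classify_response`, `check_with_recheck`, `run_check` and `max_rechecks` are hypothetical and only illustrate the idea; this is not how Checkly implements its re-check for failures.

```python
from typing import Callable

DEGRADED_AFTER_MS = 2000.0  # the 2-second degradation threshold from this issue


def classify_response(response_time_ms: float, degraded_after_ms: float = DEGRADED_AFTER_MS) -> str:
    """Classify a single run by its response time only."""
    return "degraded" if response_time_ms > degraded_after_ms else "passing"


def check_with_recheck(
    run_check: Callable[[], float],
    degraded_after_ms: float = DEGRADED_AFTER_MS,
    max_rechecks: int = 1,
) -> str:
    """Run the check; if it looks degraded, re-run it before reporting.

    run_check is a hypothetical callable that performs one request and
    returns its response time in milliseconds.
    """
    status = classify_response(run_check(), degraded_after_ms)
    rechecks = 0
    while status == "degraded" and rechecks < max_rechecks:
        status = classify_response(run_check(), degraded_after_ms)
        rechecks += 1
    return status


# Usage with canned timings: a one-off slow run followed by a normal run.
timings = iter([2600.0, 800.0])
result = check_with_recheck(lambda: next(timings))
```

With this approach a one-off slow run is reported as passing, while a run that is still slow on the re-check is reported as degraded.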

Describe alternatives you've considered As an alternative, I tried deactivating the degradation alert level by setting the failed check's trigger time lower than the degradation's trigger time. This got rid of the problems associated with DNS and TCP, but now any slowdown in the service is recognized as a failure, which also distorts the statistics and requires additional investigation of each failed check. I would like to have data on both degraded and failed checks.
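The effect of this workaround can be sketched with a simple two-threshold classifier. The function and parameter names are hypothetical, and the order of evaluation (failure threshold first) is an assumption consistent with the behavior described above, not a confirmed detail of Checkly.

```python
def classify_with_thresholds(
    response_time_ms: float,
    degraded_after_ms: float,
    failed_after_ms: float,
) -> str:
    """Classify one run; the failure threshold is assumed to be checked first."""
    if response_time_ms > failed_after_ms:
        return "failed"
    if response_time_ms > degraded_after_ms:
        return "degraded"
    return "passing"


# Usual setup: degraded at 2000 ms, failed at 5000 ms (5000 is an example value).
normal = classify_with_thresholds(2500.0, degraded_after_ms=2000.0, failed_after_ms=5000.0)

# Workaround: failure threshold set below the degradation threshold, so the
# "degraded" branch can never be reached.
workaround = classify_with_thresholds(2500.0, degraded_after_ms=2000.0, failed_after_ms=1500.0)
```

Under the workaround every slow run lands in "failed", which is why the degradation data is lost.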

tnolet commented 1 year ago

@Alex-shved sorry for the late reply!