Open · y-eight opened this issue 11 months ago
Currently, 3 different latency metrics are available.
If the health check fails (internally), the recorded latency is 0, and so is the status code.
This might be acceptable for the counter and the plain latency metric, but it is probably not best practice for the histogram: every failed check still puts a 0-second observation into the lowest buckets.
Example with 2 errors and 308 total requests:
```
# HELP sparrow_latency_duration Latency of targets in seconds
# TYPE sparrow_latency_duration histogram
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.005"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.01"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.025"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.05"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.1"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.25"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.5"} 288
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="1"} 307
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="2.5"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="5"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="10"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="+Inf"} 308
sparrow_latency_duration_sum{target="https://gitlab.devops.telekom.de"} 120.39378972299998
sparrow_latency_duration_count{target="https://gitlab.devops.telekom.de"} 308
```
As @puffitos stated in https://github.com/caas-team/sparrow/pull/45, we should probably solve this with labelling or another set of metrics, e.g. a label for the check's state. A minimal sketch of the labelling idea is shown below.
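A minimal sketch of the labelling approach with prometheus/client_golang. The metric name matches the output above, but the `status` label, the variable names, and the `observe` helper are assumptions for illustration, not sparrow's actual code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// latency carries an extra "status" label so the 0s observations of failed
// checks land in their own series instead of skewing the buckets of
// successful checks. (Hypothetical sketch, not sparrow's actual code.)
var latency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "sparrow_latency_duration",
		Help: "Latency of targets in seconds",
	},
	[]string{"target", "status"},
)

func init() {
	prometheus.MustRegister(latency)
}

// observe records one check result under the matching status label.
func observe(target string, seconds float64, failed bool) {
	status := "ok"
	if failed {
		status = "failed"
	}
	latency.WithLabelValues(target, status).Observe(seconds)
}
```

One caveat with this variant: each additional `status` value duplicates the whole bucket set per target (12 buckets plus `_sum` and `_count`), so the series count grows accordingly.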
Maybe we should create an extra metric for failed requests and move those failed requests there (see the sketch below). This would fix the issue of the buckets filling up with 0-second observations, and also provide an easy way to monitor failed requests.
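A minimal sketch of that idea, again with client_golang. The `sparrow_latency_failed_total` counter name and the `record` helper are hypothetical; the point is that failed checks increment a counter and are never observed in the histogram:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// latency only ever sees real measurements from successful checks.
	latency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "sparrow_latency_duration",
			Help: "Latency of targets in seconds",
		},
		[]string{"target"},
	)
	// failures counts checks that failed internally. (Hypothetical name.)
	failures = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "sparrow_latency_failed_total",
			Help: "Number of failed latency checks per target",
		},
		[]string{"target"},
	)
)

func init() {
	prometheus.MustRegister(latency, failures)
}

// record routes each check result to exactly one of the two metrics.
func record(target string, seconds float64, failed bool) {
	if failed {
		// Do not record a 0s observation; count the failure instead.
		failures.WithLabelValues(target).Inc()
		return
	}
	latency.WithLabelValues(target).Observe(seconds)
}
```

This keeps the histogram's bucket distribution meaningful while still making failures easy to alert on, e.g. via a rate over the failure counter.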