caas-team / sparrow

A monitoring tool to gather infrastructure network information
Apache License 2.0
6 stars 4 forks source link

Bug: No reset of metrics after each check run #133

Open y-eight opened 2 months ago

y-eight commented 2 months ago

Is there an existing issue for this?

Current Behavior

Currently the sparrow is exposing metrics for each check. In this example, I will reference the latency check but this is related to all the checks.

The latency check provides metrics with the statuscode as metadata in the metric itself, eg.: sparrow_latency_duration_seconds{status="200",target="https://xyz.telekom.de/"} 0.1750766

If in one of the previous runs/check intervall the status was not 200 (eg. 500) the metric will be exposed with status="500". If the current run is getting a 200 status code the metric will be exposed with status="200". The metric with status="500" is still present. The problem is that the sparrow is not exposing the last state. This is misleading since the status 500 is not the last state.

Expected Behavior

The prom metrics should represent the last state of the check. Therefore the old metrics should be cleared before updating with the current state.

Steps To Reproduce

No response

Relevant logs and/or screenshots, environment information, etc.

image

Who can address the issue?

No response

Anything else?

No response

lvlcn-t commented 2 months ago

@y-eight While this bug is not exactly the same as #77, it's quite similar and I'd see it as one bigger issue that should be addressed in the near future.

puffitos commented 1 month ago

The problem here is quite another; we are probably misusing prometheus metrics. A more appropriate solution would be to use a counter with the status and url labels (to count how many requests were successful or not) and just a latency_duration_seconds, which just returns how much time the last call needed to finish.

This way, we always return the core information of the check (which is the latency) and also expose some additional information.

EDIT:

Just for completion's sake, after the chat with @y-eight, we decided that dragging the status of the response along doesn't make much sense for a latency check.

The latency check should only measure how much it took to execute a request. Everything else may be relevant information, but the critical information that the check should offer is just the total duration in seconds a request took to complete.