Open y-eight opened 2 months ago
@y-eight While this bug is not exactly the same as #77, it's quite similar and I'd see it as one bigger issue that should be addressed in the near future.
The problem here is quite another; we are probably misusing prometheus metrics. A more appropriate solution would be to use a counter with the status and url labels (to count how many requests were successful or not) and just a latency_duration_seconds, which just returns how much time the last call needed to finish.
This way, we always return the core information of the check (which is the latency) and also expose some additional information.
EDIT:
Just for completion's sake, after the chat with @y-eight, we decided that dragging the status of the response along doesn't make much sense for a latency check.
The latency check should only measure how much it took to execute a request. Everything else may be relevant information, but the critical information that the check should offer is just the total duration in seconds a request took to complete.
Is there an existing issue for this?
Current Behavior
Currently the sparrow is exposing metrics for each check. In this example, I will reference the latency check but this is related to all the checks.
The latency check provides metrics with the statuscode as metadata in the metric itself, eg.:
sparrow_latency_duration_seconds{status="200",target="https://xyz.telekom.de/"} 0.1750766
If in one of the previous runs/check intervall the status was not 200 (eg. 500) the metric will be exposed with
status="500"
. If the current run is getting a 200 status code the metric will be exposed withstatus="200"
. The metric withstatus="500"
is still present. The problem is that the sparrow is not exposing the last state. This is misleading since the status 500 is not the last state.Expected Behavior
The prom metrics should represent the last state of the check. Therefore the old metrics should be cleared before updating with the current state.
Steps To Reproduce
No response
Relevant logs and/or screenshots, environment information, etc.
Who can address the issue?
No response
Anything else?
No response