dora-metrics / pelorus

Automate the measurement of organizational behavior
https://pelorus.readthedocs.io/
Apache License 2.0

Mean Time to Restore prometheus rules can miss tickets in calculation #1127

Open etsauer opened 3 months ago

etsauer commented 3 months ago

OpenShift version

Not related to OpenShift

Problem description

Something is wrong with how we calculate the mean time to restore for individual issues: the recording rule sometimes skips an issue entirely. I have not found the cause yet.

Here's a readout from my Prometheus:

min by (issue_number, service) (min_over_time(failure_resolution_timestamp{app=~".*pelorus-api.*"}[2d] @ 1710475200)) - min by (issue_number, service) (min_over_time(failure_creation_timestamp{app=~".*pelorus-api.*"}[2d] @ 1710475200))
{issue_number="23", service="github-failure-exporter"}  525
{issue_number="24", service="github-failure-exporter"}  64822
{issue_number="25", service="github-failure-exporter"}  66518
{issue_number="29", service="github-failure-exporter"}  6315
{issue_number="22", service="github-failure-exporter"}  564

sdp:time_to_restore:by_issue{app=~".*pelorus-api.*"}[2d] @ 1710475200

sdp:time_to_restore:by_issue{app="/pelorus-api/", container="github-failure-exporter", endpoint="http", instance="10.129.0.25:8080", issue_number="29", job="github-failure-exporter", namespace="pelorus", pod="github-failure-exporter-1-lnssw", service="github-failure-exporter"}
6315 @1710428928.453
6315 @1710428958.453
6315 @1710428988.453
6315 @1710429018.453
6315 @1710429048.453
6315 @1710429078.453
6315 @1710429108.453
6315 @1710429138.453
6315 @1710429168.453
6315 @1710429198.453

sdp:time_to_restore:by_issue{app="/pelorus-api/", container="github-failure-exporter", endpoint="http", instance="10.129.0.32:8080", issue_number="22", job="github-failure-exporter", namespace="pelorus", pod="github-failure-exporter-1-lnssw", service="github-failure-exporter"}
564 @1710340278.453
564 @1710340308.453
564 @1710340338.453
564 @1710340368.453
564 @1710340398.453
564 @1710340428.453
564 @1710340458.453
564 @1710340488.453
564 @1710340518.453

sdp:time_to_restore:by_issue{app="/pelorus-api/", container="github-failure-exporter", endpoint="http", instance="10.129.0.32:8080", issue_number="23", job="github-failure-exporter", namespace="pelorus", pod="github-failure-exporter-1-lnssw", service="github-failure-exporter"}
525 @1710355968.453
525 @1710355998.453
525 @1710356028.453
525 @1710356058.453
525 @1710356088.453
525 @1710356118.453
525 @1710356148.453

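These two queries should yield the same set of issues, but they do not. The raw query returns five issues (22, 23, 24, 25, and 29), while sdp:time_to_restore:by_issue has samples for only three (22, 23, and 29). Issues 24 and 25 are missing from the recorded series.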
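To make the gap concrete, here is a minimal Python sketch that diffs the two readouts. The values are copied by hand from the output above; the helper names `missing_issues` and `mean` are hypothetical and not part of Pelorus. The averaging step assumes that mean time to restore is a plain average over the per-issue values, which may not match the actual rule definition.

```python
# Per-issue time to restore in seconds, keyed by issue_number.
# Values are copied by hand from the two readouts in this issue.
raw_query = {"22": 564, "23": 525, "24": 64822, "25": 66518, "29": 6315}
recorded_rule = {"22": 564, "23": 525, "29": 6315}


def missing_issues(expected, actual):
    """Return issue numbers present in `expected` but absent from `actual`."""
    return sorted(set(expected) - set(actual), key=int)


def mean(values):
    """Plain arithmetic mean (assumption: MTTR is a simple average)."""
    values = list(values)
    return sum(values) / len(values)


missing = missing_issues(raw_query, recorded_rule)
print("missing from recorded rule:", missing)
print("mean over raw query:     ", mean(raw_query.values()))
print("mean over recorded rule: ", mean(recorded_rule.values()))
```

Under that averaging assumption, dropping issues 24 and 25 moves the mean from 27748.8 seconds to 2468.0 seconds, so the skipped issues are not a cosmetic problem.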

Steps to reproduce

  1. Install Pelorus with github-failure-exporter
  2. Open and close several GitHub issues

Current behavior

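sdp:time_to_restore:by_issue has no samples for some issues that the raw query returns (see the readout above).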

Expected behavior

sdp:time_to_restore:by_issue should have a series for every issue that the raw query returns.
