dora-metrics / pelorus

Automate the measurement of organizational behavior
https://pelorus.readthedocs.io/
Apache License 2.0
245 stars 83 forks source link

Prometheus rework #961

Open mateusoliveira43 opened 1 year ago

mateusoliveira43 commented 1 year ago

Fix

After #939, team members have agreed in the following:

Requirements

weshayutin commented 1 year ago

this does in fact seem to be a nice written representation of what we spoke about.

I don't recall discussing these two, but seems like the right move to me.

mpryc commented 1 year ago
  • [ ] remove deploy active metric

  • This is resolved by https://github.com/dora-metrics/pelorus/pull/943

  • [ ] OPTIONAL Test if it is possible to deploytime metric do not use exactly timestamp (prometheus would use timestamp it received (because of ~3horus receive problem) and use exactly timestamp as value)

  • agree, additional tests:

  • check scenario where we deploy test application let's call it TestApp

  • deploy TestApp it 10 times every 30min (can be a simple shell script with sleep)

  • wait 2h before going into grafana

  • After that test number of deploymets over the following periods:

  • last 2 hours, should give 0

  • previous 2h (so 4h ago -> 2h ago) should give 4 deployments;

  • last week - should give 10 deployments.

  • check scenario where we create 10 deployments using webhook exporter and then restart webhook exporter (just delete running webhook instance).

  • wait 10 min

  • check how many deployments happened in the last week

  • check how many deployments happened in the last 5 minutes

  • [ ] add clean up to deploytime, committime and failure exporter

  • This I think is't really required as the scrape always get the new Gauge object, so we could have this for free. Just need to check if the caching in committime requires some cleanup.

  • [ ] add clean up to webhook exporter (seems harder than on the old exporters)

  • Agree. There should be a lifetime of metric after which it is removed. The way I see it:

  • Collect metrics not older than X (as implemented in #943)

  • Add function to check if the time of event where it happened is not greater then Y and remove it if it's true.

  • [ ] add time window to exporters (except committime exporter)

  • As @weshayutin reworded above.

  • [ ] explain this new behavior in the docs to user

  • Agree.

mpryc commented 1 year ago

RE: cleanup functionality

I think we should create a new class for the metrics that inherits from GaugeMetricFamily, let's call it PelorusGaugeMetricFamily.

This class should have at least three additional methods:

def clear_all_metrics(self) -> int:
"""
Clears out all the in memory metrics from the GaugeMetricFamily.

Returns:
  - number of metrics removed
"""

def clear_metrics(self, threshold_sec: int) -> int:
"""
Clears out the in memory metrics from the GaugeMetricFamily, which are
above certain threshold 

Arg:
  - threshold_sec (int): The time threshold in seconds. All metrics that occurred `threshold_sec` seconds ago or earlier will be removed.

Returns:
  - number of metrics removed.
"""

def remove_metric(self, labels: Sequence[str]) -> int:
"""
Remove particular metric from the GaugeMetricFamily.
Reverse function to GaugeMetricFamily.add_metric().

Arg:
  - labels Sequence[str]: The labels representing metric to be removed.

Returns:
  - number of metrics removed.
"""