discourse / prometheus_exporter

A framework for collecting and aggregating prometheus metrics
MIT License
525 stars 153 forks source link

Added basic GoodJob collectors #280

Closed AndersGM closed 1 year ago

AndersGM commented 1 year ago

I need metrics for GoodJob, so I've given it a shot.

SamSaffron commented 1 year ago

looking good but I think there is a bit of fine tuning we need here:

see: https://prometheus.io/docs/practices/naming/

there is a total postfix missing and some of these should be Counter vs Gauge no?

AndersGM commented 1 year ago

@SamSaffron Thank you very much for taking your time to look at my PR!

I've never implemented my own metrics, so I might very well be wrong. I chose gauges for all of the metrics since data is read from the database, and jobs can be destroyed and retried at will. Consequently, the numbers can actually decrement for all the metrics, which isn't allowed for counters (AFAIK).

To get actual counters, I think we'll need to either extend GoodJob or count through callbacks.

Let me know what you think!

SamSaffron commented 1 year ago

be sure to read up about it, counters are "smart" they can go back to 0, it is fine, prom adjusts to it.

Gauges are for things where "rate" does not matter like memory. Counters are for things where "rate" matters, like counting stuff.

AndersGM commented 1 year ago

be sure to read up about it, counters are "smart" they can go back to 0, it is fine, prom adjusts to it.

Yes - my point is that since, with my suggested implementation, we read data directly from the database, the numbers would never really be reset, but rather (potentially) decrease by some number. That could be when the user manually delete some of the jobs through GoodJob's user interface (so the number of succeeded jobs could go from 200 to 199). It could also be when jobs older than 14days are deleted through automatic clean ups. Counters can only ever increment or be reset to zero - never decrease :-)

I completely agree that counters would be favorable for the use case, however, we'll need to build on top of GoodJob (eg. count each time a job has completed). I can see there is already some work going on in GoodJob about adding stats, that we can then read from, but I think we'll wait quite some time before its ready, so this is best effort at the time of writing AFAIK :-)

SamSaffron commented 1 year ago

I see! will merge this then, naming is not ideal prom likes to recommend _total. But it seems reasonably cohesive with our existing naming

AndersGM commented 1 year ago

@SamSaffron Agree! At least _total is prohibited for gauge: https://github.com/discourse/prometheus_exporter/blob/8d77c53333b5793f26d8e90bbeecdab850acc3f8/lib/prometheus_exporter/metric/gauge.rb#L9

I will try and work on a better implementation in the GoodJob project.

lauer commented 5 months ago

@AndersGM did you find a solution to handle that those counters can decrease doing clean up script? As I see it, the problem would persist in this new pr #302

AndersGM commented 5 months ago

@lauer I implemented metrics for GoodJob in this PR to address the issue: https://github.com/bensheldon/good_job/pull/984 - haven't looked into it since. The file is now improved and moved to https://github.com/bensheldon/good_job/blob/main/lib/good_job/job_performer/metrics.rb to support MultiScheduler.

As far as I can see I forgot about it, and never actually implemented it in prometheus_exporter. Consequently, it is still just using the database counts. I'll try and look into using the metrics from GoodJob later today - it should be straight forward to use the metrics there :-)

lauer commented 5 months ago

okay, an other issue we have, is that the SQL queries around database count is showed when we are using rails console.