Prometheus metrics support

qw1mb0 commented 5 years ago

Good day.

Thank you very much for the supercronic - this is really very cool software.

However, we encountered a problem that we can’t find out if cron ended up with an error.

Do you plan to support metrics that prometheus can pick up? With information about the finished/failed tasks.

sevagh commented 5 years ago

I'm interested in this. I came up with a quick and dirty PR to add a metric, supercronic_cron_success, with the labels command, schedule, position, which emits 1.0 on success and 0.0 on failure:

Fake crontab, elko is a bogus command that doesn't exist:

sevagh:supercronic $ cat fake.crontab
*/1 * * * * * * echo "sevag test"

*/1 * * * * * * elko "sevag test"

Running supercronic:

sevagh:supercronic $ ./supercronic  -prometheus-listen-address=":9091" ./fake.crontab
INFO[2019-05-21T16:22:33-07:00] read crontab: ./fake.crontab             
INFO[2019-05-21T16:22:34-07:00] starting                                      iteration=0 job.command="echo \"sevag test\"" job.position=0 job.schedule="*/1 * * * * * *"
INFO[2019-05-21T16:22:34-07:00] starting                                      iteration=0 job.command="elko \"sevag test\"" job.position=1 job.schedule="*/1 * * * * * *"
INFO[2019-05-21T16:22:34-07:00] sevag test                                    channel=stdout iteration=0 job.command="echo \"sevag test\"" job.position=0 job.schedule="*/1 * * * * * *"
INFO[2019-05-21T16:22:34-07:00] /bin/sh: elko: command not found              channel=stderr iteration=0 job.command="elko \"sevag test\"" job.position=1 job.schedule="*/1 * * * * * *"
INFO[2019-05-21T16:22:34-07:00] job succeeded                                 iteration=0 job.command="echo \"sevag test\"" job.position=0 job.schedule="*/1 * * * * * *"
ERRO[2019-05-21T16:22:34-07:00] error running command: exit status 127        iteration=0 job.command="elko \"sevag test\"" job.position=1 job.schedule="*/1 * * * * * *"

The metrics endpoint:

supercronic_cron_success{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 1
supercronic_cron_success{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 0

If the maintainers are interested in this I'd be happy to take feedback.

sevagh commented 5 years ago

https://github.com/aptible/supercronic/compare/master...sevagh:master

qw1mb0 commented 5 years ago

@sevagh Can you add a metric with information about the duration of the task?

sevagh commented 5 years ago

Sure. This gets a bit messy though. Duration could be the value of a gauge:

supercronic_cron_success{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 1
supercronic_cron_success{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 0
supercronic_cron_duration{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 0.3
supercronic_cron_duration{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 0.4

Also, one thing to note is that these are the "latest" values - latest success status, latest duration, etc. You would need some PromQL expressions to look at historical durations.

qw1mb0 commented 5 years ago

It is very cool! Can you make a PR? @krallin what do you think about it?

sevagh commented 5 years ago

It is very cool! Can you make a PR?

I didn't actually implement the duration part - that was just an example paste. Still figuring it out.

krallin commented 5 years ago

It is very cool! Can you make a PR? @krallin what do you think about it?

I think this is a good idea and certainly worthwhile!

Looking at @sevagh 's code, here are a few things I would suggest with regard to what is being monitored:

We should expose a gauge representing how many instances of a given job are running at a given time. We can increment this at the start of the the runThisJob closure defined in StartJob then defer decrementing it.
We should add counters for executions, failures and successes (as opposed to the "last execution" gauge you have right now — more on this below). We can increment the relevant counters in the runThisJob closure defined in StartJob.
We should add a counter for jobs instances that exceed their deadline? We can increment this in monitorJob when the deadline is hit.
Measuring timings is a little tricky, since Prometheus wants us to aggregate those metrics to an extent... However, it's also very useful! I suppose we could use a histogram here with some arbitrary but reasonable buckets (e.g. 10 seconds, 30 seconds, 1 minute, 2 minutes, 5 minutes, 10 minutes, 30 minutes, 1 hour?)

This brings me to one concern I have with your current implementation, which is the "last execution" metric you're exposing right now. I'm not a big fan of this approach.

Here's why: "last execution" isn't a well-defined concept when you might a) have overlapping executions of jobs and b) your jobs may execute multiple times (or no times at all) between Prometheus scrapes.

For example, if your job executes every 30 seconds, but you poll every 2 minutes, then looking at the last execution reports in Prometheus, an end-user could be missing errors completely. If we use a failure counter instead, end-users can use e.g. the increase() function to accurately report on failures instead.

To summarize: I'm concerned that reporting on last execution provides users with a convenient footgun that might come back to bite them later on. I would prefer to encourage them to take the time to understand what Supercronic is actually doing, even if this means a little more setup work to use those metrics.

Let me know what you think?

As far as the code itself is concerned, here are a few minor comments:

We should make sure to shutdown the http sever when Supercronic exits.
Can we reset metrics when Supercronic's configuration is reloaded? We probably should, or the results won't make sense.

Separately, we should check how much bigger the binary becomes with this additional dependency. This isn't something we have made guarantees about so far, but I'd like to know what we're getting into :)

sevagh commented 5 years ago

I like all of the suggestions and will work on them ASAP.

Prometheus has an opinionated exposition format, and sometimes there's a bit of mental gymnastics required to get metrics to align with the KPIs of a specific codebase. I appreciate the detailed suggestions.

sevagh commented 5 years ago

These are looking much better with @krallin's suggestions:

# HELP supercronic_currently_running count of currently running cron executions
# TYPE supercronic_currently_running gauge
supercronic_currently_running{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 0
supercronic_currently_running{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 0
supercronic_currently_running{command="sleep 10",position="2",schedule="*/5 * * * * * *"} 1
# HELP supercronic_deadline_exceeded count of exceeded deadline cron executions
# TYPE supercronic_deadline_exceeded counter
supercronic_deadline_exceeded{command="sleep 10",position="2",schedule="*/5 * * * * * *"} 2
# HELP supercronic_executions count of cron executions
# TYPE supercronic_executions counter
supercronic_executions{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 16
supercronic_executions{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 16
supercronic_executions{command="sleep 10",position="2",schedule="*/5 * * * * * *"} 1
# HELP supercronic_failed_executions count of failed cron executions
# TYPE supercronic_failed_executions counter
supercronic_failed_executions{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 16
# HELP supercronic_successful_executions count of successul cron executions
# TYPE supercronic_successful_executions counter
supercronic_successful_executions{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 16
supercronic_successful_executions{command="sleep 10",position="2",schedule="*/5 * * * * * *"} 1

aptible / supercronic

Prometheus metrics support #50