Closed qw1mb0 closed 5 years ago
I'm interested in this. I came up with a quick and dirty PR to add a metric, supercronic_cron_success
, with the labels command, schedule, position
, which emits 1.0
on success and 0.0
on failure:
Fake crontab, elko
is a bogus command that doesn't exist:
sevagh:supercronic $ cat fake.crontab
*/1 * * * * * * echo "sevag test"
*/1 * * * * * * elko "sevag test"
Running supercronic:
sevagh:supercronic $ ./supercronic -prometheus-listen-address=":9091" ./fake.crontab
INFO[2019-05-21T16:22:33-07:00] read crontab: ./fake.crontab
INFO[2019-05-21T16:22:34-07:00] starting iteration=0 job.command="echo \"sevag test\"" job.position=0 job.schedule="*/1 * * * * * *"
INFO[2019-05-21T16:22:34-07:00] starting iteration=0 job.command="elko \"sevag test\"" job.position=1 job.schedule="*/1 * * * * * *"
INFO[2019-05-21T16:22:34-07:00] sevag test channel=stdout iteration=0 job.command="echo \"sevag test\"" job.position=0 job.schedule="*/1 * * * * * *"
INFO[2019-05-21T16:22:34-07:00] /bin/sh: elko: command not found channel=stderr iteration=0 job.command="elko \"sevag test\"" job.position=1 job.schedule="*/1 * * * * * *"
INFO[2019-05-21T16:22:34-07:00] job succeeded iteration=0 job.command="echo \"sevag test\"" job.position=0 job.schedule="*/1 * * * * * *"
ERRO[2019-05-21T16:22:34-07:00] error running command: exit status 127 iteration=0 job.command="elko \"sevag test\"" job.position=1 job.schedule="*/1 * * * * * *"
The metrics endpoint:
supercronic_cron_success{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 1
supercronic_cron_success{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 0
If the maintainers are interested in this I'd be happy to take feedback.
@sevagh Can you add a metric with information about the duration of the task?
Sure. This gets a bit messy though. Duration could be the value of a gauge:
supercronic_cron_success{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 1
supercronic_cron_success{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 0
supercronic_cron_duration{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 0.3
supercronic_cron_duration{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 0.4
Also, one thing to note is that these are the "latest" values - latest success status, latest duration, etc. You would need some PromQL expressions to look at historical durations.
It is very cool! Can you make a PR? @krallin what do you think about it?
It is very cool! Can you make a PR?
I didn't actually implement the duration part - that was just an example paste. Still figuring it out.
It is very cool! Can you make a PR? @krallin what do you think about it?
I think this is a good idea and certainly worthwhile!
Looking at @sevagh 's code, here are a few things I would suggest with regard to what is being monitored:
runThisJob
closure defined in StartJob
then defer
decrementing it.runThisJob
closure defined in StartJob
.monitorJob
when the deadline is hit.This brings me to one concern I have with your current implementation, which is the "last execution" metric you're exposing right now. I'm not a big fan of this approach.
Here's why: "last execution" isn't a well-defined concept when you might a) have overlapping executions of jobs and b) your jobs may execute multiple times (or no times at all) between Prometheus scrapes.
For example, if your job executes every 30 seconds, but you poll every 2 minutes, then looking at the last execution reports in Prometheus, an end-user could be missing errors completely. If we use a failure counter instead, end-users can use e.g. the increase()
function to accurately report on failures instead.
To summarize: I'm concerned that reporting on last execution provides users with a convenient footgun that might come back to bite them later on. I would prefer to encourage them to take the time to understand what Supercronic is actually doing, even if this means a little more setup work to use those metrics.
Let me know what you think?
As far as the code itself is concerned, here are a few minor comments:
Separately, we should check how much bigger the binary becomes with this additional dependency. This isn't something we have made guarantees about so far, but I'd like to know what we're getting into :)
I like all of the suggestions and will work on them ASAP.
Prometheus has an opinionated exposition format, and sometimes there's a bit of mental gymnastics required to get metrics to align with the KPIs of a specific codebase. I appreciate the detailed suggestions.
These are looking much better with @krallin's suggestions:
# HELP supercronic_currently_running count of currently running cron executions
# TYPE supercronic_currently_running gauge
supercronic_currently_running{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 0
supercronic_currently_running{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 0
supercronic_currently_running{command="sleep 10",position="2",schedule="*/5 * * * * * *"} 1
# HELP supercronic_deadline_exceeded count of exceeded deadline cron executions
# TYPE supercronic_deadline_exceeded counter
supercronic_deadline_exceeded{command="sleep 10",position="2",schedule="*/5 * * * * * *"} 2
# HELP supercronic_executions count of cron executions
# TYPE supercronic_executions counter
supercronic_executions{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 16
supercronic_executions{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 16
supercronic_executions{command="sleep 10",position="2",schedule="*/5 * * * * * *"} 1
# HELP supercronic_failed_executions count of failed cron executions
# TYPE supercronic_failed_executions counter
supercronic_failed_executions{command="elko \"sevag test\"",position="1",schedule="*/1 * * * * * *"} 16
# HELP supercronic_successful_executions count of successul cron executions
# TYPE supercronic_successful_executions counter
supercronic_successful_executions{command="echo \"sevag test\"",position="0",schedule="*/1 * * * * * *"} 16
supercronic_successful_executions{command="sleep 10",position="2",schedule="*/5 * * * * * *"} 1
Good day.
Thank you very much for the supercronic - this is really very cool software.
However, we encountered a problem that we can’t find out if cron ended up with an error.
Do you plan to support metrics that prometheus can pick up? With information about the finished/failed tasks.