bensheldon opened 2 years ago
The ability to configure some logging output here would go a long way. We're running in AWS as an ECS Service: loglines once an hour that showed data like either successful_jobs_counter=1234 or successful_jobs_since_last_logline=123 would let us use CloudWatch to build some of the metrics that we'd find handy.

Alternatively, a "log job run metrics" option that let each job spit some metrics out to stdout, like name=Job1, exit_code=1, queue_latency=1234ms, execution_duration=123ms, would be great too.
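For what it's worth, something in this direction can be approximated today from the app side with an ActiveJob callback. This is only an illustrative sketch (the field names mirror the ones above and aren't an existing GoodJob feature):

```ruby
# Illustrative only: emit a key=value logline per job from ApplicationJob.
# queue_latency is approximated from ActiveJob's enqueued_at timestamp.
class ApplicationJob < ActiveJob::Base
  around_perform do |job, block|
    started_at = Time.current
    queue_latency_ms = ((started_at - job.enqueued_at.to_time) * 1000).round if job.enqueued_at
    block.call
    duration_ms = ((Time.current - started_at) * 1000).round
    Rails.logger.info(
      "job_metrics name=#{job.class.name} queue=#{job.queue_name} " \
      "queue_latency=#{queue_latency_ms}ms execution_duration=#{duration_ms}ms"
    )
  end
end
```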
Thanks for opening up #911. That is really helpful!
Also, I feel bad because now that I'm seeing what I wrote in this issue I'm like "hmm, what was I thinking?" 😓
Here are some thoughts that I want to bounce off of you. Sorry for the stream of consciousness:
- Schedulers are the smallest unit to track. A process that is executing jobs could have multiple schedulers. Though maybe the recently created Capsule is the correct unit to track. Though a capsule can have multiple schedulers, and a process could have multiple capsules if someone was initializing them manually. Maybe the Capsule is the thing that has stats on it, and multiple schedulers increment a shared Stats object owned by the Capsule. (I'm imagining in this case the stats are like "number of jobs finished/errored/discarded/retried since creation")
- Scheduler#stats has some good stuff. I wonder if that would be a thing to extend, and also insert into the Process.current_state. That would definitely surface a lot more details... though the Process state doesn't get flushed to the database currently except at startup.
- Telemetry is tricky and a lot to get right. There is getting the metrics right, and exposing them.

Here's a stream of consciousness back for you.
A lot of good work has been done by other people on other queue tools, and the "export metrics to Prometheus" is a common pattern. https://www.rabbitmq.com/monitoring.html#rabbitmq-metrics is a great example. It's probably not a word-for-word copy, but good_job seems to basically have the "Cluster" and "Node" and "Queue" pieces that rabbitmq does.
From the rabbitmq docs:
Most of the metrics represent point-in-time absolute values. Some represent activity over a recent period of time (for example, GC runs and bytes reclaimed). The latter metrics are most useful when compared to their previous values and historical mean/percentile values.
I'd expect "current" metrics to be available from the processes within GoodJob (including both counters and gauges), but I would want to have "Write metrics all the time to the database" turned off to keep my database traffic down. (I know you have some other open issues about performance in large/high-volume installations of good_job -> I'd avoid making that worse with metrics)
Once you've got the metrics figured out, there are a couple of ways to expose them:
Scheduler#stats and Process.current_state both look interesting. I'm mainly interested in queue depth and latency for "dashboard" level graphs, and some of those more detailed things would be helpful for diagnosing problems.
As I mentioned, we're running CloudWatch off of stdout logs, so if GoodJob goes the prometheus route I'd have to wire up something to get from that endpoint into CloudWatch.
The easiest 'v0' starting point may be to wire up the existing metrics to a Prometheus-compatible HTTP endpoint in the frontend piece. Maybe these should actually be available from GoodJob without the "GUI"?
I'd prefer instead to build the minimum-satisficeable solution that solves your needs. People who need Prometheus can ask for it and be involved in that design.
I'm imagining that the data that you would need for cloudwatch would be the same metrics for a Prometheus endpoint, so I think the steps here would be:
Yep, that all sounds perfectly reasonable. The metrics I'm interested in are essentially/currently the cluster-wide metrics that rabbitmq exposes here: https://www.rabbitmq.com/monitoring.html#cluster-wide-metrics We're using a single queue, but once we add more queues the "Individual Queue Metrics" will be interesting.
From poking around the good_job source, most of these could probably each just be a single SQL query (maybe with some index changes?) that is run when .metrics() is called. For me, the ability to schedule (every 5 minutes?) the contents of .metrics() getting logged to stdout would do the trick.
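To make that concrete, here's roughly what I'm picturing, using GoodJob's existing cron support. MetricsLoggerJob and the specific counts are hypothetical, and .metrics() doesn't exist yet:

```ruby
# config/environments/production.rb -- schedule a logging job every 5 minutes via GoodJob cron
config.good_job.enable_cron = true
config.good_job.cron = {
  metrics_logger: { cron: "*/5 * * * *", class: "MetricsLoggerJob" }
}

# app/jobs/metrics_logger_job.rb -- hypothetical job standing in for a future .metrics()
class MetricsLoggerJob < ApplicationJob
  def perform
    queued = GoodJob::Job.where(finished_at: nil).group(:queue_name).count
    Rails.logger.info("good_job_metrics queued_by_queue=#{queued.inspect}")
  end
end
```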
@ckdake could you pick 3 metrics from that list to start with? That would help me scope it down to an end-to-end implementation.
Also, some idle musings on this topic:

- Queues are complicated because of how they can be configured: `*`, or `queue1,queue2`, or with negative matching `*,-queue2`. Trying to count "capacity" at the individual queue-level could be hard. (I'm somewhat haunted by trying to explain this in #624)
- To give an example:
Here's one of my apps that has a `whenever` queue (high latency) alongside a low-latency queue, so in total `whenever` has 4 threads.

A great start would be letting me answer "how many items have been added to this queue?", "how many items are in this queue?", and "how many items have been processed in this queue?" A helpful next question would be "how long does it take to process each item in this queue?" and "are there any stuck jobs?" (a count of individual jobs "older" than some threshold).
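For concreteness, I'd imagine those counts coming from queries roughly like this. It's only a sketch against the good_jobs table via the GoodJob::Job model, not anything that exists in the gem today:

```ruby
window = 1.hour.ago

{
  # "how many items have been added to this queue?" (retries would inflate this)
  added:     GoodJob::Job.where("created_at >= ?", window).group(:queue_name).count,
  # "how many items are in this queue?" -- unfinished and due to run
  queued:    GoodJob::Job.where(performed_at: nil, finished_at: nil)
                         .where("COALESCE(scheduled_at, created_at) <= NOW()")
                         .group(:queue_name).count,
  # "how many items have been processed in this queue?"
  processed: GoodJob::Job.where("finished_at >= ?", window).group(:queue_name).count,
  # "are there any stuck jobs?" -- unfinished jobs older than some threshold
  stuck:     GoodJob::Job.where(finished_at: nil)
                         .where("COALESCE(scheduled_at, created_at) < ?", 30.minutes.ago)
                         .group(:queue_name).count
}
```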
Thanks. I think that's 5 😆 And apologies for being super pedantic here, but I want to surface the design questions to work through 😄

- `items in queues >= jobs enqueued`. Would you need to distinguish between "jobs enqueued" and "jobs+retries enqueued"?
- `total_latency = queuing_latency + runtime_latency`. Do you mean `total_latency`? Or the breakdown (queuing latency is useful for identifying a lack of capacity; total latency is useful for calculating capacity for meeting SLOs)? `COALESCE(performed_at, NOW()) - scheduled_at`.

It's a FIFO queue, so the first 3 would work :)
Not sure if this is much help, but I just glued this together from the scopes on GoodJob::Job to give a summary count for the jobs on each queue.
```sql
WITH cte_statuses AS (
SELECT
CASE
WHEN finished_at IS NOT NULL
THEN (CASE WHEN error IS NULL THEN 'succeeded' ELSE 'discarded' END)
WHEN COALESCE(scheduled_at, created_at) <= NOW()
THEN (CASE WHEN pg_locks.locktype IS NULL THEN 'queued' ELSE 'running' END)
WHEN (serialized_params ->> 'executions')::integer > 1
THEN 'retried'
ELSE
'scheduled'
END AS status,
queue_name,
priority
FROM good_jobs
LEFT JOIN pg_locks ON pg_locks.locktype = 'advisory'
AND pg_locks.objsubid = 1
AND pg_locks.classid = ('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(32)::int
AND pg_locks.objid = (('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(64) << 32)::bit(32)::int
WHERE retried_good_job_id IS NULL
)
SELECT status_list.status, queues.queue_name, COUNT(cte_statuses.status)
FROM
(VALUES ('succeeded'), ('discarded'), ('queued'), ('running'), ('retried'), ('scheduled')) AS status_list(status)
CROSS JOIN (SELECT DISTINCT queue_name FROM good_jobs) AS queues
LEFT JOIN cte_statuses
  ON cte_statuses.status = status_list.status
  -- also match on queue so each queue's row only counts its own jobs
  AND cte_statuses.queue_name = queues.queue_name
GROUP BY
status_list.status,
  queues.queue_name;
```
The results get fed into metrics in our monitoring tool for pattern analysis, reporting, alerting, etc. That, plus the duration information we can parse out of the ActiveJob logs, gets us what we need.
@jtannas fyi, rubygems.org does an interesting thing using the GoodJob::JobsFilter's queries: https://github.com/rubygems/rubygems.org/blob/63f6ba818aaa21138a82c7d57da8380d261c5193/app/jobs/good_job_statsd_job.rb#L4-L10

I want to share it, but also warn that I still consider those private objects, so they're subject to change; write a test 😅
@bensheldon It would be pretty useful to keep track of how many times a given job has been executed and, even more, to count retries and errors on a per-job basis. https://github.com/bensheldon/good_job/pull/984 added overall counts. What do you think about incorporating this into metrics?
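For what it's worth, something per job class can already be pulled straight from the good_jobs table. This is just a sketch (serialized_params is the ActiveJob payload, and the error column only reflects the most recent error for a job):

```ruby
# Jobs grouped by job class
GoodJob::Job.group(Arel.sql("serialized_params ->> 'job_class'")).count

# Jobs whose most recent execution recorded an error, by job class
GoodJob::Job.where.not(error: nil).group(Arel.sql("serialized_params ->> 'job_class'")).count
```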
I think we can use ActiveJob's instrumentation events for some metrics. For example, I believe execution time can be measured via perform.active_job. Because GoodJob is designed for ActiveJob, maybe those events can be utilized? https://guides.rubyonrails.org/active_support_instrumentation.html#active-job
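For example, a subscriber along these lines (just a sketch; the log format is made up) would capture per-job execution time without touching GoodJob internals:

```ruby
# Measure execution time from ActiveJob's perform.active_job event
ActiveSupport::Notifications.subscribe("perform.active_job") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job = event.payload[:job]
  Rails.logger.info(
    "active_job_perform name=#{job.class.name} queue=#{job.queue_name} duration=#{event.duration.round(1)}ms"
  )
end
```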
By the way, I wanted to expose metrics to Prometheus but realized that it's not so straightforward. Metrics such as execution time should be measured per process, so they can be scraped by Prometheus. But metrics such as queue length are global. It's strange to return those metrics from the individual process. An exporter may be needed for those metrics.
Just dropping in to add that we're also investigating options for reporting job latency & queued count, per queue, to CloudWatch, in the interest of alerting and autoscaling. For our use case, the latency of the last started job in the queue is fine.

For a little context, we run a few different queues (critical, high, default, deferrable) and we really want to know if our critical queue is getting behind by more than a few minutes, and we'd like for our deferrable queue to just be permanently growing, if that makes sense.
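Something in this direction would probably do it for us. It's only a sketch against the good_jobs table (GoodJob::Job model), measuring the age per queue of the oldest job that's due but not yet started:

```ruby
# Per queue: timestamp of the oldest job that is due but hasn't started yet
oldest_by_queue = GoodJob::Job
  .where(performed_at: nil, finished_at: nil)
  .where("COALESCE(scheduled_at, created_at) <= NOW()")
  .group(:queue_name)
  .minimum(Arel.sql("COALESCE(scheduled_at, created_at)"))

# Convert to an approximate queueing latency per queue
latency_by_queue = oldest_by_queue.transform_values { |t| Time.current - t }
```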
💡 I just noticed https://github.com/discourse/prometheus_exporter#goodjob-metrics.
Just to note, this isn't in the latest release version of prometheus_exporter (2.0.8).
Have been getting GitHub issues recently that would benefit from some basic stats and system info. It should be easy for someone to dump them to help debug issues.

Imagining a method like GoodJob.debug to be run from a console, with a handful of fields to start. Seems like it would also be useful to have some counters on a Scheduler.stats too.
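Roughly the shape I'm imagining (nothing below exists yet, and every field name is purely illustrative):

```ruby
module GoodJob
  # Hypothetical console helper; all fields are illustrative, not an existing API
  def self.debug
    {
      good_job_version: GoodJob::VERSION,
      ruby_version: RUBY_VERSION,
      rails_version: Rails.version,
      execution_mode: GoodJob.configuration.execution_mode,
      registered_processes: GoodJob::Process.count,          # good_job_processes table
      unfinished_jobs: GoodJob::Job.where(finished_at: nil).count
    }
  end
end
```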