bensheldon / good_job

Multithreaded, Postgres-based, Active Job backend for Ruby on Rails.
https://goodjob-demo.herokuapp.com/
MIT License

Add debugging and telemetry stats #532

Open bensheldon opened 2 years ago

bensheldon commented 2 years ago

Have been getting GH Issues recently that would benefit from some basic stats and system info. It should be easy for someone to dump them to help debug issues.

Imagining a method like GoodJob.debug to be run from a console, with the following fields to start:

Seems like it would also be useful to have some counters on a Scheduler.stats too:
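Purely as an illustration of the kind of console dump being imagined here (a sketch, not GoodJob API: GoodJob.debug does not exist, and every field below is a guess rather than an actual GoodJob attribute):

# Hypothetical sketch only; none of these keys are real GoodJob fields.
module GoodJob
  def self.debug
    {
      good_job_version: GoodJob::VERSION,
      rails_version: Rails::VERSION::STRING,
      ruby_version: RUBY_VERSION,
      postgres_version: ActiveRecord::Base.connection.select_value("SHOW server_version"),
      db_pool_size: ActiveRecord::Base.connection_pool.size,
      unfinished_jobs: GoodJob::Job.where(finished_at: nil).count # column name from the good_jobs schema
    }
  end
end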

ckdake commented 1 year ago

The ability to configure some logging output here would go a long way. We're running in AWS as an ECS Service: log lines once an hour that showed data like either successful_jobs_counter=1234 or successful_jobs_since_last_logline=123 would let us use CloudWatch to build some of the metrics that we'd find handy.

Alternatively, a "log job run metrics" option that let each job spit some metrics out to stdout, like name=Job1, exit_code=1, queue_latency=1234ms, execution_duration=123ms, would be great too.
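For what it's worth, a rough sketch of that second idea done purely with an Active Job callback rather than inside GoodJob (illustrative only; the field names loosely mirror the example above):

class ApplicationJob < ActiveJob::Base
  # Emit one metrics line per job execution to the Rails log (stdout on ECS).
  around_perform do |job, block|
    enqueued_at = job.enqueued_at
    enqueued_at = Time.iso8601(enqueued_at) if enqueued_at.is_a?(String) # older Rails serializes this as a string
    queue_latency_ms = enqueued_at ? ((Time.current - enqueued_at) * 1000).round : nil

    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    block.call
    execution_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round

    Rails.logger.info(
      "job_metrics name=#{job.class.name} queue=#{job.queue_name} " \
      "queue_latency=#{queue_latency_ms}ms execution_duration=#{execution_ms}ms"
    )
  end
end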

bensheldon commented 1 year ago

Thanks for opening up #911. That is really helpful!

Also, I feel bad because now that I'm seeing what I wrote in this issue I'm like "hmm, what was I thinking?" 😓

Here are some thoughts that I want to bounce off of you. Sorry for the stream of consciousness:

ckdake commented 1 year ago

Telemetry is tricky and a lot to get right. There is getting the metrics right, and then exposing them. Here's a stream of consciousness back for you.

A lot of good work has been done by other people on other queue tools, and "export metrics to Prometheus" is a common pattern. https://www.rabbitmq.com/monitoring.html#rabbitmq-metrics is a great example. It's probably not something to copy word-for-word, but good_job seems to basically have the "Cluster", "Node", and "Queue" pieces that rabbitmq does.

From the rabbitmq docs:

Most of the metrics represent point-in-time absolute values. Some, represent activity over a recent period of time (for example, GC runs and bytes reclaimed). The latter metrics are most useful when compared to their previous values and historical mean/percentile values.

I'd expect "current" metrics to be available from the processes within GoodJob (including both counters and gauges), but I would want to have "Write metrics all the time to the database" turned off to keep my database traffic down. (I know you have some other open issues about performance in large/high-volume installations of good_job -> I'd avoid making that worse with metrics)

Once you've got the metrics figured out, there are a couple of ways to expose them:

  1. Exporting to a prometheus compatible endpoint that prometheus can poll would probably make a lot of people happy. When something polls that endpoint, it would calculate what it needs to (e.g. the 'flush' as needed) and return the results.
  2. Supporting writing the metrics on a schedule, via a cron (maybe as a job itself?), to stdout would make an overlapping set of people happy (see the sketch after this list). This would need to be configurable, and because time is hard, including a timestamp with the metrics would be important and the exact moment the log line is written, less so. It wouldn't be great for these to get bogged down in situations with high load/throughput/queue pressure, so maybe not "as a job".
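To make option 2 concrete, a sketch that leans only on concurrent-ruby (already a GoodJob dependency); collect_metrics here is a stand-in for whatever GoodJob ends up exposing, not a real method:

require "concurrent"
require "time"

# Stand-in for whatever metrics GoodJob ends up exposing.
def collect_metrics
  { queued: GoodJob::Job.where(finished_at: nil, performed_at: nil).count } # placeholder query
end

# Log a timestamped metrics line on a fixed interval, independent of job throughput.
metrics_logger = Concurrent::TimerTask.new(execution_interval: 60) do
  line = collect_metrics.map { |key, value| "#{key}=#{value}" }.join(" ")
  $stdout.puts "good_job_metrics at=#{Time.now.utc.iso8601} #{line}"
end
metrics_logger.execute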

Scheduler#stats and Process.current_state both look interesting. I'm mainly interested in queue depth and latency for "dashboard" level graphs, and some of those more detailed things would be helpful for diagnosing problems.

As I mentioned, we're running CloudWatch off of stdout logs, so if GoodJob goes the prometheus route I'd have to wire up something to get from that endpoint into CloudWatch.

The easiest 'v0' starting point may be to wire up the existing metrics to a Prometheus-compatible HTTP endpoint in the frontend piece. Maybe these should actually be available from GoodJob without the "GUI"?

bensheldon commented 1 year ago

> The easiest 'v0' starting point may be to wire up the existing metrics to a Prometheus-compatible HTTP endpoint in the frontend piece. Maybe these should actually be available from GoodJob without the "GUI"?

I'd prefer instead to build the minimum-satisficeable solution that solves your needs. People who need Prometheus can ask for it and be involved in that design.

I'm imagining that the data that you would need for cloudwatch would be the same metrics for a Prometheus endpoint, so I think the steps here would be:

  1. Identify the metrics
  2. Figure out the implementation for metrics.
    • If they require storage or are heavy to fetch (e.g. making database calls), I might ask whether there is an alternative.
    • I'm imagining they would need to be process-local, and the metrics system would need to sum/aggregate them to generate the cluster view (a sketch of that follows below this list).
  3. Then identify how to do the transport (e.g. logging, emitting, fetching)
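As a rough illustration of that process-local bullet (nothing here is GoodJob code, and the counter names are made up), in-process counters could be as simple as:

require "concurrent"

# Process-local counters; an exporter or log line would sum these across
# processes to produce the cluster-wide view.
class LocalCounters
  COUNTERS = Concurrent::Map.new

  def self.increment(name, by = 1)
    COUNTERS.compute_if_absent(name) { Concurrent::AtomicFixnum.new(0) }.increment(by)
  end

  def self.snapshot
    result = {}
    COUNTERS.each_pair { |name, counter| result[name] = counter.value }
    result
  end
end

LocalCounters.increment(:succeeded_executions)
LocalCounters.snapshot # => { succeeded_executions: 1 }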

ckdake commented 1 year ago

Yep, that all sounds perfectly reasonable. The metrics I'm interested in are essentially/currently the cluster-wide metrics that rabbitmq exposes here: https://www.rabbitmq.com/monitoring.html#cluster-wide-metrics We're using a single queue, but once we add more queues the "Individual Queue Metrics" will be interesting.

From poking around good_job source, most of these could probably each just be a single SQL query (maybe with some index changes?) that is run when .metrics() is called, and for me, the ability to schedule (every 5 minutes?) the contents of .metrics() getting logged to stdout would do the trick.
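Something in that direction, as a very rough sketch (not a proposed API; the column names come from the good_jobs schema, and "running" is deliberately left out because GoodJob tracks running jobs via advisory locks rather than a column):

# Hypothetical snapshot; each entry is a single SQL count against good_jobs.
def queue_metrics(now: Time.current)
  jobs = GoodJob::Job.all
  {
    enqueued_total: jobs.count,
    queued:         jobs.where(finished_at: nil, performed_at: nil)
                        .where("COALESCE(scheduled_at, created_at) <= ?", now).count,
    scheduled:      jobs.where(finished_at: nil).where("scheduled_at > ?", now).count,
    finished_total: jobs.where.not(finished_at: nil).count,
    errored_total:  jobs.where.not(error: nil).count
  }
end

Logging the result of something like that every 5 minutes (e.g. from the timer sketch earlier in the thread) would cover the stdout/CloudWatch case.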

bensheldon commented 1 year ago

@ckdake could you pick 3 metrics from that list to start with? That would help me scope it down to an end-to-end implementation.

Also, some idle musings on this topic:

To give an example:

[Screenshot: 2023-04-21 at 8:29 AM]

Here's one of my apps that has:

So that means I have total:

ckdake commented 1 year ago

A great start would be letting me answer "how many items have been added to this queue?" "how many items are in this queue?" and "how many items have been processed in this queue?". A helpful next question would be "how long does it take to process each item in this queue?" and "are there any stuck jobs?" (count of individual jobs "older" than some threshold).

bensheldon commented 1 year ago

Thanks. I think that's 5 😆 And apologies for being super pedantic here, but want to surface the design questions to work through 😄

ckdake commented 1 year ago

It's a FIFO queue, so the first 3 would work :)

  1. For "how many in the queue": we are not (currently) using good_job to schedule jobs to happen at a particular time, so it's the "waiting to execute" count that will be most interesting to me for starters (a gauge metric).
  2. I'm expecting "how many added" to be a total counter metric, increasing as each job is added, not coupled to a time period. Given that we're processing 100s of jobs and not 1000000000s of jobs, this is fine for our use case.
    • Future scheduled jobs would be entirely separate, and only counted here when they are queued for execution.
    • "retries" is a whole additional level of metrics. I would at some point like to know error/retry rates, but that's probably round 2.
  3. I'm expecting "how many processed" to be a total counter too.
  4. "Latency" -> good question. I'm interested in all of the pieces; a good first version would be "time from when a job is enqueued for execution until the job is completed". Most of our jobs are super quick and are instrumented already with logs for job started -> job finished, so the piece good_job knows is how long they were queued before starting. Mean, p95, and max would all be helpful, and this would need a time range; if that time range approximately matched the polling interval of the monitoring tool, that might be ideal. Many tools use a very short polling interval (Prometheus can do continual scraping, and requires special consideration over 2 minutes to avoid gaps in graphs), but I'd be good with a 5-minute rolling window. I'd expect the latency to be for "completed jobs", so for your null example, that wouldn't be included in this metric. If the queue is piling up and not executing, I'd expect that to land in the first two metrics as a visible trend. (A sketch follows after this list.)
  5. Stuck. Yep, complicated to get right, but alerting on the max individual latency would do it. I'm not sure of a great design to surface ALL latencies for ALL jobs from good_job to a monitoring tool, but surfacing a max_latency_in_past_X_timeframe from good_job could work.
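A sketch of items 4 and 5 above (again, not a proposed GoodJob API; the column names are from the good_jobs schema and the 5-minute window is arbitrary):

def latency_metrics(window: 5.minutes)
  # Queue latency in seconds (enqueue/schedule until start of execution) for
  # jobs that began executing within the window.
  latencies = GoodJob::Job
    .where("performed_at >= ?", window.ago)
    .pluck(Arel.sql("EXTRACT(EPOCH FROM (performed_at - COALESCE(scheduled_at, created_at)))"))
    .sort

  # Age in seconds of the oldest job that is due but has not started executing:
  # a crude "stuck job" signal to alert on.
  oldest_pending_age = GoodJob::Job
    .where(performed_at: nil, finished_at: nil)
    .where("COALESCE(scheduled_at, created_at) <= ?", Time.current)
    .pick(Arel.sql("EXTRACT(EPOCH FROM (NOW() - MIN(COALESCE(scheduled_at, created_at))))"))

  {
    mean_queue_latency_s: latencies.empty? ? nil : latencies.sum / latencies.size,
    p95_queue_latency_s:  latencies.empty? ? nil : latencies[(latencies.size * 0.95).floor.clamp(0, latencies.size - 1)],
    max_queue_latency_s:  latencies.last,
    oldest_pending_age_s: oldest_pending_age
  }
end

Grouping those queries by queue_name would give the per-queue versions.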

jtannas commented 1 year ago

Not sure if this is much help, but I just glued this together from the scopes on GoodJob::Job to give a summary count for the jobs on each queue.

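-- Per-queue, per-status job counts; the pg_locks join identifies advisory-locked (i.e. currently running) jobs.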
WITH cte_statuses AS (
  SELECT
    CASE
      WHEN finished_at IS NOT NULL
        THEN (CASE WHEN error IS NULL THEN 'succeeded' ELSE 'discarded' END)
      WHEN COALESCE(scheduled_at, created_at) <= NOW()
        THEN (CASE WHEN pg_locks.locktype IS NULL THEN 'queued' ELSE 'running' END)
      WHEN (serialized_params ->> 'executions')::integer > 1
        THEN 'retried'
      ELSE
        'scheduled'
    END AS status,
    queue_name,
    priority
  FROM good_jobs
    LEFT JOIN pg_locks ON pg_locks.locktype = 'advisory'
    AND pg_locks.objsubid = 1
    AND pg_locks.classid = ('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(32)::int
    AND pg_locks.objid = (('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(64) << 32)::bit(32)::int
  WHERE retried_good_job_id IS NULL
)
SELECT status_list.status, queues.queue_name, COUNT(cte_statuses.status)
FROM 
  (VALUES ('succeeded'), ('discarded'), ('queued'), ('running'), ('retried'), ('scheduled')) AS status_list(status)
  CROSS JOIN (SELECT DISTINCT queue_name FROM good_jobs) AS queues
  LEFT JOIN cte_statuses ON cte_statuses.status = status_list.status
    AND cte_statuses.queue_name = queues.queue_name -- also join on queue so counts are per-queue
GROUP BY
  status_list.status,
  queues.queue_name;

The results get fed into metrics in our monitoring tool for pattern analysis, reporting, alerting, etc. That, plus the duration information we can parse out of the ActiveJob logs, gets us what we need.

bensheldon commented 1 year ago

@jtannas fyi, rubygems.org does an interesting thing using the GoodJob::JobsFilter's queries: https://github.com/rubygems/rubygems.org/blob/63f6ba818aaa21138a82c7d57da8380d261c5193/app/jobs/good_job_statsd_job.rb#L4-L10

I want to share it, but also warn that I still consider those private objects, so they're subject to change; write a test 😅

AndersGM commented 1 year ago

@bensheldon It would be pretty useful to keep track of how many times a given job has been executed and, even more, to count retries and errors on a per-job basis. https://github.com/bensheldon/good_job/pull/984 added overall counts. What do you think about incorporating this into metrics?

shouichi commented 9 months ago

I think we can use ActiveJob's instrumentation events for some metrics. For example, I believe execution time can be measured via perform.active_job. Because GoodJob is designed for ActiveJob, maybe they can be utilized?

https://guides.rubyonrails.org/active_support_instrumentation.html#active-job.
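For example, a minimal subscriber (using only the documented perform.active_job payload; the log format itself is made up):

ActiveSupport::Notifications.subscribe("perform.active_job") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job = event.payload[:job]
  # event.duration is the execution time in milliseconds.
  Rails.logger.info(
    "job_performed name=#{job.class.name} queue=#{job.queue_name} duration=#{event.duration.round(1)}ms"
  )
end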

By the way, I wanted to expose metrics to Prometheus but realized that it's not so straightforward. Metrics such as execution time should be measured per process, so they can be scraped by Prometheus. But metrics such as queue length are global. It's strange to return those metrics from the individual process. An exporter may be needed for those metrics.

bagedevimo commented 9 months ago

Just dropping in to add that we're also investigating options for reporting job latency & queued count, per queue, to CloudWatch in the interest of alerting and autoscaling. For our use case, the latency of the last started job in the queue is fine.

For a little context, we run a few different queues (critical, high, default, deferrable) and we really want to know if our critical queue is getting behind by more than a few minutes, and we'd like for our deferrable queue to just be permanently growing, if that makes sense.


simi commented 7 months ago

💡 I just noticed https://github.com/discourse/prometheus_exporter#goodjob-metrics.

thatsmydoing commented 6 months ago

> 💡 I just noticed https://github.com/discourse/prometheus_exporter#goodjob-metrics.

Just to note, this isn't in the latest release version of prometheus_exporter (2.0.8).