bensheldon / good_job

Multithreaded, Postgres-based, Active Job backend for Ruby on Rails.
https://goodjob-demo.herokuapp.com/
MIT License

Load Dashboard nav count badges asynchronously #1057

Open bensheldon opened 1 year ago

bensheldon commented 1 year ago

When loading the GoodJob dashboard, querying for navigation badge counts is the largest source of request latency. I think this could be improved in 2 ways:

  1. Moving the badge counts into an asynchronous request and creating a lightweight stimulus controller that would fetch and insert the values into the HTML. The Stimulus controller should likely live at the body level so that it controls the entire page and can make a single request (or maybe 2, one for the global nav, and one specifically for the content).
  2. Writing some gnarly Active Record to combine what is currently 13 (!) queries for individual counts into a single query (or at least fewer). Ideally it doesn't look like this unmaintainable block of SQL from a previous project. I'm not sure if it would involve CTEs or Unions, but ideally as much as possible stays in Active Record and doesn't drop down to SQL.

jeremyf commented 10 months ago

First, thank you for writing up this succinct issue and for maintaining a kick-ass ActiveJob adapter. With GoodJob we've noticed improved stability compared to other options; we don't lose jobs (as in, evaporate with no trace) that were in process during a server upgrade or what have you.

My team has definitely noticed the slow-down; and @orangewolf asked that I hop over and find if there's a filed issue (or write one).

I've been tasked with looking into this, along with how we would best clean up our completed jobs. We've observed two issues when we don't clean up completed jobs:

  1. We see overall performance issues.
  2. The GoodJob dashboard becomes unresponsive.

jqr commented 9 months ago

I haven't had a chance to do it yet, but I think some CTEs that limit the results to something reasonable are an easyish first step.

So instead of showing 10M, what about 10k+? Users could configure whatever that limit ought to be, but 10k seems like a very reasonable first guess that should usually be plenty fast.
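
A minimal sketch of that capped-count idea (the helper name and the 10k default are assumptions, not GoodJob's API): Active Record turns relation.limit(n).count into a COUNT(*) over a limited subquery, so the scan touches at most n rows, and the view renders a capped figure.

```ruby
# Hypothetical helper. The dashboard would run something like:
#   count = scope.limit(cap + 1).count
# (Rails wraps a limited relation in a COUNT subquery, so at most
# cap + 1 rows are scanned), then format the result for display.
def format_capped_count(count, cap = 10_000)
  count > cap ? "#{cap}+" : count.to_s
end

format_capped_count(42)     # => "42"
format_capped_count(10_001) # => "10000+"
```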

Alternatively, Rails 7.1 can now do async count queries, issuing a lot of these in parallel. But that only improves the situation on the margins; when counts are egregiously over reasonable limits, it just hammers the database with many simultaneous queries.
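
For reference, Rails 7.1's async_count returns a promise you resolve later with .value (it requires config.active_record.async_query_executor to be set). A plain-Ruby analogue of the fan-out, with a stub standing in for the slow COUNT(*):

```ruby
# Stand-in for a slow COUNT(*) query; returns a dummy value.
def slow_count(name)
  sleep 0.01
  name.to_s.length
end

# Kick off the counts concurrently, collect them at render time.
# In Rails 7.1 this would be `relation.async_count` + `.value`.
threads = {
  jobs:    Thread.new { slow_count(:jobs) },
  batches: Thread.new { slow_count(:batches) }
}
counts = threads.transform_values(&:value)
```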

bensheldon commented 9 months ago

For anyone wanting to make an attempt at this, please do the global navbar first (Jobs, Batches, Cron, Processes) and open a PR. I much prefer small, even incomplete implementations for discussion.

jnunemaker commented 7 months ago

> Writing some gnarly Active Record to combine what is currently 13 (!) queries for individual counts into a single (or at least less) queries. Ideally it doesn't look like this unmaintainable block of sql from a previous project. I'm not sure if it would involve CTEs or Unions, but ideally as much as possible stays in Active Record and doesn't drop down to SQL.

CTEs could help. I've also used FILTER to do multiple different counts in one query over the same data.
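
For illustration, the FILTER approach collapses several per-status counts into one statement. A sketch that just builds the SQL string (the status predicates here are made up, not GoodJob's real scope logic):

```ruby
# Hypothetical status predicates; the real ones live in the gem's scopes.
STATUS_PREDICATES = {
  "queued"   => "performed_at IS NULL",
  "running"  => "performed_at IS NOT NULL AND finished_at IS NULL",
  "finished" => "finished_at IS NOT NULL"
}.freeze

# One round trip, one scan: COUNT(*) FILTER (WHERE ...) per status.
def filtered_counts_sql(table = "good_jobs")
  selects = STATUS_PREDICATES.map do |name, predicate|
    "COUNT(*) FILTER (WHERE #{predicate}) AS #{name}"
  end
  "SELECT #{selects.join(', ')} FROM #{table}"
end
```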

I always approach problems like this from a fewer and faster standpoint.

  1. Fewer - Too many count queries as outlined above. I think we can move a few of them into one query for fewer trips across the wire and through the data on the db side.
  2. Faster - Counting too many records. As @jqr said, the biggest improvement would be to limit the number of records you'll count. Counting 100k+ rows several times is never fast.

I'll put this on my list to take a look at. All the counts have made our UI pretty slow even with me reducing the amount of jobs data we keep around.

FWIW, I started on this a while back but then stopped. The main reason was that speeding it up really required duplicating a lot of the logic that's currently wrapped up nicely in scopes. I didn't really want to duplicate it and didn't have a nice way to avoid it. I'll take a look again and see if Rails CTEs or something like that could help with that problem now.

jnunemaker commented 7 months ago

Did a little work on this today. I did get this working:

[Screenshot, 2024-03-04: dashboard with counts limited to 1k]

I queued up 400k jobs; note that many of the counts are limited to 1k. I'll have to test it out in prod to see how much it really helps (or not).

Some of the counts are easy to limit (total, batches, filters), but the grouped counts have no way to speed up -- mostly because of how dynamic the query can be (by queue, by job class in a serialized param, general search), and because count + group makes it impossible to avoid scanning all the counted data. Moving to N+1 counts could help with that, but then you have way more trips over the network.

A couple ideas I'm curious if there is any feedback on:

  1. Open PR with current tweaks to adjust top navbar and filter bar.
  2. config.fast_mode = true - open to a better name 😆. If this is true, skip all the counts other than the graph. That gets rid of most of the slowness but still gives you some idea of what's going on. I can throw up a PR to respond to; shouldn't take long. We could likely use the counts from the graph query to populate the job class select counts.
  3. Make some new class responsible for counting all this data. Move to some big CTE or some other attempt at counting over the same data repeatedly as it does now.

I'm leaning toward 1 or whipping up a quick version of 2, especially now that I have all the counts in appsignal anyway.

@bensheldon any gut feelings or preferences?

bensheldon commented 7 months ago

@jnunemaker I appreciate you working on this. I'm still thinking about this, but wanted to share the direction of those thoughts.

I lean towards decoupling things, with the code design question being "how easy is it to stop making the queries in this controller action, and move the queries to a different (asynchronous) controller action?" So if reusing the counts makes things more coupled right now, I don't think I want to go in that direction.

That written, I encourage you to open up a PR so I'm not speculating so much 😊

On options/naming, I would want a descriptive name, as clunky as that may be, e.g. config.dashboard_reused_counts or config.dashboard_inaccurate_counts. But I'd rather not add an option at all and instead just document something like "If you have more than 100k jobs, counts will be less accurate in deference to performance" (though I don't think you're proposing to make them less accurate).

If we did decide to make them less accurate, I'd want to explore something like this:

> Michael Fuhr came up with a clever trick to run EXPLAIN for a query and parse its output.
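
That trick in miniature: Postgres' EXPLAIN output embeds the planner's row estimate, which can stand in for an exact COUNT(*) on huge tables. A hedged sketch (the plan string and helper name are illustrative; in Rails the plan text would come from something like connection.explain):

```ruby
# Pull the planner's row estimate ("rows=NNN") out of EXPLAIN output.
# This is an estimate from table statistics, not an exact count.
def estimated_row_count(explain_output)
  explain_output[/rows=(\d+)/, 1]&.to_i
end

plan = "Seq Scan on good_jobs  (cost=0.00..4310.00 rows=200000 width=8)"
estimated_row_count(plan) # => 200000
```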

Also, just because this thread is getting long, I want to offer that I'm not seeking complex/clever solutions (as fun as ^^ may be). I am just hoping someone else can do (earlier/faster than I can get to it on my todo list) what I think the first step is: extraction and refactoring, which makes all the fun stuff much easier.

One PR already that I'm very appreciative of, but seems to have been bogged down after feedback: https://github.com/bensheldon/good_job/pull/1231

tcannonfodder commented 6 months ago

An idea that I had for how we could manage this:

GoodJob::Stats model

This could be a model with the following:

We could then UPSERT (or use UPSERT-like behavior) to increment/decrement the stat, which we'd use in the dashboard.

The key being a basic (but unique) string would allow us to create metrics dynamically; such as the following for a single successful job execution:

But that may be too complex/costly. At the very least, having the top-level job counts + queue counts would be useful.

And I think that folks would understand if there was an info message explaining that these metrics are cached to make the dashboard performant and may not be exactly precise. IMO, most folks are either looking for:

And there could also be a rake task to clear/rebuild the stats if needed.

> I'm not sure if it would involve CTEs or Unions, but ideally as much as possible stays in Active Record and doesn't drop down to SQL

And I know you'd mentioned this (and it's valid!), but we could add database triggers to improve performance versus traditional callbacks if needed.

jnunemaker commented 6 months ago

@bensheldon 👍 I'll get back to this and whip something together when I have time.

I think the CTE route could work because we could select 1 for each of the various AR scopes with limits and then do counts on those expressions to reduce the number of counted rows.

We could then UPSERT (or use UPSERT-like behavior) to increment/decrement the stat; which we use in the dashboard.

@tcannonfodder doing this around every job would be too intense from a number-of-writes standpoint, especially if we're incrementing multiple things (1 * N network calls for every job). Probably the only way to handle this would be to buffer and write results on a regular basis (N writes per process per interval), but I doubt Ben wants to get into that. Unless I'm misunderstanding something.
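
For what it's worth, the buffering idea might look roughly like this (a hypothetical in-process sketch, not a proposal for the gem; flush would perform the actual aggregated UPSERTs):

```ruby
# Accumulate increments in memory; flush one aggregated write per key
# per interval instead of a row write per job.
class CounterBuffer
  def initialize
    @mutex = Mutex.new
    @counts = Hash.new(0)
  end

  def increment(key, by = 1)
    @mutex.synchronize { @counts[key] += by }
  end

  # Called on a timer. Returns what would be UPSERTed, e.g.
  # INSERT ... ON CONFLICT (key) DO UPDATE SET value = value + excluded.value
  def flush
    @mutex.synchronize do
      pending = @counts
      @counts = Hash.new(0)
      pending
    end
  end
end
```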

I just know Resque did something like this, and part of my fixes when scaling it in various parts of GitHub was using a metrics service (statsd/Datadog) for metrics instead of Redis incrs. I noop'd the incrs and we saw dramatic improvements. I'd assume the same or worse for Postgres.

Anything that ends up 1:1 or 1:N with the number of jobs will slow enqueueing down. Even a trigger is an extra hit.

tcannonfodder commented 6 months ago

Shoot; yeah. :/ It was a long-shot.

bensheldon commented 6 months ago

Thank you both for the ideas! I'm in agreement with @jnunemaker that keeping a separate stats table would probably be too hot for single rows to be kept updated. And maybe I'm being overly optimistic, but I want to imagine this is the sort of thing a relational database should be good at 😰

Fortunately we're in a good place right now where the counts have been extracted into two controller actions, so we have a bit more room to be creative: https://github.com/bensheldon/good_job/blob/1acb7218a35b0ff868be8f26cf9407672b6a4c5d/app/controllers/good_job/metrics_controller.rb

tcannonfodder commented 6 months ago

Of course! Good Job is doing a Great Job (👈 😎👈) and I want to do my part to help out!

Piggybacking off @jnunemaker's idea: thoughts on using a materialized view based on the CTE proposed for the counts? That could reduce how often the database needs to be hit to generate those stats, and allow people to customize how often the view is refreshed if needed.
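
For concreteness, the materialized-view shape might look like this (illustrative DDL only; the view name, predicates, and refresh mechanism are all assumptions, not a schema proposal):

```sql
-- Hypothetical: precomputed dashboard counts
CREATE MATERIALIZED VIEW good_job_dashboard_counts AS
SELECT
  COUNT(*) FILTER (WHERE finished_at IS NULL)     AS unfinished,
  COUNT(*) FILTER (WHERE finished_at IS NOT NULL) AS finished
FROM good_jobs;

-- Refreshed on whatever cadence the user configures,
-- e.g. from a cron entry or a recurring job:
REFRESH MATERIALIZED VIEW good_job_dashboard_counts;
```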