bensheldon / good_job

Multithreaded, Postgres-based, Active Job backend for Ruby on Rails.

GoodJob is using a lot of memory / RAM #613

Closed. rgaufman closed this issue 1 year ago.

rgaufman commented 2 years ago

Hi there,

I have my good_job set up like this:

```rb
config.good_job = {
  preserve_job_records: false,
  retry_on_unhandled_error: false,
  on_thread_error: ->(exception) { Airbrake.notify(exception) },
  execution_mode: :external,
  queues: '*',
  max_threads: 10,
  poll_interval: 15,
  shutdown_timeout: 25,
  enable_cron: true,
  logger: Rails.logger
}
```

Here is a screenshot of it in action:

[Screenshot: 2022-05-29 at 20 01 51]

There are rarely more than 3-4 jobs running in parallel.

Compared to Sidekiq running the same jobs, it seems to take almost double the RAM, and it looks like this in top:

[Screenshot: 2022-05-29 at 20 02 55]

Any ideas what could be responsible for this? Should I disable the dashboard? How would I troubleshoot and reduce RAM usage?

rgaufman commented 2 years ago

I noticed some strange behaviour: when starting Puma and good_job, both jump to about 30% RAM use. Puma then drops quite quickly and settles at around 9%, while good_job stays at around 29% (Sidekiq previously used roughly the same RAM as Puma with the same workers). This is 10 minutes later:

[Screenshot: 2022-05-29 at 20 38 41]

(This is with the dashboard disabled and preserve_job_records: false)

Any ideas on how to troubleshoot this and figure out why so much RAM is being used?

bensheldon commented 2 years ago

@rgaufman thanks for sharing this! I haven't done much memory profiling of GoodJob, though I'm surprised that GoodJob would perform significantly differently than Puma. Take the following as speculation (because I haven't looked)...

GoodJob's initialization process is pretty simple:

  1. Boots the Rails framework: https://github.com/bensheldon/good_job/blob/8b4fc67f0f78d3338c5df8e13851935ee6ca4812/lib/good_job/cli.rb#L158-L159
  2. Initializes a bunch of Concurrent Ruby objects (ThreadExecutors, Timers): https://github.com/bensheldon/good_job/blob/8b4fc67f0f78d3338c5df8e13851935ee6ca4812/lib/good_job/cli.rb#L92-L99
  3. Queries for existing and future jobs to cache <-- this is a potential culprit, but when I built the feature I saw it use roughly 20 MB per 10k cached items.
  4. Anytime an execution thread is awakened (e.g. to execute a job immediately via LISTEN/NOTIFY, to run a scheduled job from the cache, or on every poll_interval), that work takes place inside a Rails Executor, which should be similar to how Puma works. One thing to flag here is that poll_interval wakes up a thread and makes an Active Record query every N seconds; both the cache and the polling are configurable (see the sketch after this list).
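
If the cache in step 3 or the polling in step 4 turned out to be the issue, both have configuration knobs. This is only an illustrative sketch (the values are not recommendations, and you should verify the option names against the README for the GoodJob version you're running):

```rb
# Illustrative tuning only -- check option names/defaults against the GoodJob
# README for your version before using.
Rails.application.configure do
  config.good_job.max_cache = 5_000   # cap the scheduled-job cache (default is 10_000)
  config.good_job.poll_interval = 30  # wake a thread and query for jobs less often
end
```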

Because I haven't done any memory profiling, I can't give you anything specific about how to dig into the problem. I think getting some stackprof measurements into the initialization and looking at allocated objects would be a good place to start. It sounds like the problem isn't object retention (i.e. memory usage increasing over time), just that more memory is used than expected in the first place.
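
For the profiling itself, something like this standalone script could be a starting point. It's only a sketch: it assumes the stackprof gem is in your Gemfile, and it boots the Rails environment directly (roughly what `good_job start` ends up doing) rather than going through the CLI:

```rb
# Rough sketch: count object allocations during boot using stackprof's :object mode.
require "stackprof"

profile = StackProf.run(mode: :object, raw: true) do
  require_relative "config/environment" # boot the Rails app
end

StackProf::Report.new(profile).print_text(false, 20) # print the top 20 allocation sites
```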

BTW, here's memory usage for my reference application.

[Screenshot: 2022-05-30 at 7 53 41 AM]

sedubois commented 2 years ago

Since upgrading our app from GoodJob 2.15.1 to 3.0.0, our Heroku worker memory has been spiking.

Worker memory, past 2 hours:

[Screenshot: 2022-07-02 at 13 26 02]

Worker memory, past 24 hours:

[Screenshot: 2022-07-02 at 13 22 48]

Worker memory, past 72 hours:

[Screenshot: 2022-07-02 at 12 56 15]

The web dyno's memory usage hasn't changed. Other dependencies were also upgraded (Zeitwerk, Rack, ...), but the suspect is GoodJob, as it's the worker memory that has increased, and an inspection of the latest memory spikes shows a close overlap with the execution times of the latest jobs in the GoodJob dashboard:

| GoodJob 'Scheduled At' time (UTC) | Memory spike time (1min or 10min resolution) (UTC) |
| ------------- | ------------- |
| 10:49:18 | 10:49-10:51 |
| 10:36:50 | 10:37-10:39 |
| 09:52:47 | 09:52-09:59:00 |
| 09:28:26, 09:29:44 | 09:20-09:30 |
| 08:21, 08:22, 08:25, 08:31 | 08:20-08:30 |
| 07:42:40, 07:43:10, 07:46:38 | 07:40-07:50 |
| 07:12:07, 07:19:40 | 07:10-07:20 |
| 06:25:34, 06:27:04 | 06:20-06:30 |
| 05:45:43, 05:50:42 | 05:40-05:50 |

The above table shows that jobs are infrequent. They either send some emails or make some API calls.
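
One way to tie the spikes more firmly to individual jobs would be to log process memory around each job. A minimal sketch, assuming the get_process_mem gem is available (the initializer path and log format are just placeholders):

```rb
# config/initializers/job_memory_logging.rb (hypothetical)
# Log RSS before and after every job so memory spikes can be matched to job classes.
require "get_process_mem"

ActiveJob::Base.around_perform do |job, block|
  before_mb = GetProcessMem.new.mb
  block.call
ensure
  after_mb = GetProcessMem.new.mb
  Rails.logger.info("#{job.class.name}: RSS #{before_mb.round(1)}MB -> #{after_mb.round(1)}MB")
end
```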

Here is the GoodJob initializer, for reference (in this release we also changed the syntax from `GoodJob.X = Y` to `Rails.application.configure do ... end`):

```rb
# app/config/good_job.rb
Rails.application.configure do
  config.good_job.preserve_job_records = true
  config.good_job.retry_on_unhandled_error = false
  config.good_job.on_thread_error = -> (exception) do
    Sentry.set_extras(good_job_on_thread_error: true)
    Sentry.capture_exception(exception)
  end
end

ActionMailer::MailDeliveryJob.retry_on StandardError, wait: :exponentially_longer, attempts: Float::INFINITY

ActionMailer::MailDeliveryJob.around_perform do |_job, block|
  block.call
rescue StandardError => e
  Sentry.set_extras(action_mailer_standard_error: true)
  Sentry.capture_exception(e)
  raise
end
```

Any idea if/how GoodJob could cause this?

bensheldon commented 2 years ago

@sedubois yikes! That's not good.

I can't think of anything off the top of my head that would have blown out memory.

Looking through the Changelog, the major changes are:

Let me do another sweep through the code and see what I can find.

bensheldon commented 2 years ago

@sedubois I found a possible culprit in #652 which has been released in v3.0.1. Please give that a try and let me know if that fixes your problems.

sedubois commented 2 years ago

@bensheldon Yesterday I downgraded GoodJob from 3.0.0 to 2.15.1 and the memory issue disappeared. Today I upgraded to 3.0.1 and there haven't been any memory spikes since, so it seems solved. Thanks! 👍

bensheldon commented 2 years ago

@sedubois fantastic! 🎉