bensheldon / good_job

Multithreaded, Postgres-based, Active Job backend for Ruby on Rails.

GoodJob is using a lot of memory / RAM #613

Closed. rgaufman closed this issue 1 year ago.

rgaufman commented 2 years ago

Hi there,

I have my good_job set up like this:

```rb
config.good_job = {
  preserve_job_records: false,
  retry_on_unhandled_error: false,
  on_thread_error: ->(exception) { Airbrake.notify(exception) },
  execution_mode: :external,
  queues: '*',
  max_threads: 10,
  poll_interval: 15,
  shutdown_timeout: 25,
  enable_cron: true,
  logger: Rails.logger
}
```

Here is a screenshot of it in action:

[Screenshot: 2022-05-29 at 20 01 51]

There are rarely more than 3-4 jobs running in parallel.

Compared to Sidekiq running the same jobs, it seems to take almost double the RAM, and it looks like this in top:

[Screenshot: 2022-05-29 at 20 02 55]

Any ideas what could be responsible for this? Should I disable the dashboard? How would I troubleshoot and reduce RAM usage?

rgaufman commented 2 years ago

I noticed some strange behaviour: when starting Puma and good_job, both jump to about 30% RAM use. Puma then drops quite quickly and settles at around 9%, while good_job stays at around 29% (Sidekiq previously used roughly the same RAM as Puma with the same workers). This is 10 minutes later:

[Screenshot: 2022-05-29 at 20 38 41]

(This is with the dashboard disabled and preserve_job_records: false)

Any ideas on how to troubleshoot this and figure out why so much RAM is being used?

bensheldon commented 2 years ago

@rgaufman thanks for sharing this! I haven't done much memory profiling of GoodJob, though I'm surprised that GoodJob would perform significantly differently than Puma. Take the following as speculation (because I haven't looked)...

GoodJob's initialization process is pretty simple:

  1. Boots the Rails framework: https://github.com/bensheldon/good_job/blob/8b4fc67f0f78d3338c5df8e13851935ee6ca4812/lib/good_job/cli.rb#L158-L159
  2. Initializes a bunch of Concurrent Ruby objects (ThreadExecutors, Timers): https://github.com/bensheldon/good_job/blob/8b4fc67f0f78d3338c5df8e13851935ee6ca4812/lib/good_job/cli.rb#L92-L99
  3. Queries for existing and future jobs to cache <-- this is a potential culprit, but when I built the feature I saw it use roughly 20 MB per 10k cached items.
  4. Anytime an execution thread is awakened (e.g. to execute a job immediately via LISTEN/NOTIFY, to run a scheduled job from the cache, or on every poll_interval), that work takes place inside a Rails Executor, which should be similar to how Puma works. One thing to flag here is that poll_interval wakes up a thread and makes an Active Record query every N seconds; both the cache and the polling are configurable (see the sketch after this list).
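
If the cache in step 3 or the polling in step 4 turned out to be the issue, both have configuration knobs. This is only an illustrative sketch (the values are not recommendations, and you should verify the option names against the README for the GoodJob version you're running):

```rb
# Illustrative tuning only -- check option names/defaults against the GoodJob
# README for your version before using.
Rails.application.configure do
  config.good_job.max_cache = 5_000   # cap the scheduled-job cache (default is 10_000)
  config.good_job.poll_interval = 30  # wake a thread and query for jobs less often
end
```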

Because I haven't done any memory profiling, I can't give you anything specific about how to dig into the problem. I think getting some stackprof measurements into the initialization and looking at allocated objects would be a good place to start. It sounds like the problem isn't object retention (i.e. memory usage increasing over time), just that more memory is used than expected in the first place.
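
For the profiling itself, something like this standalone script could be a starting point. It's only a sketch: it assumes the stackprof gem is in your Gemfile, and it boots the Rails environment directly (roughly what `good_job start` ends up doing) rather than going through the CLI:

```rb
# Rough sketch: count object allocations during boot using stackprof's :object mode.
require "stackprof"

profile = StackProf.run(mode: :object, raw: true) do
  require_relative "config/environment" # boot the Rails app
end

StackProf::Report.new(profile).print_text(false, 20) # print the top 20 allocation sites
```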

BTW, here's memory usage for my reference application.

[Screenshot: 2022-05-30 at 7 53 41 AM]

sedubois commented 2 years ago

Since upgrading our app from GoodJob 2.15.1 to 3.0.0, our Heroku worker memory has been spiking.

Worker memory, past 2 hours:

[Screenshot: 2022-07-02 at 13 26 02]

Worker memory, past 24 hours:

[Screenshot: 2022-07-02 at 13 22 48]

Worker memory, past 72 hours:

[Screenshot: 2022-07-02 at 12 56 15]

The web dyno's memory usage hasn't changed. Other dependencies were also upgraded (Zeitwerk, Rack, ...), but the suspect is GoodJob, as it's the worker memory that has increased, and an inspection of the latest memory spikes shows a close overlap with the execution times of the latest jobs in the GoodJob dashboard:

| GoodJob 'Scheduled At' time (UTC) | Memory spike time (1min or 10min resolution) (UTC) |
| ------------- | ------------- |
| 10:49:18 | 10:49-10:51 |
| 10:36:50 | 10:37-10:39 |
| 09:52:47 | 09:52-09:59:00 |
| 09:28:26, 09:29:44 | 09:20-09:30 |
| 08:21, 08:22, 08:25, 08:31 | 08:20-08:30 |
| 07:42:40, 07:43:10, 07:46:38 | 07:40-07:50 |
| 07:12:07, 07:19:40 | 07:10-07:20 |
| 06:25:34, 06:27:04 | 06:20-06:30 |
| 05:45:43, 05:50:42 | 05:40-05:50 |

The above table shows that jobs are infrequent. They either send some emails or make some API calls.
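
One way to tie the spikes more firmly to individual jobs would be to log process memory around each job. A minimal sketch, assuming the get_process_mem gem is available (the initializer path and log format are just placeholders):

```rb
# config/initializers/job_memory_logging.rb (hypothetical)
# Log RSS before and after every job so memory spikes can be matched to job classes.
require "get_process_mem"

ActiveJob::Base.around_perform do |job, block|
  before_mb = GetProcessMem.new.mb
  block.call
ensure
  after_mb = GetProcessMem.new.mb
  Rails.logger.info("#{job.class.name}: RSS #{before_mb.round(1)}MB -> #{after_mb.round(1)}MB")
end
```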

Here is the GoodJob initializer, for reference (in this release we also changed the syntax from `GoodJob.X = Y` to `Rails.application.configure do ... end`):

```rb
# app/config/good_job.rb
Rails.application.configure do
  config.good_job.preserve_job_records = true
  config.good_job.retry_on_unhandled_error = false
  config.good_job.on_thread_error = -> (exception) do
    Sentry.set_extras(good_job_on_thread_error: true)
    Sentry.capture_exception(exception)
  end
end

ActionMailer::MailDeliveryJob.retry_on StandardError, wait: :exponentially_longer, attempts: Float::INFINITY

ActionMailer::MailDeliveryJob.around_perform do |_job, block|
  block.call
rescue StandardError => e
  Sentry.set_extras(action_mailer_standard_error: true)
  Sentry.capture_exception(e)
  raise
end
```

Any idea if/how GoodJob could cause this?

bensheldon commented 2 years ago

@sedubois yikes! That's not good.

I can't think of anything off the top of my head that would have blown out memory.

Looking through the Changelog, the major changes are:

Let me do another sweep through the code and see what I can find.

bensheldon commented 2 years ago

@sedubois I found a possible culprit in #652 which has been released in v3.0.1. Please give that a try and let me know if that fixes your problems.

sedubois commented 2 years ago

@bensheldon Yesterday I downgraded GoodJob from 3.0.0 to 2.15.1 and the memory issue disappeared. Today I upgraded to 3.0.1 and there haven't been any memory spikes since, so it seems solved. Thanks! 👍

bensheldon commented 2 years ago

@sedubois fantastic! 🎉