bensheldon / good_job

Multithreaded, Postgres-based, Active Job backend for Ruby on Rails.
https://goodjob-demo.herokuapp.com/
MIT License
2.67k stars 199 forks source link

`GoodJob.on_thread_error` not called in tests #1102

Closed jdudley1123 closed 1 year ago

jdudley1123 commented 1 year ago

I have GoodJob (v3.19.4) on Rails (v6.1.7.6) configured to log errors:

Rails.application.configure do
  config.good_job = {
    on_thread_error: ->(e) { ErrorLogger.log(e.message) },
  }
end

That all seems to work fine, but when I try to write a test for it, the test always fails.

require 'rails_helper'

RSpec.describe 'GoodJob' do
  let(:job) { ActiveDirectoryIdChangeRequestJob } # This could be any job
  let(:error_message) { 'An error has occurred!' }

  describe 'when an error occurs during a job' do
    it 'logs an error' do
      allow(ErrorLogger).to receive(:log)
      allow(Person).to receive(:find).and_raise(RuntimeError, error_message)
      job.perform_later

      expect(ErrorLogger).to have_received(:log).with(error_message)
    end
  end
end

However, if I do the error logging in ApplicationJob, the test passes:

class ApplicationJob < ActiveJob::Base
  around_perform do |_job, block|
    block.call
  rescue RuntimeError => e
    ErrorLogger.log(e.message)
  end
end

That suggests to me that GoodJob.on_thread_error isn't being called, but I can't figure out why not. The documentation says that it won't be called if there are an infinite number of retries, but in my set up there are no retries. What am I missing?

bensheldon commented 1 year ago

Just a thought before I dig in deeper: can you double check that the queue adapter in your test case is GoodJob's and not the Active Job Test Adapter?

jdudley1123 commented 1 year ago

config/environments/test.rb has config.active_job.queue_adapter = :good_job, and checking the value of ActiveJob::Base.queue_adapter during the test returns an instance of ActiveJob::QueueAdapters::GoodJobAdapter

bensheldon commented 1 year ago

Thanks! Can you doublecheck the value of ActiveDirectoryIdChangeRequestJob.queue_adapter? I want to specifically rule out:

jdudley1123 commented 1 year ago

ActiveDirectoryIdChangeRequestJob.queue_adapter is also returning an instance of ActiveJob::QueueAdapters::GoodJobAdapter I'm afraid

bensheldon commented 1 year ago

Ok, could you try adding your stub assertion directly onto GoodJob._on_thread_error? (with the leading underscore) That will help me see if the problem here is the configuration initializing at the wrong time.

allow(GoodJob).to receive(:_on_thread_error)
allow(Person).to receive(:find).and_raise(RuntimeError, error_message)
job.perform_later

expect(GoodJob).to have_received(:_on_thread_error).with(error_message)
jdudley1123 commented 1 year ago

Updating the test to include those lines:

  describe 'when an error occurs during a job' do
    it 'logs an error' do
      allow(ErrorLogger).to receive(:log)
      allow(GoodJob).to receive(:_on_thread_error)
      allow(Person).to receive(:find).and_raise(RuntimeError, error_message)
      job.perform_later

      expect(GoodJob).to have_received(:_on_thread_error).with(error_message)
      expect(ErrorLogger).to have_received(:log).with(error_message)
    end
  end

I get:

  Failures:

  1) GoodJob when an error occurs during a job logs an error
     Failure/Error: expect(GoodJob).to have_received(:_on_thread_error).with(error_message)

       (GoodJob)._on_thread_error("An error has occurred!")
           expected: 1 time with arguments: ("An error has occurred!")
           received: 0 times

Thanks so much for taking the time to investigate this šŸ™‚

bensheldon commented 1 year ago

Hmm, this is really strange. I'm not quite sure why that isn't being triggered. I just added a test to simulate this on my side: https://github.com/bensheldon/good_job/pull/1104

Would you be able to inspect the enqueued GoodJob Job? You could do GoodJob::Job.last to give you the record. Does it have the error info?

jdudley1123 commented 1 year ago

Inspecting GoodJob::Job.last after job.perform_later is called gets me the following, with no evidence of the error:

#<GoodJob::Job:0x00007fad9661deb0
 id: "9adb9586-35f9-42e9-94cb-1d180af7b895",
 queue_name: "default",
 priority: 0,
 serialized_params:
  {"job_id"=>"9adb9586-35f9-42e9-94cb-1d180af7b895",
   "locale"=>"en",
   "priority"=>nil,
   "timezone"=>"UTC",
   "arguments"=>[],
   "job_class"=>"ActiveDirectoryIdChangeRequestJob",
   "executions"=>0,
   "queue_name"=>"default",
   "enqueued_at"=>"2023-10-12T09:00:09Z",
   "provider_job_id"=>nil,
   "exception_executions"=>{}},
 scheduled_at: Thu, 12 Oct 2023 09:00:09.654835000 UTC +00:00,
 performed_at: Thu, 12 Oct 2023 09:00:09.670314000 UTC +00:00,
 finished_at: Thu, 12 Oct 2023 09:00:12.078229000 UTC +00:00,
 error: nil,
 created_at: Thu, 12 Oct 2023 09:00:09.654835000 UTC +00:00,
 updated_at: Thu, 12 Oct 2023 09:00:12.080652000 UTC +00:00,
 active_job_id: "9adb9586-35f9-42e9-94cb-1d180af7b895",
 concurrency_key: nil,
 cron_key: nil,
 retried_good_job_id: nil,
 cron_at: nil,
 batch_id: nil,
 batch_callback_id: nil,
 is_discrete: true,
 executions_count: 1,
 job_class: "ActiveDirectoryIdChangeRequestJob",
 error_event: nil>
bensheldon commented 1 year ago

@jdudley1123 huh! That would imply to me that there is a rescue_from or some other around_perform block that is intercepting the exception. My reading of that job data is that the job completed successfully (meaning no exception was raised all the way up to the adapter).

jdudley1123 commented 1 year ago

Apologies, I think I muddled some things up while trying various different configurations.

Job:

# app/jobs/active_directory_id_change_request_job.rb
class ActiveDirectoryIdChangeRequestJob < ApplicationJob
  def perform()
    Person.find(1)
  end
end

GoodJob config:

# config/initalizers/hood_job.rb
Rails.application.configure do
  config.good_job = {
    on_thread_error: ->(e) { ErrorLogger.log(e.message) },
  }
end

Test:

# spec/jobs/job_error_logging_spec.rb
require 'rails_helper'

RSpec.describe 'GoodJob' do
  let(:job) { ActiveDirectoryIdChangeRequestJob } # This could be any job
  let(:error_message) { 'An error has occurred!' }

  describe 'when an error occurs during a job' do
    it 'logs an error' do
      allow(ErrorLogger).to receive(:log)
      allow(GoodJob).to receive(:_on_thread_error)
      allow(Person).to receive(:find).and_raise(RuntimeError, error_message)
      begin
        job.perform_later
      rescue
      end
      # Checking GoodJob::Job.last here

      expect(GoodJob).to have_received(:_on_thread_error).with(error_message)
      expect(ErrorLogger).to have_received(:log).with(error_message)
    end
  end
end

GoodJob::Job.last output

#<GoodJob::Job:0x00007f8c7d5e4780
 id: "1ad5541f-a63b-45d2-85d0-8dcac6da1af7",
 queue_name: "default",
 priority: 0,
 serialized_params:
  {"job_id"=>"1ad5541f-a63b-45d2-85d0-8dcac6da1af7",
   "locale"=>"en",
   "priority"=>nil,
   "timezone"=>"UTC",
   "arguments"=>[],
   "job_class"=>"ActiveDirectoryIdChangeRequestJob",
   "executions"=>0,
   "queue_name"=>"default",
   "enqueued_at"=>"2023-10-19T16:30:32Z",
   "provider_job_id"=>nil,
   "exception_executions"=>{}},
 scheduled_at: Thu, 19 Oct 2023 16:30:32.727841000 UTC +00:00,
 performed_at: Thu, 19 Oct 2023 16:30:32.778566000 UTC +00:00,
 finished_at: Thu, 19 Oct 2023 16:30:35.056818000 UTC +00:00,
 error: "RuntimeError: An error has occurred!",
 created_at: Thu, 19 Oct 2023 16:30:32.727841000 UTC +00:00,
 updated_at: Thu, 19 Oct 2023 16:30:35.062804000 UTC +00:00,
 active_job_id: "1ad5541f-a63b-45d2-85d0-8dcac6da1af7",
 concurrency_key: nil,
 cron_key: nil,
 retried_good_job_id: nil,
 cron_at: nil,
 batch_id: nil,
 batch_callback_id: nil,
 is_discrete: true,
 executions_count: 1,
 job_class: "ActiveDirectoryIdChangeRequestJob",
 error_event: 1>

Test outcome:

 Failures:

  1) GoodJob when an error occurs during a job logs an error
     Failure/Error: expect(GoodJob).to have_received(:_on_thread_error).with(error_message)

       (GoodJob)._on_thread_error("An error has occurred!")
           expected: 1 time with arguments: ("An error has occurred!")
           received: 0 times
bensheldon commented 1 year ago

Sorry, I think I may have also sent you on a goose chase too.

Seeing your begin / rescue block in your test made me realize: oh, this is running in :inline mode, which means that GoodJob raises exceptions; GoodJob doesn't report exceptions in :inline mode to on_thread_error because it's not running in a background thread.

If you wanted to test that (and I don't believe it's necessary) you could do this as a test:


RSpec.describe do
  let(:job) { ActiveDirectoryIdChangeRequestJob } # This could be any job
  let(:error_message) { 'An error has occurred!' }

  describe 'when an error occurs during a job' do
    it 'logs an error' do
      capsule = GoodJob::Capsule.new(configuration: GoodJob::Configuration.new({ max_threads: 5, queue_string: '*' }))
      adapter = GoodJob::Adapter.new(execution_mode: :async_all, _capsule: capsule)

      allow(ActiveDirectoryIdChangeRequestJob).to receive(:queue_adapter).and_return adapter
      allow(ErrorLogger).to receive(:log)
      allow(Person).to receive(:find).and_raise(RuntimeError, error_message)

      job.perform_later

      expect(GoodJob::Job.count).to eq 1 # double-check it got enqueued
      sleep 0.1 until GoodJob::Job.finished.count == 0
      capsule.shutdown

      expect(ErrorLogger).to have_received(:log).with(error_message)
    end
  end
end
jdudley1123 commented 1 year ago

Hmm, that test still doesn't pass for me ā€“ the failure is the final assertion about the error logger receiving the error message, which suggests GoodJob still isn't calling on_thread_error even in :async_all mode.

bensheldon commented 1 year ago

Here's my Calendly if we can debug it together: https://calendly.com/bensheldon/office-hours

bensheldon commented 1 year ago

@jdudley1123 that was a fun and fruitful debugging session!

To summarize, and hopefully come up in search for anyone in the future, we discovered:

The Application and Rspec tests used DatabaseCleaner to create transactional fixtures (DatabaseCleaner.strategy = :transaction), and disabled Rails' built-in transactional fixtures.

Unfortunately DatabaseCleaner's transactional fixtures do not have the incredibly helpful feature of Rails 5.1+ sharing a database connection across all threads.

The core problem: Within the test, the job was enqueued and the job record inserted in the test, in an open transaction within one database connection. But then a background thread used a different database connection to try to query the job record, and never found it, because it was in the transaction opened by the other database connection.

My suggestion was to remove DatabaseCleaner because I don't think it's necessary anymore because Rails test fixtures are so good and work with System/Integration tests, and if you do truly need to truncate, it's rather trivial to do directly with Rails as GoodJob does itself.

jdudley1123 commented 12 months ago

Thanks so much for all your help @bensheldon! The test passes nicely now, and removing Database Cleaner had no effect on the rest of the suite.