Betterment / delayed

a multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day
MIT License
156 stars 9 forks source link

Fix vanishing thread data between deserialising and performing an ActiveJob instance #36

Open caius opened 10 months ago

caius commented 10 months ago

Change Delayed::JobWrapper from directly invoking :execute callbacks and job#perform_now, to instead calling ActiveJob::Base.execute and letting ActiveJob handle the callbacks, job deserialisation and performing internally.

To explain why, we've observed a change in behaviour between delayed_job's queue adapter and delayed's queue adapter with thread variables being reset between job#deserialize and job#perform being called from the same worker thread/process.

Some background context around the Rails executor and ActiveSupport::CurrentAttributes is useful. Feel free to skip down a couple of paragraphs if you already understand them!

The ActiveJob Railtie has an initialiser that wraps all job executions with the Rails executor (it calls app.reloader in the code, but that's effectively the executor still.) This ensures code is reloaded between jobs running in development mode, but also makes sure any executor callbacks are invoked when the reloader block is entered at runtime in any environment. (If you've ever wondered how things in rails get reset between web requests, this mechanism is the answer.)

The ActiveSupport Railtie has an initialiser that resets ActiveSupport::CurrentAttributes instances when entering an executor block. This means before every job is run, all subclasses/instances of ActiveSupport::CurrentAttributes have their values cleared which stops things leaking between requests/jobs/blocks of work. This effectively lets you have thread-global variables that are cleaned up once the current unit of work has finished, without you needing to remember.

In our app we're using the Rails default Current model which is a subclass of ActiveSupport::CurrentAttributes to store some request-global information about an external service to use. (This allows us to override that on a per-request basis in non-production environments for testing purposes.) We pass that override down into the jobs at enqueue time by overriding ApplicationJob#serialize to save the value of Current.service_url into the job data hash.

class ApplicationJob < ActiveJob::Base
  def serialize
    super.merge("service_url" => Current.service_url)
  end

  def deserialize(job_data)
    super
    Current.service_url = job_data.delete("service_url")
  end
end

Our code in SomeJob#perform and elsewhere in the app invoked from the job can then call Current.service_url to find the overridden value. Doing it this way means the behaviour is shared between all jobs that need that overridden url without having to remember to override serialize/deserialize in each job when required.

Switching from delayed_job to delayed broke this mechanism silently 😬. No errors were raised but our override wasn't being observed in the code. Debugging it we observed within the ApplicationJob#deserialize method Current.service_url was being set correctly, but by the time the job#perform method was called Current.service_url was nil again. The case of the vanishing variable?! πŸ™€

At this point we deduced it was likely an ordering issue around ActiveSupport::CurrentAttributes being reset and investigated further in that direction. After littering puts statements through delayed and ActiveJob I think we've figured out why this is occurring and how it differs to delayed_job's behaviour.

We see JobWrapper#job being invoked early in the worker method in Delayed::Worker#run, via Delayed::Worker#max_run_time seeing if the job instance responds to #max_run_time. DJ Hooks will also use the job instance (payload_object). JobWrapper forwards all missing methods to the job method, which deserialises the job object from the data.

The JobWrapper#perform method then re-uses the @job memoized instance in when invoking execute callbacks (which reset all the ActiveSupport::CurrentAttributes values) then calls job.perform_now without invoking #deserialize again. This means we've deserialised the job, then invoked the Rails executor entry callbacks, then called job#perform. Subtle, but unexpected (to us at least) behaviour change.

This PR changes JobWrapper#perform to call ActiveJob::Base.execute(job_data) to have ActiveJob invoke callbacks, deserialize the job and perform it in the correct order. In delayed's case it means job#deserialize will be called twice, due to the worker interrogating the job instance before getting to calling JobWrapper#perform. The crucial change is that deserialisation happens after the rails executor callbacks, so any override to job#deserialize expecting to be run in the same "unit of work" as the job#perform method will work.

For reference, the delayed_job adapter doesn't deserialize the job_data before calling ActiveJob::Base.execute so doesn't have this ordering effect in it. The tradeoff is delayed_job's adapter also doesn't allow overriding max_run_time, etc in the job class which is useful.

I can't think of a way to solve this without either breaking the API allowing max_run_time methods on job classes (aside from perhaps making them class methods?) or having the side-effect of deserializing the job before the Rails executor entry which can lead to seemingly weird behaviour when using standard Rails executor aware objects in this way.

(For ease of linking to the source code, the above description links to Rails 7.1.2 source code, but holds true back to Rails 5.0 for ActiveJob and Rails 5.2 for ActiveSupport::CurrentAttributes. Aka the first version each feature appeared respectively.)

caius commented 10 months ago

The order we observe on 0.5.2 from lots of puts debugging is:

We now arrive at OurJob#perform with Current having only nil values even though we set them in ApplicationJob#deserialize.

smudge commented 9 months ago

@caius it's been a busy couple of weeks for me, but I haven't forgotten this PR! I intend to do some testing of this branch, but your reasoning is sound, and I suspect that because much of our DJ infrastructure predates ActiveSupport::CurrentAttributes (instead we have other ways of sharing & clearing job-thread-global state), we hadn't encountered this inconsistency yet.

caius commented 9 months ago

@smudge no worries, it's not exactly a simple thing to think through and understand so wasn't expecting it to be a quick thing. Let me know if you need anything more explaining to help πŸ‘πŸ»

bmt-github-policybot commented 7 months ago

This pull request has been automatically closed because it has not been updated in the last month. :sleepy:

If you still need this change, you can reopen it.

Thanks for helping keep our house in order!

caius commented 7 months ago

If you still need this change, you can reopen it.

Don't appear to be able to do that, bit of a hostile experience as an outside contributor if I'm honest.

jmileham commented 7 months ago

Hi Caius, I'm the CTO at Betterment and I wanted to say sorry for the unpleasant experience. This was a result of a change we made for our internal software dev workflows, and we haven't been thoughtful about the impact on our OSS repos. We'll take that away and see what to do. Genuinely appreciate the contribution.

caius commented 7 months ago

@jmileham appreciate the note, I didn't mean my original message to come across as curt as I suspect it might have done re-reading it. Thanks :)

jmileham commented 7 months ago

Not at all, just wanted to make sure you know we see the unpleasant experience and appreciate the effort you've put in. Sorry we haven't been able to engage as quickly as we'd like with this PR.

caius commented 7 months ago

Noticed CI had a linting error for parallel assignment, fixed that up whilst here πŸ™ƒ