bensheldon / good_job

Multithreaded, Postgres-based, Active Job backend for Ruby on Rails.
https://goodjob-demo.herokuapp.com/
MIT License

Gracefully handle `NameError` while workers are restarting #1546

Open jonleighton opened 1 week ago

jonleighton commented 1 week ago

Here's a deployment scenario we encountered:

  1. New code is deployed to web servers first
  2. Web servers start enqueuing a new job that has been added in the deploy
  3. Old worker processes start picking up those jobs, which results in a `NameError`
  4. Worker processes restart
  5. The new processes would be capable of executing the jobs, but they never get the chance, because the failed jobs were not retried; we must manually reschedule them to resolve the problem.

We cannot configure retries for this error via Active Job, because that requires the job constant to be deserialized in order to instantiate the job instance. (The error occurs here in Active Job.)
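Roughly, the execution path looks like this (a simplified sketch of Active Job's deserialization, not the literal Rails source):

```ruby
# Simplified sketch of what a worker does with a serialized job payload.
job_data = { "job_class" => "BrandNewJob", "arguments" => [] }

# If BrandNewJob only exists in the newly deployed code, an old worker
# raises NameError right here, before any job instance exists for
# retry_on, rescue_from, or discard_on to act on.
job_class = job_data["job_class"].constantize

job = job_class.new
job.deserialize(job_data)
job.perform_now
```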

It would be nice if Good Job could gracefully handle this somehow. I could turn on config.retry_on_unhandled_error, but I don't want to because of the potential downsides of that.

Perhaps this config option could be extended to allow error types to be specified:

```ruby
config.retry_on_unhandled_error = [NameError]
```

However, I feel like this would still only be useful if there were some kind of exponential backoff built in. (Or at least a retry interval at the bare minimum, so we don't DoS ourselves.)
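To make that concrete, here's a hypothetical shape it could take (today `retry_on_unhandled_error` is just a boolean; the array form and the wait setting below are invented):

```ruby
# Hypothetical extension, not current GoodJob API.
Rails.application.configure do
  config.good_job.retry_on_unhandled_error = [NameError]
  # Invented companion setting: exponential backoff by execution count.
  config.good_job.retry_on_unhandled_error_wait = ->(executions) { (30 * 2**executions).seconds }
end
```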

Earlopain commented 4 days ago

I'm always very cautious when making changes to jobs, precisely because of this. Adding a job, removing one, or changing arguments can all lead to trouble when the server and worker don't agree on what is what. I usually make multiple deploys to sidestep this (even with Sidekiq, which I know would handle these errors for me; it just seems safer).
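For example, an argument change can be split across two deploys so that already-enqueued jobs keep working (the job here is illustrative):

```ruby
# Deploy 1: accept both the old and new signatures, so payloads already in
# the queue still deserialize and run.
class ReportJob < ApplicationJob
  def perform(account_id, format: "pdf")
    # ...
  end
end

# Deploy 2, once the queue has drained of old payloads: change the
# signature for real (e.g. make `format` required, or remove it).
```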

That said, I think some kind of implicit error handling for this does make sense. Sidekiq, for example, has a retry mechanism that retries repeatedly over a month or so. For errors happening before the job even has a chance to start, I think that makes sense. `retry_on_unhandled_error` doesn't seem that well-equipped for it, since you wouldn't be able to distinguish a `NameError` raised by the job itself from one raised during deserialization. What do you think @bensheldon?

bensheldon commented 4 days ago

I’m open to it.

Thinking about the situations I encounter, I feel like I'd want to scope this particular issue to "deploy rollout" related problems, and do our best not to mask the other problems (or at least not mask them for too long).

That makes me think that the config would be something like `config.graceful_job_rollout_period = 15.minutes`, which would have the effect of retrying `NameError`s and `ArgumentError`s that occur during that window.

(I jammed `ArgumentError`s into this, but feel free to exclude them if you think that's fundamentally different.)
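To illustrate one possible interpretation (measuring the window from worker boot; the setting and helper here are invented names, not GoodJob API):

```ruby
# Hypothetical sketch only; graceful_job_rollout_period and this helper
# are invented, not GoodJob API.
ROLLOUT_PERIOD = 15.minutes
WORKER_BOOTED_AT = Time.current

def retryable_rollout_error?(error)
  # Shortly after a (re)start, treat class/argument mismatches as transient
  # old-code/new-code disagreements rather than permanent failures.
  recently_restarted = Time.current - WORKER_BOOTED_AT < ROLLOUT_PERIOD
  recently_restarted && (error.is_a?(NameError) || error.is_a?(ArgumentError))
end
```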

Or, a totally different thought: should we rethink `retry_on_unhandled_error`? It could take a proc with the error as its argument, and the proc could return the number of seconds to wait before retrying. (Heaven knows what happens if the proc itself raises an exception.)
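For example, a hypothetical proc form could look like this (not a real GoodJob option today; currently the setting is a boolean):

```ruby
# Hypothetical proc form of the existing boolean setting: return the number
# of seconds to wait before retrying, or nil to let the job fail as usual.
Rails.application.configure do
  config.good_job.retry_on_unhandled_error = ->(error) do
    case error
    when NameError then 30  # likely stale code during a deploy; retry soon
    else nil                # don't mask other unhandled errors
    end
  end
end
```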

I don't love this idea, because I frequently feel like people interpret "advanced usage" as something to aspire to, and I'd rather build explicit features.

But I want to open the door in case there are lots of other use cases that would benefit from a more generic change.

Earlopain commented 4 days ago

> retrying NameErrors and ArgumentErrors

Can we look at the caller's stack and really narrow it down to just job deserialization?

For `NameError`, I think it would be nice if there was a specific error that Rails raises when that happens. There isn't one, but it shouldn't be difficult to add. Maybe just `DeserializationError`, since that's already there.
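For reference, `ActiveJob::DeserializationError` already exists today, but it wraps errors raised while deserializing *arguments*, and it can be handled with the standard hooks:

```ruby
class ApplicationJob < ActiveJob::Base
  # Existing Active Job behavior: DeserializationError is raised when an
  # argument fails to deserialize (e.g. a GlobalID record that was deleted).
  discard_on ActiveJob::DeserializationError
end
```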

> Changing `retry_on_unhandled_error`

Wouldn't that just reimplement what you can already do with `rescue_from`/`retry_job`? I'm not so sure about precedence if you already have other `retry_on`/`discard_on` hooks, but you could probably make it work. There are very few places where these won't fire, so I don't know if it would make sense to extend it.
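Roughly what `retry_on` builds on today; note that none of this can run for a `NameError` raised while deserializing the job class, because no job instance exists yet (the error class and limits here are illustrative):

```ruby
class ApplicationJob < ActiveJob::Base
  # rescue_from/retry_job are the building blocks retry_on uses; they only
  # fire once a job instance exists, i.e. after deserialization succeeded.
  rescue_from(ArgumentError) do |error|
    if executions < 5
      retry_job(wait: 30.seconds)
    else
      raise error
    end
  end
end
```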


More generally, this would help with maybe 90% of problems. If a job class is deleted but instances of it are still enqueued, no amount of retrying will ever make them succeed. The same goes for an argument that is removed or added without the job being prepared for it. Retrying helps with newly enqueued jobs, but you really need to be mindful of already-enqueued jobs, and probably make these changes across multiple deploys.

bensheldon commented 4 days ago

Thanks for those comments! That's helpful, and it reminds me why these things are slightly different.

So I guess I'll reduce the scope back down to: it would be nice to handle `NameError` specifically from job deserialization. I think it makes sense for Active Job to raise a special error, though maybe slightly different from `DeserializationError`, to distinguish the job class itself failing to resolve from the job's arguments failing to deserialize.
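A minimal sketch of what a distinct error could look like; the name and raise site here are invented for illustration, not necessarily what Rails ends up with:

```ruby
# Hypothetical; the error name and placement are assumptions.
module ActiveJob
  class UnknownJobClassError < NameError; end
end

# ...raised where Active Job resolves the serialized class name:
job_data = { "job_class" => "MissingJob", "arguments" => [] }
begin
  job_class = job_data["job_class"].constantize
rescue NameError
  raise ActiveJob::UnknownJobClassError,
        "Failed to locate job class #{job_data["job_class"].inspect}"
end
```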

jonleighton commented 4 days ago

I agree that this should just be about NameError, as the other two can be retried via Active Job.

> though maybe Active Job should still have config of what should happen

Yeah, I reckon in an ideal world Active Job would have some kind of story around how this is handled, because it's going to crop up for any queue adapter. Maybe it would honour the retry semantics set on `ApplicationJob` or `ActiveJob::Base`.
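For instance, if Active Job consulted the base class when the concrete class is missing, configuration like this could apply (speculative; `UnknownJobClassError` is the invented name from the sketch above):

```ruby
class ApplicationJob < ActiveJob::Base
  # Speculative: this would only work if Active Job fell back to the base
  # class's retry semantics when the concrete job class can't be found.
  retry_on ActiveJob::UnknownJobClassError, wait: 1.minute, attempts: 10
end
```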

Earlopain commented 3 days ago

> I agree that this should just be about NameError, as the other two can be retried via Active Job.

That's a good point. Although it would still be somewhat difficult to distinguish between the error happening inside your job and before it executes. `ArgumentError` carries very little information, so I think you would need to look at the call stack. Or I guess you can just assume `ArgumentError` doesn't happen in your own code.
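A rough sketch of that kind of call-stack inspection; the message and frame checks are assumptions and inherently fragile:

```ruby
# Sketch only: an arity mismatch raised while *invoking* perform surfaces
# at the method definition itself with a "wrong number of arguments"
# message, whereas an ArgumentError from the job body points at the
# failing line inside it. Backtrace frame format varies by Ruby version.
def invocation_argument_error?(error)
  return false unless error.is_a?(ArgumentError)

  first_frame = error.backtrace.to_a.first.to_s
  error.message.start_with?("wrong number of arguments") &&
    first_frame.include?("perform")
end
```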

For now, I opened https://github.com/rails/rails/pull/53770.