HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io
Other
9.31k stars 1.69k forks source link

Recurring jobs not being executed when running as part of integration tests using a 'test host' configured with Hangfire per test case #2082

Open bhehe opened 2 years ago

bhehe commented 2 years ago

I've been struggling with a testing scenario that I've not been able to figure out so I'm seeking help/input on how to triage further and any thoughts on the possible root-cause/issue that would yield the behavior being described.

Issue

Context

Observed Behaviors (variations & fixes that have been tried)

Notes

Questions

bhehe commented 2 years ago

New Findings

On a hunch, I switched the test code back to using a local SQL database instance instead of the in-memory storage provider. With that change, all the tests pass when ran as a group/set. So the issue I'm chasing is definitely related to the use of the InMemory provider.

I need to go back in the history to the original version of things, but I suspect back when those original tests were working it was using Sqlite and either file-based backing or it was using Sqlite's own in-memory support.

bhehe commented 2 years ago

I've created a related issue over in the 'InMemory' repo's issues tracker as it's starting to sound like it may be more appropriate to chase it there.

odinserj commented 2 years ago

Please show me the code that create recurring jobs or use background jobs. I will check whether static RecurringJob or BackgroundJob classes are used and in this case they will need to be changed to IRecurringJobManager- and IBackgroundJobClient-based services. Because static classes will cache previous instance of InMemoryStorage and will not invalidate it to the current instance that's created by another test case. So they will write recurring jobs to the previous storage instead of the new one.

bhehe commented 2 years ago

Let me get you some code extracts but also for context, I did try an experiment where I registered all jobs so even in a "1st one wins" scenario they should have been there for triggering/executing as I was suspecting something about internal/static state management being a factor.

bhehe commented 2 years ago

So for the 1st question on how we register the jobs; Yes, we do use the static class/method. So if I'm understanding you, I would need to pass in the IServiceProvider and get an instance of the IRecurringJobManager and use that instead to register things to get around the issue.

            RecurringJob.AddOrUpdate<TJob>(
                recurringJobId: registeredJobName,
                methodCall: (job) => job.Execute(default(CancellationToken)),
                cronExpression: JobConfiguration.JobCron,
                queue: JobConfiguration.JobQueue);

I think we could make this change fairly easily if required.

For the 2nd question, we are triggering the jobs via the static class/method, and now leverage the newer version that returns the jobId so we know we did successfully trigger it or not and we can track its progress/outcome.

            var jobId = RecurringJob.TriggerJob(registeredJobName);

So in this case for triggering execution, it sounds like we'd need to resolve the IBackgroundJobClient and use that for triggering/executing the jobs.

Alternately, assuming we can wait for a 'fix' to be released, is there a way to make the on-startup behavior of Hangfire itself just invalidate/clear any prior instances of the storage provider if present? Ensure we're starting with a 'clean slate' before the storage provider is configured via the HangfireGlobalConfiguration is exposed in the AddHangfire(..) method?

bhehe commented 2 years ago

Given your comment about the static state management, what isn't making sense to me is that with my experiment to register all jobs in all tests - why did the jobs not execute at all? The call to trigger them did return a value (it's checked for not-null, not empty, not white-space).

Would it be that the register/trigger code paths were seeing them present but the job scheduler did not - i.e. it was seeing a different storage provider instance.

odinserj commented 2 years ago

Trigger doesn't immediately execute the method behind a recurring job – it creates and enqueues a regular background job based on method and arguments in a recurring job. And then that background job is expected to be processed by background job server. And if it's not running or running for the previous storage instance, it will not be able to process the recurring job execution.

bhehe commented 2 years ago

Right. I understand that it's enqueued and not 'instant' in nature. I left my test code to 'wait' and allowed it over an hour.

So that's why I was assuming that somehow the scheduler was seeing a different storage than what the Register/Trigger static methods were seeing and using which is why I was able to both register the jobs (all) and trigger the single job the test was needing - but it never got scheduled/executed.

bhehe commented 2 years ago

So just to share an update, a coworker of mine is chasing this issue at the moment and he was examining the code in the RecurringJobManager class and how it is managing its internal state. We now see where the issues lies I believe.

What he found was the public parameterless constructor for RecurringJobManager is capturing a reference to the JobStorage.Current versus just referencing that directly & using the value. This approach of capturing the reference appears to be for avoiding the overhead of the lock/null check in the .Current property.

So the static instance of RecurringJob is internally managing its own singleton instance of the RecurringJobManager which in turn has a captured reference to the JobStorage.Current which is why we're seeing the behavior we get.

At one time, the original author had implemented things where the code (both for registering and triggering jobs) was always creating a new instance of the RecurringJobManager versus simply using the RecurringJob (static) class. Now we know why - there were zero clues/comments in that code as to the need to do this vs using the static type.

Once we switched the code to create an instance of the RecurringJobManager for each call (whether registering or triggering the jobs) the tests now execute correctly and are passing.

In our case, the overhead of creating an instance per call isn't an issue as it's only on startup (once per job) and the act of triggering the jobs is only done in our integration tests so it's not in any critical path/high performance code for us.

For now this should unblock us, but we'd still like to see a more long-term/permanent solution that wouldn't require us to recreate instances of the job manager class over & over and ideally we'd be coding against those static classes as that's what you see cited in documentation / examples / blogs / etc. and it seems like the intent is for those to be what is used.

I recognize that our use here isn't typical runtime usage but in the case of integration testing it doesn't seem like we are doing anything wrong either (imo) as we're just creating/using 'test host' instances.

adamdriscoll commented 7 months ago

Thanks. Was having the same problem and this fixed it. You'll also need to replace BackgroundJob with instances of BackgroundJobClient.