Recurring jobs not being executed when running as part of integration tests using a 'test host' configured with Hangfire per test case

bhehe commented 2 years ago

I've been struggling with a testing scenario that I've not been able to figure out so I'm seeking help/input on how to triage further and any thoughts on the possible root-cause/issue that would yield the behavior being described.

Issue

Tests run fine when run individually: we can arrange a 'test host' with Hangfire, register the recurring job, trigger it, wait for the outcome, and then perform our assertions.
When the tests run as a set, only the 1st one passes. The others fail even though the 'test host' itself starts and runs fine; the failure occurs when I check whether the recurring job was executed (regardless of the actual outcome, at that point I'm only checking that the job class/method was entered and exited).

Context

We are using xUnit tests run by the VS test runner. All tests that use this approach are marked to not execute concurrently with other test classes via xUnit's collection attribute and DisableParallelization=true in that attribute.
A discrete instance of a 'generic host' is created and arranged for each test case/method being executed. Each host is managed and disposed within the individual test method, so the lifetime and disposal of the host and Hangfire should be scoped accordingly.
the 'test host' instance is arranged with Hangfire, using the InMemory storage provider
the recurring job is then registered with Hangfire; each test is then only registering the 1 job it is focused on.
we then trigger the recurring job for execution.
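
The arrangement described in the list above might look roughly like the following. This is a minimal sketch, not the poster's actual code: it assumes the xunit, Microsoft.Extensions.Hosting, Hangfire.NetCore (or Hangfire.AspNetCore) and Hangfire.InMemory packages, and the collection name and test class are hypothetical.

```csharp
using System.Threading.Tasks;
using Hangfire;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Xunit;

// Hypothetical collection name. DisableParallelization keeps this collection
// from running at the same time as other test collections.
[CollectionDefinition("Hangfire host tests", DisableParallelization = true)]
public class HangfireHostCollection { }

[Collection("Hangfire host tests")]
public class RecurringJobTests
{
    [Fact]
    public async Task Host_is_arranged_per_test()
    {
        // A fresh generic host per test method, disposed when the method ends.
        using IHost host = new HostBuilder()
            .ConfigureServices(services =>
            {
                services.AddHangfire(config => config.UseInMemoryStorage());
                services.AddHangfireServer();
            })
            .Build();

        await host.StartAsync();

        // Register the recurring job, trigger it, wait, and assert here.

        await host.StopAsync();
    }
}
```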

Observed Behaviors (variations & fixes that have been tried)

initially we had helper code to trigger execution of a recurring job that was using the older Trigger(jobName) method which could silently fail to trigger the jobs since no result was returned.
we were using an instance-per-call of the RecurringJobManager class (the original author used this approach, unsure why this was chosen vs using the static RecurringJob class and its Trigger(jobName) method)
in recent efforts, we switched to the newer string TriggerExecution(jobName) method which does return a result indicating it successfully invoked the job or not.
at this time that newer method is only available on the RecurringJobManager class, not the static RecurringJob class; This means we are managing the lifetime of that component instance in our helper code.
after that change, we began to see a new clue as to our problem - the 1st test would successfully trigger the job but the 2nd and subsequent ones all returned a null value indicating it could not even trigger the job!
Next we tried an experiment: we managed the RecurringJobManager instance as a singleton in our own helper code, modelled on how the RecurringJob static class does it.
That change resulted in a new behavior: now we can trigger the job (a value is returned from TriggerExecution(jobName)) but nothing appears to happen. The job never runs, even with a long wait time in the test.
we're tracking the start/end of jobs via a server filter so I can tell that we never hit the OnPerforming method in these cases.
I've even tried a variation of allowing the tests to arrange "all jobs" and only trigger the one the individual test cases cares about but that didn't change the behavior.
Originally we were using the in-memory package from another 3rd party rather than the one from the Hangfire team. We've since switched to the 'official' version from Hangfire itself, but that didn't fix the issue.
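
For readers unfamiliar with the server filter mentioned above, a tracking filter could be sketched as follows. The class name and the two queues are hypothetical; only the IServerFilter contract (OnPerforming / OnPerformed) comes from Hangfire.

```csharp
using System.Collections.Concurrent;
using Hangfire.Common;
using Hangfire.Server;

// Hypothetical filter: records background job ids as a server starts and
// finishes performing them, so a test can check whether a job ever ran.
public sealed class JobTrackingFilter : JobFilterAttribute, IServerFilter
{
    public ConcurrentQueue<string> Started { get; } = new ConcurrentQueue<string>();
    public ConcurrentQueue<string> Finished { get; } = new ConcurrentQueue<string>();

    // Called on the server just before the job method is invoked.
    public void OnPerforming(PerformingContext context)
        => Started.Enqueue(context.BackgroundJob.Id);

    // Called on the server after the job method has returned or thrown.
    public void OnPerformed(PerformedContext context)
        => Finished.Enqueue(context.BackgroundJob.Id);
}
```

If OnPerforming is never reached, no server picked the job up at all, which is a different failure from the job itself throwing.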

Notes

at one time this testing approach was working, but at some point changes were made that resulted in the tests starting to flap and we've not been able to identify any 'root cause' set of changes that would account for this change in the fundamental behavior.

Questions

One question I'm looking at is how to get better insight into why the job isn't being run / isn't hitting the OnPerforming method in the server filter. I'm going to look into setting the logging to trace level to see if that offers any more clues.
in the current logging, I'm not seeing any indications of a failure or issue that would explain why the job isn't getting executed.
the other question I'm thinking of is would it be expected to see a change in behavior as described above when I'm using a per-method-call instance of RecurringJobManager vs managing a singleton instance of it in my helper code/class.

bhehe commented 2 years ago

New Findings

On a hunch, I switched the test code back to using a local SQL database instance instead of the in-memory storage provider. With that change, all the tests pass when run as a group/set. So the issue I'm chasing is definitely related to the use of the InMemory provider.

I need to go back in the history to the original version of things, but I suspect back when those original tests were working it was using Sqlite and either file-based backing or it was using Sqlite's own in-memory support.

bhehe commented 2 years ago

I've created a related issue over in the 'InMemory' repo's issues tracker as it's starting to sound like it may be more appropriate to chase it there.

odinserj commented 2 years ago

Please show me the code that creates recurring jobs or uses background jobs. I will check whether the static RecurringJob or BackgroundJob classes are used; if so, they will need to be changed to IRecurringJobManager- and IBackgroundJobClient-based services. The static classes cache the previous instance of InMemoryStorage and do not switch to the current instance created by another test case, so they write recurring jobs to the previous storage instead of the new one.

bhehe commented 2 years ago

Let me get you some code extracts but also for context, I did try an experiment where I registered all jobs so even in a "1st one wins" scenario they should have been there for triggering/executing as I was suspecting something about internal/static state management being a factor.

bhehe commented 2 years ago

So for the 1st question on how we register the jobs; Yes, we do use the static class/method. So if I'm understanding you, I would need to pass in the IServiceProvider and get an instance of the IRecurringJobManager and use that instead to register things to get around the issue.

            RecurringJob.AddOrUpdate<TJob>(
                recurringJobId: registeredJobName,
                methodCall: (job) => job.Execute(default(CancellationToken)),
                cronExpression: JobConfiguration.JobCron,
                queue: JobConfiguration.JobQueue);

I think we could make this change fairly easily if required.
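
A minimal sketch of what the service-based registration might look like, assuming the host's container has Hangfire registered via AddHangfire. The IJob interface and the JobRegistration helper are hypothetical stand-ins for the TJob types and helper code in this thread, and the queue argument from the snippet above is left out because its overload differs between Hangfire versions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Hangfire;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical job contract standing in for the TJob types in the thread.
public interface IJob
{
    Task Execute(CancellationToken cancellationToken);
}

public static class JobRegistration
{
    public static void Register<TJob>(
        IServiceProvider services, string registeredJobName, string cronExpression)
        where TJob : IJob
    {
        // Resolved from this host's container rather than from the static
        // RecurringJob class, so it targets the storage this host configured.
        var manager = services.GetRequiredService<IRecurringJobManager>();

        manager.AddOrUpdate<TJob>(
            registeredJobName,
            job => job.Execute(CancellationToken.None),
            cronExpression);
    }
}
```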

For the 2nd question, we are triggering the jobs via the static class/method, and now leverage the newer version that returns the jobId so we know we did successfully trigger it or not and we can track its progress/outcome.

            var jobId = RecurringJob.TriggerJob(registeredJobName);

So in this case for triggering execution, it sounds like we'd need to resolve the IBackgroundJobClient and use that for triggering/executing the jobs.

Alternatively, assuming we can wait for a 'fix' to be released, is there a way to make Hangfire's on-startup behavior invalidate/clear any prior instances of the storage provider, if present? That is, to ensure we're starting with a 'clean slate' before the storage provider is configured via the HangfireGlobalConfiguration that is exposed in the AddHangfire(..) method?

bhehe commented 2 years ago

Given your comment about the static state management, what isn't making sense to me is that with my experiment to register all jobs in all tests - why did the jobs not execute at all? The call to trigger them did return a value (it's checked for not-null, not empty, not white-space).

Would it be that the register/trigger code paths were seeing them present but the job scheduler was not, i.e. it was seeing a different storage provider instance?

odinserj commented 2 years ago

Trigger doesn't immediately execute the method behind a recurring job: it creates and enqueues a regular background job based on the method and arguments in the recurring job. That background job is then expected to be processed by a background job server. If the server is not running, or is running against the previous storage instance, it will not be able to process the recurring job execution.
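
To illustrate the point, here is a hedged sketch outside of any host, assuming the Hangfire.Core and Hangfire.InMemory packages. The manager and the server are handed the same storage instance explicitly; the job id, the job body and the five-second wait are arbitrary choices for illustration.

```csharp
using System;
using System.Threading;
using Hangfire;
using Hangfire.InMemory;

public static class SameStorageSketch
{
    public static void Main()
    {
        // One storage instance, shared explicitly by the manager and the server.
        var storage = new InMemoryStorage();
        var manager = new RecurringJobManager(storage);

        manager.AddOrUpdate(
            "sample-job",
            () => Console.WriteLine("recurring job ran"),
            Cron.Daily());

        using (var server = new BackgroundJobServer(new BackgroundJobServerOptions(), storage))
        {
            // Only creates and enqueues a background job in `storage`;
            // the server above is what actually performs it.
            manager.Trigger("sample-job");

            // Arbitrary pause to give the server time to pick the job up.
            Thread.Sleep(TimeSpan.FromSeconds(5));
        }
    }
}
```

If the server were constructed with a different storage instance than the manager, the trigger would still succeed, but the enqueued job would sit in a storage that no server is reading.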

bhehe commented 2 years ago

Right. I understand that it's enqueued and not 'instant' in nature. I left my test code to 'wait' and allowed it over an hour.

So that's why I was assuming that somehow the scheduler was seeing a different storage than what the Register/Trigger static methods were seeing and using which is why I was able to both register the jobs (all) and trigger the single job the test was needing - but it never got scheduled/executed.

bhehe commented 2 years ago

So just to share an update, a coworker of mine is chasing this issue at the moment and he was examining the code in the RecurringJobManager class and how it manages its internal state. I believe we now see where the issue lies.

What he found was the public parameterless constructor for RecurringJobManager is capturing a reference to the JobStorage.Current versus just referencing that directly & using the value. This approach of capturing the reference appears to be for avoiding the overhead of the lock/null check in the .Current property.

So the static instance of RecurringJob is internally managing its own singleton instance of the RecurringJobManager which in turn has a captured reference to the JobStorage.Current which is why we're seeing the behavior we get.
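
The mechanism can be modelled without Hangfire at all. In the sketch below every type is a hypothetical stand-in (Ambient for JobStorage.Current, Manager for RecurringJobManager, StaticFacade for the static RecurringJob class); it illustrates the captured-reference pattern described above and is not Hangfire's actual source.

```csharp
using System;

public sealed class Storage
{
    public Storage(string name) { Name = name; }
    public string Name { get; }
}

public static class Ambient
{
    // Stand-in for the process-wide "current storage" property.
    public static Storage Current { get; set; }
}

public sealed class Manager
{
    private readonly Storage _storage;

    // The current storage is captured once, at construction time.
    public Manager() { _storage = Ambient.Current; }

    public string Target => _storage.Name;
}

public static class StaticFacade
{
    // One manager for the lifetime of the process, created on first use.
    private static readonly Lazy<Manager> Instance =
        new Lazy<Manager>(() => new Manager());

    public static string Target => Instance.Value.Target;
}

public static class Program
{
    public static void Main()
    {
        Ambient.Current = new Storage("test-1");
        Console.WriteLine(StaticFacade.Target);   // test-1

        Ambient.Current = new Storage("test-2");  // second test configures a new storage
        Console.WriteLine(StaticFacade.Target);   // still test-1
        Console.WriteLine(new Manager().Target);  // test-2
    }
}
```

The last two lines mirror what the tests observed: the static facade keeps writing to the first test's storage, while a freshly constructed manager sees the current one.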

At one time, the original author had implemented things where the code (both for registering and triggering jobs) was always creating a new instance of the RecurringJobManager versus simply using the RecurringJob (static) class. Now we know why - there were zero clues/comments in that code as to the need to do this vs using the static type.

Once we switched the code to create an instance of the RecurringJobManager for each call (whether registering or triggering the jobs) the tests now execute correctly and are passing.
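
A sketch of what such a per-call helper could look like; the helper class is hypothetical, and it uses the void Trigger method, which the id-returning variant mentioned earlier in the thread could replace where it is available.

```csharp
using Hangfire;

// Hypothetical helper: a new manager per call, so JobStorage.Current is read
// at call time instead of being captured once by a long-lived instance.
public static class RecurringJobHelper
{
    public static void Trigger(string registeredJobName)
    {
        var manager = new RecurringJobManager(JobStorage.Current);
        manager.Trigger(registeredJobName);
    }
}
```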

In our case, the overhead of creating an instance per call isn't an issue as it's only on startup (once per job) and the act of triggering the jobs is only done in our integration tests so it's not in any critical path/high performance code for us.

For now this should unblock us, but we'd still like to see a more long-term/permanent solution that wouldn't require us to recreate instances of the job manager class over & over and ideally we'd be coding against those static classes as that's what you see cited in documentation / examples / blogs / etc. and it seems like the intent is for those to be what is used.

I recognize that our use here isn't typical runtime usage but in the case of integration testing it doesn't seem like we are doing anything wrong either (imo) as we're just creating/using 'test host' instances.

adamdriscoll commented 7 months ago

Thanks. Was having the same problem and this fixed it. You'll also need to replace BackgroundJob with instances of BackgroundJobClient.
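
The same idea applied to fire-and-forget jobs might look like this. It is a sketch only: the helper class and the job body are hypothetical, and the storage is assumed to be the instance that the current test host configured.

```csharp
using System;
using Hangfire;

public static class BackgroundJobHelper
{
    public static string Enqueue(JobStorage storage)
    {
        // An instance bound to the given storage, instead of the static
        // BackgroundJob class and whatever storage it captured earlier.
        IBackgroundJobClient client = new BackgroundJobClient(storage);

        // Returns the id of the created background job.
        return client.Enqueue(() => Console.WriteLine("hello from a background job"));
    }
}
```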
