Recurring Job stuck after being considered "aborted"

Hey guys, I've been noticing a very strange behavior for a long time now.

It happened in my development environment, and I was able to reproduce the problem in my local Hangfire environment.

My HF application is using the latest version, and it is configured to run using a SQL Server database.

To reproduce the problem, this is what I did in my local environment:

1. Created a test job that runs every minute and does nothing except sleep the thread for 5 minutes.
2. Started it, and after a few seconds manually stopped the process.
3. Restarted the process and went to the processing queue. There I could see the previously triggered job listed as "processing" but marked as aborted.
4. After 1 minute the application should trigger the job again (because it's configured to run every minute), but nothing happens. The job is stuck in the aborted state forever.

The only way to make it work again is by destroying the database entirely (deleting all tables) and re-running the application, so the database gets created and everything starts working again.

Any ideas of what is going on?

Thanks!

Hello everyone. I found a solution to the problem.

From my tests, it seems that if a process is killed while a job is still processing (or if the database connection drops for some reason), the ServerId changes when it reconnects. That makes sense, since it is recognized as a new server and added to the server list with a new ID. The previous job then ends up in a faulted state, because it is tied to a server that no longer exists. From then on, new runs of the same recurring job are canceled, because there is already a pending one in the list waiting for the deceased server.
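To make the idea of an "orphan" job concrete, here is a minimal sketch in Python. It is only an in-memory model of the behavior described above, not Hangfire code: `ProcessingJob`, `find_orphans`, and the server IDs are hypothetical names chosen for illustration. A job counts as orphaned when the server it was assigned to is no longer in the list of live servers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingJob:
    # Hypothetical record: a job in the "processing" list and the
    # server that picked it up.
    job_id: str
    server_id: str


def find_orphans(processing, live_server_ids):
    """Return the processing jobs whose server is no longer registered."""
    live = set(live_server_ids)
    return [job for job in processing if job.server_id not in live]


# "server-old" was killed; after the restart only "server-new" is registered.
jobs = [
    ProcessingJob("job-1", "server-old"),
    ProcessingJob("job-2", "server-new"),
]
orphans = find_orphans(jobs, ["server-new"])
orphan_ids = [job.job_id for job in orphans]
```

In this model only `job-1` is reported, because its server ID no longer appears among the live servers.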

That said, one way to address the issue is to check for orphan jobs pending in the database when the application starts and delete them manually. That requires a system restart to put everything back on track, but it works perfectly.

Another way to address the issue (and the one I'm using in the production environment) is to check for orphan jobs from a recurring job that runs on whatever schedule I choose (right now it checks every 5 minutes). Whenever an orphan job is found it gets deleted, which lets the next run of the recurring job work normally.
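The effect of such a purge can be sketched with a small in-memory model in Python. This is an illustration under assumptions, not Hangfire's implementation: `InMemoryStore`, `trigger`, and `purge_orphans` are hypothetical names, and the rule that a new run is skipped while an older one is still listed as processing is taken from the behavior I observed, not from the Hangfire source.

```python
class InMemoryStore:
    """Hypothetical stand-in for the job storage."""

    def __init__(self):
        self.live_servers = set()
        self.processing = {}  # recurring job id -> id of the server running it

    def trigger(self, recurring_id, server_id):
        """Start a run unless one is already recorded as processing."""
        if recurring_id in self.processing:
            return False
        self.processing[recurring_id] = server_id
        return True

    def purge_orphans(self):
        """Delete processing entries whose server is gone; return their ids."""
        orphaned = [
            rid for rid, sid in self.processing.items()
            if sid not in self.live_servers
        ]
        for rid in orphaned:
            del self.processing[rid]
        return orphaned


store = InMemoryStore()
store.live_servers = {"server-a"}
started_first = store.trigger("sleep-job", "server-a")

# The process is killed and comes back under a new server id.
store.live_servers = {"server-b"}
blocked = store.trigger("sleep-job", "server-b")

# The periodic cleanup removes the entry tied to the dead server.
purged = store.purge_orphans()
started_after = store.trigger("sleep-job", "server-b")
```

In the model, the run attempted right after the restart is refused, and the same call succeeds once the purge has removed the stale entry. In the real setup the purge itself would be scheduled as its own recurring job.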

@odinserj, do you think it would be a good idea for me to create a pull request that adds a new optional feature to purge orphan jobs on a configurable schedule? I honestly think that A LOT of people struggle with this problematic behavior that keeps jobs in a faulted/canceled state.

What do you think? Let me know if you'd like me to implement that in the codebase.

Cheers, and have a good one! 👊

HangfireIO / Hangfire

Recurring Job stuck after being considered "aborted" #2439