Open StrangeWill opened 10 months ago
Please run the stdump utility to obtain managed stack traces of the hanging process and post the results here. Very often this happens due to CLR thread-pool starvation, or blocked network calls with no timeout set that are unrelated to Hangfire, and running the utility above will show in what methods the threads are stuck.
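For anyone else following along, capturing the dump looks roughly like this (a sketch; the PID `1234` is a placeholder for your hanging worker process, and `dotnet-stack` is an alternative tool, not something odinserj asked for):

```shell
# stdump (https://github.com/odinserj/stdump) prints managed stack
# traces of a running process to stdout; redirect to a file to share.
stdump 1234 > stacks.txt

# On .NET Core / .NET 5+ the dotnet-stack global tool can serve as a
# fallback when stdump fails:
dotnet tool install --global dotnet-stack
dotnet-stack report --process-id 1234 > stacks.txt
```

Both commands need to run on the machine hosting the process, with enough privileges to attach to it.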
Having the same problem here. This really needs to be fixed ASAP, since we have had multiple production outages because of it. We are also on .NET 7, using DynamicJobs with MS SQL Server for storage. At some point, jobs are only getting enqueued and nothing gets executed anymore.
Our setup: we have multiple types of workers, each handling specific recurring jobs (on a cron schedule) based on its configured queue. They all use the same SQL database via DynamicJobs. So Worker 1 handles jobs A, B and C, Worker 2 handles jobs D, E and F, and so on.
When the hanging starts, it starts for all workers at the same time, so the common denominator seems to be the SQL database. If only one of the worker applications were hanging, the jobs executed by the other workers should continue as normal.
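For context, our worker layout is roughly the following (a sketch using Hangfire's standard `BackgroundJobServerOptions`; the queue and server names here are placeholders, not our real configuration):

```csharp
using System;
using Hangfire;

var options = new BackgroundJobServerOptions
{
    ServerName = "worker-1",
    // Each worker process only dequeues jobs from its own queue; the
    // other workers run the same code with their own Queues value,
    // all pointed at the same SQL Server storage.
    Queues = new[] { "worker1-jobs" },
    WorkerCount = Environment.ProcessorCount * 5
};

using var server = new BackgroundJobServer(options);
```

Since every worker has its own queue and its own process, a hang in one of them should not stall the others, which is why the simultaneous freeze points at the shared storage.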
Below is a screenshot of the dashboard when the hanging starts:
`stdump` has been problematic on the system that was having the problem on IIS (it wouldn't provide useful information, failed to dump, etc.).
However, in one of our apps we eliminated the problem by making this change:
.UseRedisStorage(redisConnection).WithJobExpirationTimeout(TimeSpan.FromHours(24))
With the job expiration timeout set, the hanging stopped entirely; we have since run over 33 million jobs with no more hang-ups. We're testing this across our other setups.
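In context, the change sits in our startup configuration, roughly like this (a sketch assuming the Redis storage package that exposes `UseRedisStorage`/`WithJobExpirationTimeout`; `redisConnection` is a placeholder for the connection string):

```csharp
using System;
using Hangfire;

// Expire finished jobs after 24 hours instead of the default, which
// keeps the job data set in Redis from growing without bound.
GlobalConfiguration.Configuration
    .UseRedisStorage(redisConnection)
    .WithJobExpirationTimeout(TimeSpan.FromHours(24));
```

We can't yet say whether the fix works because of reduced storage size or some other side effect, only that the hangs stopped after this change.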
@odinserj I'm getting an IndexOutOfRangeException from stdump. What am I doing wrong? May I please get a little help with my stdump issue? 🙏
We've had this issue across all of our Hangfire installs (probably 10+ apps), rarely (e.g., once every few hundred thousand jobs, sometimes millions), and across various backend providers (MSSQL, Postgres, MySQL, Redis): all queues just completely hang.
This has been a massive headache for us: jobs stay stuck in the queue, deleting them does not clear them out (they just sit in a "deleted" state), queued items cannot be deleted at all, jobs stay stuck in an "in progress" state, and nothing moves. The only solution is to restart the app.
Other than that, troubleshooting has been difficult. We've sometimes built health checks to at least alert us when Hangfire has hung, simply based on the last completed job time, but having to kick it randomly is a massive reliability issue for us. As far as we can tell from the documentation, there are no watchdog tweaks or anything else available to help recover from what seems to be a frozen/crashed manager/worker. We don't even know what to look at internally to determine what is going on, or which logs to turn on or monitor (we're on .NET 7.0). We don't see any warnings or errors thrown via `ILogger` to hint that there's a larger issue.
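For others wanting a similar alert, our "last completed job time" check is roughly the following (a hedged sketch using Hangfire's public monitoring API; the class name and the idea of skipping the check when nothing is queued are ours, and the silence threshold must be tuned to your job cadence):

```csharp
using System;
using System.Linq;
using Hangfire;

public static class HangfireLiveness
{
    // Returns false when jobs are waiting or running but nothing has
    // succeeded within the given window - our heuristic for a hang.
    public static bool LooksHealthy(TimeSpan maxSilence)
    {
        var monitoring = JobStorage.Current.GetMonitoringApi();

        // If nothing is enqueued or processing, there is nothing to
        // be stuck on, so report healthy.
        var stats = monitoring.GetStatistics();
        if (stats.Enqueued == 0 && stats.Processing == 0)
            return true;

        // Otherwise require at least one recent success.
        var recent = monitoring.SucceededJobs(0, 1).FirstOrDefault();
        var succeededAt = recent.Value?.SucceededAt;
        return succeededAt.HasValue
            && DateTime.UtcNow - succeededAt.Value < maxSilence;
    }
}
```

We wire this into an ASP.NET health check and alert on it; it tells us *when* Hangfire has hung, but not *why*, which is the part we still can't diagnose.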