Open StrangeWill opened 10 months ago
Please run the stdump utility to obtain managed stack traces of the hanging process and post the results here. Very often this happens due to CLR thread-pool starvation, or blocked network calls with no timeout set that are unrelated to Hangfire, and running the utility above will show in what methods the threads are stuck.
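For anyone else following along, capturing the dump looks roughly like this (a sketch; the PID `1234` is a placeholder for your hanging worker process, and `dotnet-stack` is an alternative tool, not something odinserj asked for):

```shell
# stdump (https://github.com/odinserj/stdump) prints managed stack
# traces of a running process to stdout; redirect to a file to share.
stdump 1234 > stacks.txt

# On .NET Core / .NET 5+ the dotnet-stack global tool can serve as a
# fallback when stdump fails:
dotnet tool install --global dotnet-stack
dotnet-stack report --process-id 1234 > stacks.txt
```

Both commands need to run on the machine hosting the process, with enough privileges to attach to it.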
Having the same problem here. This really needs to be fixed ASAP, since we have had multiple production outages because of it. We are also on .NET 7, using DynamicJobs with MS SQL Server for storage. At some point, jobs are only getting enqueued and nothing gets executed anymore.
Our setup: we have multiple types of workers, each handling specific recurring jobs (on a cron schedule) based on its configured queue. They all use the same SQL database via DynamicJobs. So Worker 1 handles jobs A, B and C, Worker 2 handles jobs D, E and F, and so on.
When the hanging starts, it starts for all workers at the same time, so the common denominator seems to be the SQL database. If only one of the worker applications were hanging, the jobs executed by the other workers should continue as normal.
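For context, our worker layout is roughly the following (a sketch using Hangfire's standard `BackgroundJobServerOptions`; the queue and server names here are placeholders, not our real configuration):

```csharp
using System;
using Hangfire;

var options = new BackgroundJobServerOptions
{
    ServerName = "worker-1",
    // Each worker process only dequeues jobs from its own queue; the
    // other workers run the same code with their own Queues value,
    // all pointed at the same SQL Server storage.
    Queues = new[] { "worker1-jobs" },
    WorkerCount = Environment.ProcessorCount * 5
};

using var server = new BackgroundJobServer(options);
```

Since every worker has its own queue and its own process, a hang in one of them should not stall the others, which is why the simultaneous freeze points at the shared storage.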
Below is a screenshot of the dashboard when the hanging starts:
`stdump` has been problematic on the system that was having the problem on IIS (it wouldn't provide useful information, failed to dump, etc.).
However, in one of our apps we eliminated the problem by making this change:
.UseRedisStorage(redisConnection).WithJobExpirationTimeout(TimeSpan.FromHours(24))
With the job expiration timeout set, the hanging stopped entirely; we have since run over 33 million jobs with no more hang-ups. We're testing this across our other setups.
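In context, the change sits in our startup configuration, roughly like this (a sketch assuming the Redis storage package that exposes `UseRedisStorage`/`WithJobExpirationTimeout`; `redisConnection` is a placeholder for the connection string):

```csharp
using System;
using Hangfire;

// Expire finished jobs after 24 hours instead of the default, which
// keeps the job data set in Redis from growing without bound.
GlobalConfiguration.Configuration
    .UseRedisStorage(redisConnection)
    .WithJobExpirationTimeout(TimeSpan.FromHours(24));
```

We can't yet say whether the fix works because of reduced storage size or some other side effect, only that the hangs stopped after this change.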
@odinserj I'm getting an IndexOutOfRangeException from stdump. What am I doing wrong? May I please get a little help with my stdump issue? 🙏
We've had this issue across all of our Hangfire installs (probably 10+ apps), rarely (e.g., once every few hundred thousand jobs, sometimes millions), and across various backend providers (MSSQL, Postgres, MySQL, Redis): all queues just completely hang.
This has been a massive headache for us: jobs stay stuck in the queue, deleting them does not clear them out (they just sit in a "deleted" state), queued items cannot be deleted at all, jobs stay stuck in an "in progress" state, and nothing moves. The only solution is to restart the app.
Other than that, troubleshooting has been difficult. We've sometimes built health checks to at least alert us when Hangfire has hung, simply based on the last completed job time, but having to kick it randomly is a massive reliability issue for us. As far as we can tell from the documentation, there are no watchdog tweaks or anything else available to help recover from what seems to be a frozen/crashed manager/worker. We don't even know what to look at internally to determine what is going on, or which logs to turn on or monitor (we're on .NET 7.0). We don't see any warnings or errors thrown via `ILogger` to hint that there's a larger issue.
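For others wanting a similar alert, our "last completed job time" check is roughly the following (a hedged sketch using Hangfire's public monitoring API; the class name and the idea of skipping the check when nothing is queued are ours, and the silence threshold must be tuned to your job cadence):

```csharp
using System;
using System.Linq;
using Hangfire;

public static class HangfireLiveness
{
    // Returns false when jobs are waiting or running but nothing has
    // succeeded within the given window - our heuristic for a hang.
    public static bool LooksHealthy(TimeSpan maxSilence)
    {
        var monitoring = JobStorage.Current.GetMonitoringApi();

        // If nothing is enqueued or processing, there is nothing to
        // be stuck on, so report healthy.
        var stats = monitoring.GetStatistics();
        if (stats.Enqueued == 0 && stats.Processing == 0)
            return true;

        // Otherwise require at least one recent success.
        var recent = monitoring.SucceededJobs(0, 1).FirstOrDefault();
        var succeededAt = recent.Value?.SucceededAt;
        return succeededAt.HasValue
            && DateTime.UtcNow - succeededAt.Value < maxSilence;
    }
}
```

We wire this into an ASP.NET health check and alert on it; it tells us *when* Hangfire has hung, but not *why*, which is the part we still can't diagnose.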