HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

Jobs Stuck in Enqueued State #2355

Open mwasson74 opened 10 months ago

mwasson74 commented 10 months ago

I realize I'm not using the latest versions of things, but these were the latest versions when I started having this issue in production. With the latest package versions I could not get a stack trace dump at all, because of the stdump issue "IndexOutOfRangeException - What am I doing wrong?".

ASP.NET Core, .NET 6
Hangfire.AspNetCore Version="1.8.6"
Hangfire.Console Version="1.4.2"
Hangfire.Core Version="1.8.6"
Hangfire.Dashboard.BasicAuthorization Version="1.0.2"
Hangfire.Mongo Version="1.9.12"

stdump_hangfire.txt

Classes have this attribute applied: SkipWhenPreviousJobIsRunningAttribute.txt

The Execute methods have [DisableConcurrentExecution("{0}", 3)] applied.

image

image

HangfireDashboard

odinserj commented 10 months ago

Thanks for the dump file! I believe the following thread is the most interesting one. It holds a semaphore, so other worker threads are waiting for it to complete. If this thread is stuck, new background jobs will not be processed, and it likely is stuck.

I found the following issue on GitHub - https://github.com/dotnet/runtime/issues/70656 - with a similar stack trace that happened in .NET 6.X, and that issue states the problem was fixed in .NET 7.0. I see you are using an affected version, so perhaps the best recommendation I can give is to upgrade to a newer .NET version. Unfortunately, I also see https://github.com/dotnet/runtime/issues/83455, but it looks like that one was fixed in .NET 7.0.7 and 8.0.

Thread #41
  OS Thread ID:      81092
  AppDomain Address: 1776550875936
  State:             176672

  Managed stack trace:
   - [InlinedCallFrame] (Interop+Winsock.recv) at System.Net.Sockets.dll
   - [InlinedCallFrame] (Interop+Winsock.recv) at System.Net.Sockets.dll
   -  at 
   - System.Net.Sockets.Socket.Receive(System.Span`1<Byte>, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError ByRef) at System.Net.Sockets.dll
   - System.Net.Sockets.NetworkStream.Read(System.Span`1<Byte>) at System.Net.Sockets.dll
   - System.Net.Security.SslStream+<EnsureFullTlsFrameAsync>d__186`1[[System.Net.Security.SyncReadWriteAdapter, System.Net.Security]].MoveNext() at System.Net.Security.dll
   -  at 
   -  at 
   - System.Net.Security.SslStream+<ReadAsyncInternal>d__188`1[[System.Net.Security.SyncReadWriteAdapter, System.Net.Security]].MoveNext() at System.Net.Security.dll
   -  at 
   - System.Net.Security.SslStream.Read(Byte[], Int32, Int32) at System.Net.Security.dll
   - MongoDB.Driver.Core.Misc.StreamExtensionMethods.ReadBytes(System.IO.Stream, Byte[], Int32, Int32, System.Threading.CancellationToken) at MongoDB.Driver.Core.dll
   - MongoDB.Driver.Core.Connections.BinaryConnection.ReceiveBuffer(System.Threading.CancellationToken) at MongoDB.Driver.Core.dll
   - MongoDB.Driver.Core.Connections.BinaryConnection.ReceiveBuffer(Int32, System.Threading.CancellationToken) at MongoDB.Driver.Core.dll
   - MongoDB.Driver.Core.Connections.BinaryConnection.ReceiveMessage(Int32, MongoDB.Driver.Core.WireProtocol.Messages.Encoders.IMessageEncoderSelector, MongoDB.Driver.Core.WireProtocol.Messages.Encoders.MessageEncoderSettings, System.Threading.CancellationToken) at MongoDB.Driver.Core.dll
   - MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+PooledConnection.ReceiveMessage(Int32, MongoDB.Driver.Core.WireProtocol.Messages.Encoders.IMessageEncoderSelector, MongoDB.Driver.Core.WireProtocol.Messages.Encoders.MessageEncoderSettings, System.Threading.CancellationToken) at MongoDB.Driver.Core.dll
   - MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+AcquiredConnection.ReceiveMessage(Int32, MongoDB.Driver.Core.WireProtocol.Messages.Encoders.IMessageEncoderSelector, MongoDB.Driver.Core.WireProtocol.Messages.Encoders.MessageEncoderSettings, System.Threading.CancellationToken) at MongoDB.Driver.Core.dll
   - MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1[[System.__Canon, System.Private.CoreLib]].Execute(MongoDB.Driver.Core.Connections.IConnection, System.Threading.CancellationToken) at MongoDB.Driver.Core.dll
   - MongoDB.Driver.Core.WireProtocol.CommandWireProtocol`1[[System.__Canon, System.Private.CoreLib]].Execute(MongoDB.Driver.Core.Connections.IConnection, System.Threading.CancellationToken) at MongoDB.Driver.Core.dll
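
The situation described above, one thread holding a semaphore while blocked in a socket read that never returns, can be modelled in a few lines. This is only an illustrative sketch in Python, not Hangfire or MongoDB driver code: `run_demo`, `stuck_holder`, `worker`, and the `reply_arrived` event (standing in for the network reply) are all hypothetical names, and the timeouts are arbitrary.

```python
import threading


def run_demo(worker_count=3, wait_seconds=0.2):
    """Model a holder thread that blocks while owning the only semaphore slot."""
    slot = threading.Semaphore(1)        # assumed: a single slot guards the shared resource
    reply_arrived = threading.Event()    # stands in for the socket read completing
    holder_ready = threading.Event()

    def stuck_holder():
        # Take the slot, then block as if waiting on a read with no timeout.
        with slot:
            holder_ready.set()
            reply_arrived.wait()

    def worker(results, index):
        # Workers give up after a short wait; without a timeout they would wait forever.
        acquired = slot.acquire(timeout=wait_seconds)
        results[index] = acquired
        if acquired:
            slot.release()

    holder = threading.Thread(target=stuck_holder, daemon=True)
    holder.start()
    holder_ready.wait()

    results = [None] * worker_count
    workers = [
        threading.Thread(target=worker, args=(results, i))
        for i in range(worker_count)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

    # Simulate the read finally returning, then check that the slot is free again.
    reply_arrived.set()
    holder.join()
    recovered = slot.acquire(timeout=wait_seconds)
    if recovered:
        slot.release()
    return results, recovered
```

While the holder is blocked, every worker fails to acquire the slot; once the simulated reply arrives, the slot becomes available again. In the real dump the read has no such release, which is why the other workers would never make progress.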
mwasson74 commented 10 months ago

@odinserj, thank you so much for getting back to me on this so quickly!! I have upgraded to .NET 8 just now and am about to deploy to see how it goes!! 🤞

mwasson74 commented 10 months ago

It did not go well. Here is the stack trace from when it happened again:

stdump_hangfire2.txt

ASP.NET Core, .NET 8
Hangfire.AspNetCore Version="1.8.9"
Hangfire.Console Version="1.4.2"
Hangfire.Core Version="1.8.9"
Hangfire.Dashboard.BasicAuthorization Version="1.0.2"
Hangfire.Mongo Version="1.9.16"

odinserj commented 10 months ago

Hm, so the main issue is that the enqueued count in the metrics is inconsistent with the records themselves, e.g. it shows there are some jobs, but you don't see them?

image

mwasson74 commented 10 months ago

That is, I assume, the symptom of the underlying issue. When this happens, the system thinks those jobs are still running and won't enqueue them again. So in the instance from the screenshot, we now have 63 unique recurring jobs that never get enqueued again. The only way I can find to get them running again is to stop the app pool, drop all hangfire.* collections from Mongo, and then start the app pool again. (We add the recurring jobs on startup.)
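
The symptom described above, where a job that is believed to be running is never enqueued again until storage is wiped, can be sketched with a small in-memory model. This is a hypothetical illustration in Python, not the code of SkipWhenPreviousJobIsRunningAttribute or Hangfire.Mongo: the `RecurringJobStore` class, its methods, and the job id `"report"` are all made up for the example.

```python
class RecurringJobStore:
    """Hypothetical in-memory stand-in for job storage; not Hangfire's API."""

    def __init__(self):
        self.running = {}    # job id -> True while a run is believed to be in progress
        self.enqueued = []   # job ids handed to workers

    def try_enqueue(self, job_id):
        # Skip the new run when the previous one is still flagged as running.
        if self.running.get(job_id):
            return False
        self.running[job_id] = True
        self.enqueued.append(job_id)
        return True

    def complete(self, job_id):
        # Only a run that reaches a final state clears the flag.
        self.running[job_id] = False

    def reset(self):
        # Comparable in spirit to dropping the collections and re-adding jobs on startup.
        self.running.clear()
        self.enqueued.clear()


def demo():
    store = RecurringJobStore()
    first = store.try_enqueue("report")    # accepted, flag is set
    # The job is never processed, so complete() is never called.
    second = store.try_enqueue("report")   # skipped, flag is still set
    store.reset()
    third = store.try_enqueue("report")    # accepted again after the reset
    return first, second, third
```

Under these assumptions, a job whose run never finishes blocks every later schedule of the same job, and only clearing the stored state lets it run again.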

odinserj commented 10 months ago

In this case, I may have sent you in the wrong direction with that method and the .NET upgrade, sorry for this.

I think it's better to raise an issue in the Hangfire.Mongo repository and describe the situation, because counters and actual contents should be consistent with each other.
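
The consistency expectation mentioned above can be expressed as a simple check that compares a reported counter with the number of records actually present. The sketch below is a storage-agnostic illustration in Python; `find_counter_drift` and the dictionary shapes are assumptions for the example, not part of Hangfire or any storage package.

```python
def find_counter_drift(counters, records):
    """Return {queue: (reported, actual)} for queues whose counter and record count differ.

    counters: assumed mapping of queue name -> reported enqueued count
    records:  assumed mapping of queue name -> list of job ids actually stored
    """
    drift = {}
    for queue in sorted(set(counters) | set(records)):
        reported = counters.get(queue, 0)
        actual = len(records.get(queue, []))
        if reported != actual:
            drift[queue] = (reported, actual)
    return drift
```

An empty result means the counters and contents agree; any entry points to a queue where the dashboard number would not match what can actually be listed.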

jonathancounihan commented 9 months ago

I have the same issue with the SQL storage - there are always 10 jobs in the counter but nothing is enqueued.

.NET 4.6.1, Hangfire 1.8.6, Hangfire.Core 1.8.6, Hangfire.SqlServer 1.8.6.

image

mwasson74 commented 9 months ago

@jonathancounihan

I am using Hangfire.Mongo, and the owner said that he has found a bug in Hangfire.Mongo and is pretty sure the same would happen with SQL storage, too. https://github.com/gottscj/Hangfire.Mongo/issues/380#issuecomment-1925809164