HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

Distributed Lock Timeout Exception - Timeout expired #1799

Open pradeepm1207 opened 3 years ago

pradeepm1207 commented 3 years ago

Hello Everyone, I have a DistributedLockTimeoutException. Below are the details.

Issue: When a Hangfire job is still running and the server running it goes down (because of a deployment, for example) before the job completes, new jobs that start after the server comes back up cannot acquire a lock to execute, because the lock is still held by the old job that was running before the server went down. The job runs every minute. These are recurring jobs, and the Hangfire setup is below:

    services.AddHangfire(config => config
        .UseSerilogLogProvider()
        .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
        .UseSimpleAssemblyNameTypeSerializer()
        .UseRecommendedSerializerSettings()
        .UsePostgreSqlStorage(hangfireConnectionString, new PostgreSqlStorageOptions
        {
            DistributedLockTimeout = TimeSpan.FromMinutes(2),
            PrepareSchemaIfNecessary = true
        }));

Here is the error I got when there was a deployment:

    Hangfire.PostgreSql.PostgreSqlDistributedLockException: Could not place a lock on the resource 'HangFire:Processor.ExecuteAsync-1/1': Lock timeout.
       at Hangfire.PostgreSql.PostgreSqlDistributedLock.PostgreSqlDistributedLock_Init_Transaction(String resource, TimeSpan timeout, IDbConnection connection, PostgreSqlStorageOptions options)
       at Hangfire.PostgreSql.PostgreSqlDistributedLock..ctor(String resource, TimeSpan timeout, IDbConnection connection, PostgreSqlStorageOptions options)
       at Hangfire.PostgreSql.PostgreSqlConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
       at Hangfire.MaximumConcurrentExecutionsAttribute.OnPerforming(PerformingContext filterContext)
       at Hangfire.Server.BackgroundJobPerformer.InvokeOnPerforming(Tuple`2 x)
       at Hangfire.Profiling.ProfilerExtensions.InvokeAction[TInstance](InstanceAction`1 tuple)
       at Hangfire.Profiling.SlowLogProfiler.InvokeMeasured[TInstance,TResult](TInstance instance, Func`2 action, String message)
       at Hangfire.Profiling.ProfilerExtensions.InvokeMeasured[TInstance](IProfiler profiler, TInstance instance, Action`1 action, String message)
       at Hangfire.Server.BackgroundJobPerformer.InvokePerformFilter(IServerFilter filter, PerformingContext preContext, Func`1 continuation)

After deployment, the new jobs fail with the following error:

    Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'Processor.ExecuteAsync' resource.
       at Hangfire.MaximumConcurrentExecutionsAttribute.OnPerforming(PerformingContext filterContext)
       at Hangfire.Server.BackgroundJobPerformer.InvokeOnPerforming(Tuple`2 x)
       at Hangfire.Profiling.ProfilerExtensions.InvokeAction[TInstance](InstanceAction`1 tuple)
       at Hangfire.Profiling.SlowLogProfiler.InvokeMeasured[TInstance,TResult](TInstance instance, Func`2 action, String message)
       at Hangfire.Profiling.ProfilerExtensions.InvokeMeasured[TInstance](IProfiler profiler, TInstance instance, Action`1 action, String message)
       at Hangfire.Server.BackgroundJobPerformer.InvokePerformFilter(IServerFilter filter, PerformingContext preContext, Func`1 continuation)

Additional info: I am using the MaximumConcurrentExecutions Hangfire extension. Below are the packages I am using:

- Hangfire.AspNetCore, Version="1.7.18"
- Hangfire.MaximumConcurrentExecutions, Version="1.1.0"
- Hangfire.PostgreSql, Version="1.8.1"

Any help is really appreciated. Thank you in advance.

Frazi1 commented 3 years ago

Hello @pradeepm1207

This issue is more related to Hangfire.Postgresql storage, not Hangfire itself.

This happens because of how distributed locks are implemented in Hangfire.Postgresql. They use a separate database table called lock. When a recurring job starts, Hangfire puts a record into the lock table. If the server stops non-gracefully, the lock is not released.

Hangfire.Postgresql locks automatically expire after the configurable timeout (default is 10 minutes).

I see the following options for you:

1. If your recurring jobs are fast, try shutting the server down gracefully so that it finishes any in-progress jobs:

        services.AddHangfireServer(options => options.ShutdownTimeout = TimeSpan.FromMinutes(5));

   Also, make sure your environment is configured with the same or greater termination timeout.
2. Decrease the lock timeout from 10 minutes to, say, 2 minutes, depending on how long your jobs take to complete. Don't make the timeout shorter than the job execution time, or else Hangfire will end up executing the same job multiple times, I think. That should expire the locks before you make a new deployment:

        services.AddHangfire(configuration => configuration.UsePostgreSqlStorage("", new PostgreSqlStorageOptions()
        {
            DistributedLockTimeout = TimeSpan.FromMinutes(2)
        }));

3. Manually remove the locks from the lock table.
4. Switch to Hangfire.SqlServer. It handles locks differently. Such a situation should never happen with SqlServer.
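For option 1, the ASP.NET Core host has its own shutdown timeout, separate from Hangfire's. The following is a minimal sketch, assuming the generic host (`HostOptions` from Microsoft.Extensions.Hosting) and reusing the five-minute value from the example above. The `AddGracefulHangfireShutdown` helper name is hypothetical, and the grace period of whatever stops the process (IIS, a container orchestrator, and so on) is a separate setting not shown here.

```csharp
using System;
using Hangfire;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

public static class ShutdownConfiguration
{
    // Hypothetical helper; call it from ConfigureServices.
    public static void AddGracefulHangfireShutdown(this IServiceCollection services)
    {
        var shutdownTimeout = TimeSpan.FromMinutes(5);

        // Give Hangfire workers time to finish in-progress jobs.
        services.AddHangfireServer(options => options.ShutdownTimeout = shutdownTimeout);

        // The host should wait at least as long, or it gives up waiting first.
        services.Configure<HostOptions>(options => options.ShutdownTimeout = shutdownTimeout);
    }
}
```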

pradeepm1207 commented 3 years ago

Hello @Frazi1

Thank you for your quick response. Currently I have DistributedLockTimeout set to 2 minutes (option 2, as you suggested), but I am still seeing the issue I described in the main post after deployment. The recurring jobs fail after deployment even when no lock is present in the lock table.

Option 4 is really not a choice for us. I will look into other options.

udlose commented 2 years ago

@Frazi1

> 4. Switch to Hangfire.SqlServer. It handles locks differently. Such a situation should never happen with SqlServer.

We are using Hangfire.SqlServer and we are experiencing the same issue: Hangfire.Storage.DistributedLockTimeoutException: 'Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'HangFire:lock:recurring-job:LocationUpdater' resource.'

mohd786hussain commented 1 year ago

We get Hangfire.Storage.DistributedLockTimeoutException even with SqlServer. What is the solution for that?

kethahel99 commented 1 year ago

We are using Hangfire.SqlServer and we are experiencing the same issue:

Hangfire.Storage.DistributedLockTimeoutException: 'Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'HangFire:lock:recurring-job:LocationUpdater' resource.' 

Any feedback on this issue?

pmcfernandes commented 1 year ago

Drop the database and create a blank one.

jvmlet commented 11 months ago

Suffering from the same issue with MsSql.

Executing

    select *
    from sys.dm_exec_sessions
    where session_id in (
        select request_session_id
        from sys.dm_tran_locks
        where resource_type = 'APPLICATION'
    )

shows the session that acquired the lock in sleeping status. @odinserj, would you please have a look?

ERobishaw commented 11 months ago

Also experiencing this with SQL Server:

Note, there are 2 server nodes... the job mentioned in the exception is a very long running job (90 minutes), but this is happening after only a few minutes...

Only started doing this after setting up IIS to ensure the app is always running, as per: https://docs.hangfire.io/en/latest/deployment-to-production/making-aspnet-app-always-running.html

This is an ASP.NET Core application.

NOTE: the process itself is actually still running... I can see by the logs, that it's continuing to operate as expected. This is only an issue with the dashboard evidently.

BUT, when the lock expires... it kills the process and re-schedules it.

    Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'HangFire:IntegrationRunnerSPMJobs.RunDeleteScenariosJob' resource.
       at Hangfire.SqlServer.SqlServerDistributedLock.Acquire(DbConnection connection, String resource, TimeSpan timeout)
       at Hangfire.SqlServer.SqlServerConnection.AcquireLock(String resource, TimeSpan timeout)
       at Hangfire.DisableConcurrentExecutionAttribute.OnPerforming(PerformingContext filterContext)
       at Hangfire.Server.BackgroundJobPerformer.InvokeOnPerforming(Tuple`2 x)
       at Hangfire.Profiling.ProfilerExtensions.InvokeAction[TInstance](InstanceAction`1 tuple)
       at Hangfire.Profiling.SlowLogProfiler.InvokeMeasured[TInstance,TResult](TInstance instance, Func`2 action, String message)
       at Hangfire.Profiling.ProfilerExtensions.InvokeMeasured[TInstance](IProfiler profiler, TInstance instance, Action`1 action, String message)
       at Hangfire.Server.BackgroundJobPerformer.InvokePerformFilter(IServerFilter filter, PerformingContext preContext, Func`1 continuation)
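The trace above passes through `DisableConcurrentExecutionAttribute.OnPerforming`, so the timeout that expires is the one given to that attribute. That value is how long a competing invocation waits to acquire the lock, not how long the job itself may run. A sketch of what such a job might look like, using the class and method names from the trace; the method body and the timeout value are illustrative assumptions, not the poster's actual code.

```csharp
using Hangfire;

public class IntegrationRunnerSPMJobs
{
    // timeoutInSeconds is the time a competing invocation waits for the lock
    // before DistributedLockTimeoutException is thrown (illustrative value).
    [DisableConcurrentExecution(timeoutInSeconds: 60)]
    public void RunDeleteScenariosJob()
    {
        // Hypothetical long-running work goes here.
    }
}
```

Under these assumptions, a second invocation that starts while a 90-minute run is still in progress would be expected to fail with this exception after about a minute, even though the first run continues normally.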

jvmlet commented 11 months ago

In my case it was my mistake: the job that acquired the lock entered an endless loop (a job logic issue). To everyone suffering from this issue, I suggest trying the latest version of SQL Server.
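A job that loops without ever checking for cancellation keeps its lock for as long as it runs. The following is a minimal sketch of a cancellation-aware job, assuming the `Processor.ExecuteAsync` name from the original post; `TryGetNextItem`, `HandleAsync`, the `JobRegistration` class, and the `"processor"` job id are hypothetical placeholders. Hangfire supplies the real token for a `CancellationToken` parameter when it runs the job. This does not fix a logic bug in the loop itself, but it lets a server shutdown interrupt the job so the lock can be released.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Hangfire;

public class Processor
{
    public async Task ExecuteAsync(CancellationToken cancellationToken)
    {
        while (TryGetNextItem(out var item))
        {
            // Stop promptly when the server is shutting down.
            cancellationToken.ThrowIfCancellationRequested();
            await HandleAsync(item, cancellationToken);
        }
    }

    // Hypothetical placeholders for the job's real logic.
    private bool TryGetNextItem(out string item) { item = null; return false; }
    private Task HandleAsync(string item, CancellationToken ct) => Task.CompletedTask;
}

public static class JobRegistration
{
    public static void Register()
    {
        // CancellationToken.None is only a placeholder in the expression;
        // Hangfire substitutes its own token at run time.
        RecurringJob.AddOrUpdate<Processor>(
            "processor", p => p.ExecuteAsync(CancellationToken.None), Cron.Minutely());
    }
}
```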