ahydrax / Hangfire.PostgreSql

Alternative PostgreSql Storage Provider for Hangfire
https://www.nuget.org/packages/Hangfire.PostgreSql.ahydrax/

DisableConcurrentExecutionAttribute doesn't work for long-running jobs #23

Closed: jessemcdowell-AI closed this issue 3 years ago

jessemcdowell-AI commented 3 years ago

I have a recurring job that runs every 5 minutes. It usually completes within a minute, but occasionally it runs longer than 5 minutes, and in very rare situations it can run for as long as 10 days. It's also important that no more than one instance of the job runs in parallel, so I'm using the DisableConcurrentExecutionAttribute. Because of how distributed locks are implemented, I cannot prevent parallel execution when the job runs longer than the DistributedLockTimeout. I don't want to set DistributedLockTimeout to a huge value either, because then my recurring job might not run after a server failure.
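
For reference, a minimal sketch of the setup described above; the job name, schedule, and timeout value are illustrative, not taken from the report:

```csharp
using System;
using Hangfire;

public class SyncJob
{
    // DisableConcurrentExecution acquires a distributed lock named after the
    // job's type and method. The timeoutInSeconds here is how long a second
    // worker waits to acquire the lock; the storage-level DistributedLockTimeout
    // (a PostgreSql provider option) is what caps how long the lock itself is
    // honored, which is the limit this issue runs into.
    [DisableConcurrentExecution(timeoutInSeconds: 60)]
    public void Run()
    {
        // Usually finishes within a minute, occasionally runs much longer.
    }
}

public static class Schedule
{
    public static void Register()
    {
        // Recurring job fired every 5 minutes, as in the report.
        RecurringJob.AddOrUpdate<SyncJob>("sync-job", x => x.Run(), "*/5 * * * *");
    }
}
```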

For comparison, the MS SQL implementation uses the built-in sp_getapplock, which doesn't seem to have this problem. I've successfully held locks for 30 minutes, and they are released instantly when the acquiring application is killed.
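
A hedged illustration of that SQL Server behavior: an application lock taken with @LockOwner = 'Session' is held for the life of the connection, and the server releases it the moment the owning session dies, so no timeout tuning is needed. The connection string and resource name below are placeholders:

```csharp
using System;
using Microsoft.Data.SqlClient;

class AppLockDemo
{
    static void Main()
    {
        using var conn = new SqlConnection("Server=.;Database=Hangfire;Integrated Security=true");
        conn.Open();

        using var cmd = new SqlCommand("sp_getapplock", conn)
        {
            CommandType = System.Data.CommandType.StoredProcedure
        };
        cmd.Parameters.AddWithValue("@Resource", "my-job-lock");
        cmd.Parameters.AddWithValue("@LockMode", "Exclusive");
        cmd.Parameters.AddWithValue("@LockOwner", "Session"); // held until the session ends
        cmd.Parameters.AddWithValue("@LockTimeout", 0);       // don't wait if already held
        var result = cmd.Parameters.Add("@Result", System.Data.SqlDbType.Int);
        result.Direction = System.Data.ParameterDirection.ReturnValue;
        cmd.ExecuteNonQuery();

        // A return value >= 0 means the lock was granted. If this process is
        // killed, SQL Server frees the lock as soon as the connection drops.
        Console.WriteLine(result.Value);
    }
}
```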

It should be possible to add some kind of refresh that renews open locks for as long as the owning server is still alive.
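
One possible shape of that refresh, assuming the lock table matches the standard Hangfire.PostgreSql schema (a hangfire.lock table with resource and acquired columns); this is a sketch of the idea, not the provider's API:

```csharp
using System;
using System.Threading;
using Npgsql;

sealed class LockKeepAlive : IDisposable
{
    private readonly Timer _timer;

    public LockKeepAlive(string connectionString, string resource, TimeSpan interval)
    {
        // Periodically bump "acquired" so the lock never looks older than
        // DistributedLockTimeout while this process is alive. If the process
        // dies, the refresh stops and the lock expires normally.
        _timer = new Timer(_ =>
        {
            using var conn = new NpgsqlConnection(connectionString);
            conn.Open();
            using var cmd = new NpgsqlCommand(
                @"UPDATE hangfire.lock SET acquired = now() AT TIME ZONE 'utc'
                  WHERE resource = @resource", conn);
            cmd.Parameters.AddWithValue("resource", resource);
            cmd.ExecuteNonQuery();
        }, null, interval, interval);
    }

    public void Dispose() => _timer.Dispose();
}
```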

Alternatively, the value of server.id could be included in the lock table, and locks could be removed automatically when a server times out. In that scenario I could safely use a huge lock timeout.
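
A sketch of how that sweep might look. The serverid column on the lock table is the hypothetical addition being proposed here; the hangfire.server table and its lastheartbeat column follow the standard Hangfire.PostgreSql schema:

```csharp
using System;
using Npgsql;

static class OrphanedLockSweeper
{
    // Delete any lock whose owning server has not sent a heartbeat within
    // the server timeout, so a crashed server cannot hold a lock forever.
    public static int Sweep(string connectionString, TimeSpan serverTimeout)
    {
        using var conn = new NpgsqlConnection(connectionString);
        conn.Open();
        using var cmd = new NpgsqlCommand(
            @"DELETE FROM hangfire.lock l
              WHERE NOT EXISTS (
                  SELECT 1 FROM hangfire.server s
                  WHERE s.id = l.serverid
                    AND s.lastheartbeat > now() AT TIME ZONE 'utc' - @timeout
              )", conn);
        cmd.Parameters.AddWithValue("timeout", serverTimeout);
        return cmd.ExecuteNonQuery();
    }
}
```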

ahydrax commented 3 years ago

Hi @jessemcdowell-AI, thanks for noticing that! I've already planned long-living locks on the roadmap: https://github.com/ahydrax/Hangfire.PostgreSql/projects/1#card-25353357

ahydrax commented 3 years ago

I like the idea of including the server id in the lock, to make sure a lock is safe to remove after a timeout.

ahydrax commented 3 years ago

Fixed in https://www.nuget.org/packages/Hangfire.PostgreSql.ahydrax/1.7.4 /cc @jessemcdowell-AI