HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io
Other
9.43k stars 1.71k forks source link

KeyNotFoundException when retrieving recurring jobs #2412

Closed redstarty closed 5 months ago

redstarty commented 5 months ago

Hey there!

We use Hangfire with SQL server and the following code to check if the job exists:

var job = JobStorage.Current.GetConnection().GetRecurringJobs(new[] { id }).FirstOrDefault();

It seems to be randomly failing with

System.Collections.Generic.KeyNotFoundException: The given key 'Cron' was not present in the dictionary.
   at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
   at Hangfire.Storage.StorageConnectionExtensions.GetRecurringJobDtos(IStorageConnection connection, IEnumerable`1 ids) in C:\projects\hangfire-525\src\Hangfire.Core\Storage\StorageConnectionExtensions.cs:line 103

In the StorageConnectionExtensions I see the following lines:

var hash = connection.GetAllEntriesFromHash($"recurring-job:{id}");

// TODO: Remove this in 2.0 (breaking change)
if (hash == null)
{
    result.Add(new RecurringJobDto { Id = id, Removed = true });
    continue;
}

var dto = new RecurringJobDto
{
    Id = id,
    Cron = hash["Cron"]
};

So the question is how could it be that for the existing job (since hash is not null, it exists, right?) Cron is not saved?

Job are created using IRecurringJobManager's AddOrUpdateDynamic method where cron is always provided. Then the job could be deleted using RemoveIfExists method.

All comments and ideas are appreciated!

odinserj commented 5 months ago

Hi @redstarty! This is a very great question, because the Cron field is a mandatory field. Can you send me the output of the following SQL query, where {id} is the identifier of your recurring job in question?

select Field, Value from HangFire.Hash where [Key] = N'recurring-job:{id}'

Nevertheless, I have just committed a change to avoid throwing the KeyNotFoundException in this case.

redstarty commented 5 months ago

Hey @odinserj, thank you for a quick reply!

It seems that each problematic job (I have around 50 of them, so it's not some random glitch), has the following single entry to the Hash table: Field Value
Running no
odinserj commented 5 months ago

Is it possible that someone removed those entries manually, bypassing Hangfire APIs? Updates to recurring jobs are synchronized and performed in a transaction. Also, what version of Hangfire.Core you are using?

redstarty commented 5 months ago

I am using 1.8.12 version. And manual changes to this database are out of question, it's prod data and even querying them like that makes my palms a bit sweaty 😄

odinserj commented 5 months ago

Looks like there are some filters for recurring job methods, and looks like it's the one related to disabling multiple executions. Am I right?

redstarty commented 5 months ago

Ah, you are completely right, we are using SkipWhenPreviousJobIsRunningAttribute provided by you here

marsel-mo commented 5 months ago

After investigation, I found out that our customer SkipWhenPreviousJobIsRunningAttribute will trigger OnStateApplied method after the job has been deleted.

The reason why it is doing that is this scenario : The job will be triggered for running, during the running time and still not completed we delete the job! On time the job is completed method OnStateApplied will be triggered even if the job is deleted and it will create a new row in the table hash.

If we want to keep the custom attribute SkipWhenPreviousJobIsRunningAttribute
Before adding the new row to hash table check if the job is deleted or not

var job = JobStorage.Current.GetConnection().GetRecurringJobs(new[] { recurringJobId }).FirstOrDefault();
 if(job is { Removed: true})  return;

 transaction.SetRangeInHash(
     $"recurring-job:{recurringJobId}",
     new[] { new KeyValuePair<string, string>(RunningKey, "no") });
odinserj commented 5 months ago

You are correct, there's a race condition that can cause this. I already pushed a fix to avoid throwing in the Dashboard UI, so with Hangfire 1.8.13 it will be just possible to remove such recurring job. As for the filter, Hangfire 1.8 allows to acquire transaction-level locks, so it's possible to modify this filter and avoid race condition by locking a recurring job entity, so just need to fix it.