HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io
Other
9.2k stars 1.68k forks source link

Loss of SQL connection causes Job ID to increase 10k #1995

Open Preferred-Mutual-Insurance-Company opened 2 years ago

Preferred-Mutual-Insurance-Company commented 2 years ago

Background

Once a month we patch our servers. When the SQL Server that hangfire is using becomes unavailable I’ll see errors like:

2022-01-27 08:03:31,014 [ServerHeartbeatProcess #1] WARN LogExtensions.WarnException - Server :3996:8847430c encountered an exception while sending heartbeat System.Data.SqlClient.SqlException (0x80131904): SHUTDOWN is in progress. Login failed for user '’. Cannot continue the execution because the session is in the kill state. A severe error occurred on the current command. The results, if any, should be discarded. Error Number:6005,State:1,Class:14

2022-01-27 08:03:55,793 [Worker #6] ERROR LogExtensions.ErrorException - Execution Worker is in the Failed state now due to an exception, execution will be retried no more than in 00:02:49 System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.

2022-01-27 08:03:59,669 [DelayedJobScheduler #1] ERROR LogExtensions.ErrorException - Execution DelayedJobScheduler is in the Failed state now due to an exception, execution will be retried no more than in 00:00:16 System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) —> System.ComponentModel.Win32Exception (0x80004005): The network path was not found

Eventually followed by:

2022-01-27 08:10:16,843 [CountersAggregator #1] INFO BackgroundExecution.ToRunningState - Execution CountersAggregator recovered from the Faulted state after 00:05:31.0423512 and is in the Running state now

2022-01-27 08:10:16,875 [ServerWatchdog #1] INFO BackgroundExecution.ToRunningState - Execution ServerWatchdog recovered from the Faulted state after 00:05:23.2092033 and is in the Running state now

The system seems to recover fine but my concern is that the Job IDs seem to jump dramatically, about 10k.

28668 → 38640 in one environment image

31242 → 41211 in another environment image

My questions are:

Thank you kindly for any and all feedback.

Technical Details

Packages Used:

Setup:

Configuration (roughly):

config
    .UseSqlServerStorage(connectionString, new SqlServerStorageOptions
    {
        PrepareSchemaIfNecessary = false,
        SchemaName = "task"
    })
    .UseSerializerSettings(new JsonSerializerSettings { TypeNameHandling = TypeNameHandling.All })
    .UseConsole()
    .UseTagsWithSql(new TagsOptions
    {
        TagsListStyle = TagsListStyle.Dropdown
    }, new SqlServerStorageOptions
    {
        SchemaName = "task"
    })
        .UseNinjectActivator(kernel);
Preferred-Mutual-Insurance-Company commented 2 years ago

Thank you to anyone who at least looked at this issue. I believe I've discovered the problem. Here are a few articles on the issue:

I think we'll look to try disabling IDENTITY_CACHE at the database level. Unfortunately other objects exist in this database besides hangfire so they'll be affected by the setting. The last article mentions using a SEQUENCE instead of an auto-increment. Perhaps that might be a consideration in a future version (they've been supported since SQL 2012) so that we could target just the hangfire tables.

I'll report back on the outcome.