HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

Hangfire Load by waits (AAS) #1906

Open dazbradbury opened 3 years ago

dazbradbury commented 3 years ago

Firstly - thanks for an awesome library!

We're seeing an issue with Hangfire suddenly (in the last 24 hours) consuming a large amount of our database CPU.

The queries that seem to be taking over are:

delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (readpast, updlock, rowlock, forceseek)
where Queue in (@queues1) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))

And the very similar:

delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (readpast, updlock, rowlock, forceseek)
where Queue in (@queues1,@queues2) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))

I've upgraded Hangfire.Core and Hangfire.SqlServer from 1.7.15 to 1.7.24, but the issue remains.

Hangfire is running on a single server, with 4 queues and 121 "workers". Split 1/50/50/20.

Even running a select on the JobQueue table seems to take > 10 seconds. Is there a sensible check we can make to work out what's causing this table to suddenly respond so slowly?
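One candidate check (a sketch using sys.dm_db_index_physical_stats; 'SAMPLED' mode populates the ghost-row columns without the cost of 'DETAILED') is to look at the physical state of the JobQueue indexes:

-- Sketch: inspect page, record, and ghost-row counts for the JobQueue indexes.
SELECT
    i.name                          AS index_name,
    ps.page_count,
    ps.record_count,
    ps.ghost_record_count,
    ps.version_ghost_record_count,
    ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'[HangFire].[JobQueue]'), NULL, NULL, 'SAMPLED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id;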

dazbradbury commented 3 years ago

I forgot to say that I'm on schema 7, and this is my configuration if helpful:

GlobalConfiguration.Configuration
               .UseSqlServerStorage(connectionString
               , new Hangfire.SqlServer.SqlServerStorageOptions()
               {
                   UsePageLocksOnDequeue = true,
                   DisableGlobalLocks = true,
                   EnableHeavyMigrations = false,
               }
               )
               .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
               .UseSimpleAssemblyNameTypeSerializer()
               .UseRecommendedSerializerSettings()
               ;

I've also noticed if I run:

SELECT
OBJECT_NAME(p.OBJECT_ID) AS TableName,
resource_type, resource_description
FROM
sys.dm_tran_locks l
JOIN sys.partitions p ON l.resource_associated_entity_id = p.hobt_id

I get 50k+ PAGE locks back.

Running update statistics seemed to reduce load a little, but not fix the issue sadly. Still investigating...
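For reference, a grouped variant of the lock query above (a sketch) makes it easier to see which tables and lock types dominate:

-- Sketch: count current locks per table, resource type and request mode.
SELECT
    OBJECT_NAME(p.object_id) AS table_name,
    l.resource_type,
    l.request_mode,
    COUNT(*)                 AS lock_count
FROM sys.dm_tran_locks AS l
JOIN sys.partitions AS p
    ON l.resource_associated_entity_id = p.hobt_id
GROUP BY OBJECT_NAME(p.object_id), l.resource_type, l.request_mode
ORDER BY lock_count DESC;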

dazbradbury commented 3 years ago

To provide an update, I changed the configuration from what is shown above to what I understand are the recommended values:

GlobalConfiguration.Configuration
   .UseSqlServerStorage(connectionString
   , new Hangfire.SqlServer.SqlServerStorageOptions()
   {
        CommandBatchMaxTimeout = TimeSpan.FromMinutes(5), // default value seems to be null, 5 minutes is the recommended value
        SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5), // default value seems to be null, 5 minutes is the recommended value
        QueuePollInterval = TimeSpan.Zero, // default value seems to be TimeSpan.FromSeconds(15), zero is the recommended value
        UseRecommendedIsolationLevel = true, // default value is false, true is the recommended value (bizarre?)
        // UsePageLocksOnDequeue = true, // default value is false, recommended value is unset
        DisableGlobalLocks = true, // recommended value is true
        EnableHeavyMigrations = false, // means we need to manually migrate schemas. This is fine / sensible, as it can otherwise cause downtime / deadlocks.
   }
   )
   .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
   .UseSimpleAssemblyNameTypeSerializer()
   .UseRecommendedSerializerSettings()
   ;

This immediately stopped the "delete" queries from consuming all our DB CPU time. However, the load seems to have simply shifted to the following SQL:

update top (1) JQ
set FetchedAt = GETUTCDATE()
output INSERTED.Id, INSERTED.JobId, INSERTED.Queue, INSERTED.FetchedAt
from [HangFire].JobQueue JQ with (forceseek, readpast, updlock, rowlock)
where Queue in (@queues1,@queues2) and
(FetchedAt is null or FetchedAt < DATEADD(second, @timeoutSs, GETUTCDATE()));

Overall CPU usage remains the same (i.e. it's still mostly consumed by Hangfire queries on the JobQueue table).

dazbradbury commented 3 years ago

OK - so the issue seemed to be with the JobQueue index, which had a high number of pages and ghost rows:

[screenshot: index statistics for HangFire.JobQueue showing high page and ghost row counts]

Rebuilding the index while Hangfire was running was impossible. However, by turning Hangfire off (i.e. jobs could still be added, but no jobs were being run or fetched from the queue) we could rebuild the index using:

ALTER INDEX ALL ON [HangFire].[JobQueue] REBUILD

Note that reorganising wasn't sufficient and a full rebuild was required.
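For comparison, the reorganise that wasn't sufficient would have been along the following lines (a sketch of the standard command, not necessarily the exact statement we ran):

-- Sketch: a reorganise only compacts and defragments the leaf level and, in this
-- case, did not clear the accumulated ghost pages; a full REBUILD was needed.
ALTER INDEX ALL ON [HangFire].[JobQueue] REORGANIZE;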

Once this index was rebuilt, the issue was fixed almost immediately. Once things were stable, we also noted similarly high ghost/page counts on other Hangfire tables, and subsequently ran:

ALTER INDEX ALL ON [HangFire].[AggregatedCounter] REBUILD
ALTER INDEX ALL ON [HangFire].[Counter] REBUILD
ALTER INDEX ALL ON [HangFire].[Job] REBUILD
ALTER INDEX ALL ON [HangFire].[JobParameter] REBUILD
ALTER INDEX ALL ON [HangFire].[Server] REBUILD
ALTER INDEX ALL ON [HangFire].[Set] REBUILD
ALTER INDEX ALL ON [HangFire].[State] REBUILD

This didn't seem strictly necessary, but it seemed sensible to do while we were at it. Again, with the Hangfire workers all switched off, these rebuilds complete very quickly.

The question that remains is, how do we prevent this from happening again?

odinserj commented 3 years ago

Yes, a high number of ghost rows is likely the reason for the high CPU usage. Due to the nature of the Dequeue method, which is based on a relational table, rows are scanned sequentially. The presence of ghost rows causes SQL Server to scan more rows than required (since they still take up space, they have to be scanned anyway), spending more CPU than necessary.

Those ghost rows are likely there because rows were blocked while the previous fetching implementation was in use (i.e. SlidingInvisibilityTimeout was null), preventing a special background thread in SQL Server from cleaning them up. When you switched to the sliding-invisibility-timeout-based method, the problem of blocked rows/pages was eliminated and the ghost row cleanup is now able to proceed. But since their number was so high, you didn't see any result until the index was rebuilt.

Normally there will be some ghost rows, and that special thread will clean them up every 5 seconds or so. The UsePageLocksOnDequeue option (which only works when SlidingInvisibilityTimeout is set) causes Hangfire.SqlServer to use the PAGLOCK table hint on dequeue to avoid ghost rows entirely, because they are removed instantly when a page lock is acquired in the DELETE statement. But this option causes additional blocking on the JobQueue table, which is why it's not the recommended option.
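For context, a page-lock-based dequeue is roughly of the following shape (a sketch based on the delete statement quoted earlier in this thread, not the library's exact SQL):

-- Sketch: with a PAGLOCK hint, ghost rows on the touched page are cleaned up
-- immediately as part of the delete, at the cost of coarser locking.
-- @queues1 and @timeout stand in for the parameters Hangfire passes at runtime.
delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (paglock, readpast, updlock, forceseek)
where Queue in (@queues1)
  and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()));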

dazbradbury commented 3 years ago

Thanks @odinserj - Will close this issue now, as it seems with the latest recommended configuration the ghost rows no longer accumulate. I'll set up a check to monitor this, and re-open if the issue re-emerges. Thanks for your help!
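A scheduled check along these lines could serve as that monitor (a sketch; the HangFire schema filter and the 100,000 threshold are assumptions to tune):

-- Sketch: flag any Hangfire index whose ghost / version-ghost counts look abnormal.
-- Note: scanning the whole database in 'SAMPLED' mode has some cost of its own.
SELECT
    OBJECT_NAME(ps.object_id) AS table_name,
    i.name                    AS index_name,
    ps.ghost_record_count,
    ps.version_ghost_record_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE OBJECT_SCHEMA_NAME(ps.object_id) = 'HangFire'
  AND ps.ghost_record_count + ps.version_ghost_record_count > 100000;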

dazbradbury commented 2 years ago

Hi @odinserj - It looks like this issue re-emerged after all. This query:

SELECT
OBJECT_NAME(p.OBJECT_ID) AS TableName,
resource_type, resource_description
FROM
sys.dm_tran_locks l
JOIN sys.partitions p ON l.resource_associated_entity_id = p.hobt_id

Shows 134k+ PAGE locks on the JobQueue table. We can also see ghost rows are back:

table_view index_name ghost_record_count version_ghost_record_count alloc_unit_type_desc
HangFire.Counter CX_HangFire_Counter 0 3980000 IN_ROW_DATA
HangFire.AggregatedCounter IX_HangFire_AggregatedCounter_ExpireAt 0 101 IN_ROW_DATA
HangFire.Hash IX_HangFire_Hash_ExpireAt 0 0 IN_ROW_DATA
HangFire.Job IX_HangFire_Job_ExpireAt 0 1517800 IN_ROW_DATA
HangFire.Job IX_HangFire_Job_ExpireAt 0 707400 LOB_DATA
HangFire.Job IX_HangFire_Job_StateName 0 1524800 IN_ROW_DATA
HangFire.Job IX_HangFire_Job_StateName 0 710300 LOB_DATA
HangFire.List IX_HangFire_List_ExpireAt 0 0 IN_ROW_DATA
HangFire.Server IX_HangFire_Server_LastHeartbeat 0 48 IN_ROW_DATA
HangFire.Set IX_HangFire_Set_ExpireAt 0 863434 IN_ROW_DATA
HangFire.Set IX_HangFire_Set_Score 0 863519 IN_ROW_DATA
HangFire.AggregatedCounter PK_HangFire_CounterAggregated 0 101 IN_ROW_DATA
HangFire.Hash PK_HangFire_Hash 0 0 IN_ROW_DATA
HangFire.Job PK_HangFire_Job 0 1520900 IN_ROW_DATA
HangFire.Job PK_HangFire_Job 0 710400 LOB_DATA
HangFire.JobParameter PK_HangFire_JobParameter 0 3125200 IN_ROW_DATA
HangFire.JobQueue PK_HangFire_JobQueue 0 1297600 IN_ROW_DATA
HangFire.List PK_HangFire_List 0 0 IN_ROW_DATA
HangFire.Schema PK_HangFire_Schema 0 0 IN_ROW_DATA
HangFire.Server PK_HangFire_Server 0 48 IN_ROW_DATA
HangFire.Set PK_HangFire_Set 0 863544 IN_ROW_DATA
HangFire.State PK_HangFire_State 0 5862400 IN_ROW_DATA
HangFire.State PK_HangFire_State 0 394500 LOB_DATA

Will run the rebuild command again to reset everything, but it's unclear what caused the issue to re-emerge, as things had seemed stable for a long time following the configuration changes above.

odinserj commented 2 years ago

I tried to search the internet for "version ghost record count" and found this article – https://techcommunity.microsoft.com/t5/sql-server-support-blog/query-performance-drops-due-to-version-ghost-record-count-in/ba-p/3425900. Could you tell me whether you are using AlwaysOn setup?

dazbradbury commented 2 years ago

Nope - we're not using an AlwaysOn setup. Having said that, I'm still trying to determine whether the ghost rows / locks were actually causing any performance issues, so unlike last time we haven't yet definitively linked the DB performance issues to this root cause (this was just one of the checks we wanted to rule out).

odinserj commented 2 years ago

A high number of "version ghost records" is the main symptom, I believe. They can also come from snapshot isolation. Did you enable it for your database, are you using the READ_COMMITTED_SNAPSHOT ON setting, or are you running on SQL Azure?
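For reference, these settings can be checked with something like the following (a sketch against the current database):

-- Sketch: check snapshot-related settings for the current database.
SELECT
    name,
    is_read_committed_snapshot_on,
    snapshot_isolation_state_desc
FROM sys.databases
WHERE name = DB_NAME();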

odinserj commented 2 years ago

Also found this one with some links: https://michaeljswart.com/2017/08/problem-with-too-many-version_ghost_records/

dazbradbury commented 2 years ago

A high number of "version ghost records" is the main symptom, I believe. They can also come from snapshot isolation. Did you enable it for your database, are you using the READ_COMMITTED_SNAPSHOT ON setting, or are you running on SQL Azure?

Yes, we do use read committed snapshot on. It seems the busiest query in the whole DB is now:

update top (1) JQ
set FetchedAt = GETUTCDATE()
output INSERTED.Id, INSERTED.JobId, INSERTED.Queue, INSERTED.FetchedAt
from [HangFire].JobQueue JQ with (forceseek, readpast, updlock, rowlock)
where Queue in (@queues1) and
(FetchedAt is null or FetchedAt < DATEADD(second, @timeoutSs, GETUTCDATE()));
IF @@ROWCOUNT > 0 RETURN;
WAITFOR DELAY @delay;
END

odinserj commented 2 years ago

The SCAN operator used for that query (required by the nature of the job queue) causes SQL Server to acquire a row lock for each of those ghost rows every time the query is executed. I don't know why it acquires locks for ghost rows, but I profiled SQL Server some years ago and this is what the profiler showed. CPU consumption is high because each time the query is executed there are 1,297,600 (taking the ghost record count from the table above) or more lock acquires and releases. So we need to understand why SQL Server didn't delete those ghost rows in time.
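One way to investigate that (a sketch; sys.dm_tran_active_snapshot_database_transactions only returns rows while row versioning is in use) is to look for long-running transactions that keep old row versions, and therefore version ghost records, alive:

-- Sketch: find long-running transactions that hold the version store open,
-- ordered by how long they have been running.
SELECT
    t.session_id,
    t.transaction_id,
    t.elapsed_time_seconds,
    s.login_name,
    s.program_name
FROM sys.dm_tran_active_snapshot_database_transactions AS t
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = t.session_id
ORDER BY t.elapsed_time_seconds DESC;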
