HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

Hangfire Load by waits (AAS) #1906

Open dazbradbury opened 3 years ago

dazbradbury commented 3 years ago

Firstly - thanks for an awesome library!

We're seeing an issue with Hangfire suddenly (in the last 24 hours) consuming a large amount of our database CPU.

The queries that seem to be taking over are:

delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (readpast, updlock, rowlock, forceseek)
where Queue in (@queues1) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))

And the very similar:

delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (readpast, updlock, rowlock, forceseek)
where Queue in (@queues1,@queues2) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))

I've upgraded Hangfire.Core and Hangfire.SqlServer from 1.7.15 to 1.7.24, but the issue remains.

Hangfire is running on a single server, with 4 queues and 121 "workers". Split 1/50/50/20.

Even running a select on the JobQueue table seems to take > 10 seconds. Is there a sensible check we can make to work out what's causing this table to suddenly respond so slowly?
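One candidate check (a sketch using sys.dm_db_index_physical_stats; 'SAMPLED' mode populates the ghost-row columns without the cost of 'DETAILED') is to look at the physical state of the JobQueue indexes:

-- Sketch: inspect page, record, and ghost-row counts for the JobQueue indexes.
SELECT
    i.name                          AS index_name,
    ps.page_count,
    ps.record_count,
    ps.ghost_record_count,
    ps.version_ghost_record_count,
    ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'[HangFire].[JobQueue]'), NULL, NULL, 'SAMPLED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id;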

dazbradbury commented 3 years ago

I forgot to say that I'm on schema 7, and this is my configuration if helpful:

GlobalConfiguration.Configuration
               .UseSqlServerStorage(connectionString
               , new Hangfire.SqlServer.SqlServerStorageOptions()
               {
                   UsePageLocksOnDequeue = true,
                   DisableGlobalLocks = true,
                   EnableHeavyMigrations = false,
               }
               )
               .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
               .UseSimpleAssemblyNameTypeSerializer()
               .UseRecommendedSerializerSettings()
               ;

I've also noticed if I run:

SELECT
OBJECT_NAME(p.OBJECT_ID) AS TableName,
resource_type, resource_description
FROM
sys.dm_tran_locks l
JOIN sys.partitions p ON l.resource_associated_entity_id = p.hobt_id

I get 50k+ PAGE locks back.

Running update statistics seemed to reduce load a little, but not fix the issue sadly. Still investigating...
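For reference, a grouped variant of the lock query above (a sketch) makes it easier to see which tables and lock types dominate:

-- Sketch: count current locks per table, resource type and request mode.
SELECT
    OBJECT_NAME(p.object_id) AS table_name,
    l.resource_type,
    l.request_mode,
    COUNT(*)                 AS lock_count
FROM sys.dm_tran_locks AS l
JOIN sys.partitions AS p
    ON l.resource_associated_entity_id = p.hobt_id
GROUP BY OBJECT_NAME(p.object_id), l.resource_type, l.request_mode
ORDER BY lock_count DESC;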

dazbradbury commented 3 years ago

To provide an update, I changed the configuration from what is shown above to what I understand are the recommended values:

GlobalConfiguration.Configuration
   .UseSqlServerStorage(connectionString
   , new Hangfire.SqlServer.SqlServerStorageOptions()
   {
        CommandBatchMaxTimeout = TimeSpan.FromMinutes(5), // default value seems to be null, 5 minutes is the recommended value
        SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5), // default value seems to be null, 5 minutes is the recommended value
        QueuePollInterval = TimeSpan.Zero, // default value seems to be TimeSpan.FromSeconds(15), zero is the recommended value
        UseRecommendedIsolationLevel = true, // default value is false, true is the recommended value (bizarre?)
        // UsePageLocksOnDequeue = true, // default value is false, recommended value is unset
        DisableGlobalLocks = true, // recommended value is true
        EnableHeavyMigrations = false, // means we need to manually migrate schemas. This is fine / sensible, as it can otherwise cause downtime / deadlocks.
   }
   )
   .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
   .UseSimpleAssemblyNameTypeSerializer()
   .UseRecommendedSerializerSettings()
   ;

This immediately stopped the "delete" queries from consuming all our DB CPU time. However, the load seems to have simply shifted to the following SQL:

update top (1) JQ
set FetchedAt = GETUTCDATE()
output INSERTED.Id, INSERTED.JobId, INSERTED.Queue, INSERTED.FetchedAt
from [HangFire].JobQueue JQ with (forceseek, readpast, updlock, rowlock)
where Queue in (@queues1,@queues2) and
(FetchedAt is null or FetchedAt < DATEADD(second, @timeoutSs, GETUTCDATE()));

Overall CPU usage remains the same (i.e. it's still mostly consumed by Hangfire queries on the JobQueue table).

dazbradbury commented 3 years ago

OK - so the issue seemed to be with the JobQueue index, which had a high number of pages and ghost rows:

[screenshot: index statistics for HangFire.JobQueue showing high page and ghost row counts]

Rebuilding the index while Hangfire was running was impossible. However, by turning Hangfire off (i.e. jobs could still be added, but no jobs were being run or fetched from the queue) we could rebuild the index using:

ALTER INDEX ALL ON [HangFire].[JobQueue] REBUILD

Note that reorganising wasn't sufficient and a full rebuild was required.
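For comparison, the reorganise that wasn't sufficient would have been along the following lines (a sketch of the standard command, not necessarily the exact statement we ran):

-- Sketch: a reorganise only compacts and defragments the leaf level and, in this
-- case, did not clear the accumulated ghost pages; a full REBUILD was needed.
ALTER INDEX ALL ON [HangFire].[JobQueue] REORGANIZE;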

Once this index was rebuilt, the issue was fixed almost immediately. Once things were stable, we also noted similarly high ghost/page counts on other Hangfire tables, and subsequently ran:

ALTER INDEX ALL ON [HangFire].[AggregatedCounter] REBUILD
ALTER INDEX ALL ON [HangFire].[Counter] REBUILD
ALTER INDEX ALL ON [HangFire].[Job] REBUILD
ALTER INDEX ALL ON [HangFire].[JobParameter] REBUILD
ALTER INDEX ALL ON [HangFire].[Server] REBUILD
ALTER INDEX ALL ON [HangFire].[Set] REBUILD
ALTER INDEX ALL ON [HangFire].[State] REBUILD

This didn't seem strictly necessary, but it seemed sensible to do while we were at it. Again, with the Hangfire workers all switched off, these rebuilds complete very quickly.

The question that remains is, how do we prevent this from happening again?

odinserj commented 3 years ago

Yes, a high number of ghost rows is likely the reason for the high CPU usage. Due to the nature of the Dequeue method, which is based on a relational table, rows are scanned sequentially. The presence of ghost rows causes SQL Server to scan more rows than required (since they still take up space, they have to be scanned anyway), spending more CPU than necessary.

Those ghost rows are likely there because rows were blocked while the previous fetching implementation was in use (i.e. SlidingInvisibilityTimeout was null), preventing a special background thread in SQL Server from cleaning them up. When you switched to the sliding-invisibility-timeout-based method, the problem of blocked rows/pages was eliminated and the ghost row cleanup is now able to proceed. But since their number was so high, you didn't see any result until the index was rebuilt.

Normally there will be some ghost rows, and that special thread will clean them up every 5 seconds or so. The UsePageLocksOnDequeue option (which only works when SlidingInvisibilityTimeout is set) causes Hangfire.SqlServer to use the PAGLOCK table hint on dequeue to avoid ghost rows entirely, because they are removed instantly when a page lock is acquired in the DELETE statement. But this option causes additional blocking on the JobQueue table, which is why it's not the recommended option.
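For context, a page-lock-based dequeue is roughly of the following shape (a sketch based on the delete statement quoted earlier in this thread, not the library's exact SQL):

-- Sketch: with a PAGLOCK hint, ghost rows on the touched page are cleaned up
-- immediately as part of the delete, at the cost of coarser locking.
-- @queues1 and @timeout stand in for the parameters Hangfire passes at runtime.
delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (paglock, readpast, updlock, forceseek)
where Queue in (@queues1)
  and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()));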

dazbradbury commented 3 years ago

Thanks @odinserj - Will close this issue now, as it seems with the latest recommended configuration the ghost rows no longer accumulate. I'll set up a check to monitor this, and re-open if the issue re-emerges. Thanks for your help!
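A scheduled check along these lines could serve as that monitor (a sketch; the HangFire schema filter and the 100,000 threshold are assumptions to tune):

-- Sketch: flag any Hangfire index whose ghost / version-ghost counts look abnormal.
-- Note: scanning the whole database in 'SAMPLED' mode has some cost of its own.
SELECT
    OBJECT_NAME(ps.object_id) AS table_name,
    i.name                    AS index_name,
    ps.ghost_record_count,
    ps.version_ghost_record_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE OBJECT_SCHEMA_NAME(ps.object_id) = 'HangFire'
  AND ps.ghost_record_count + ps.version_ghost_record_count > 100000;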

dazbradbury commented 2 years ago

Hi @odinserj - It looks like this issue re-emerged after all. This query:

SELECT
OBJECT_NAME(p.OBJECT_ID) AS TableName,
resource_type, resource_description
FROM
sys.dm_tran_locks l
JOIN sys.partitions p ON l.resource_associated_entity_id = p.hobt_id

Shows 134k+ PAGE locks on the JobQueue table. We can also see ghost rows are back:

table_view index_name ghost_record_count version_ghost_record_count alloc_unit_type_desc
HangFire.Counter CX_HangFire_Counter 0 3980000 IN_ROW_DATA
HangFire.AggregatedCounter IX_HangFire_AggregatedCounter_ExpireAt 0 101 IN_ROW_DATA
HangFire.Hash IX_HangFire_Hash_ExpireAt 0 0 IN_ROW_DATA
HangFire.Job IX_HangFire_Job_ExpireAt 0 1517800 IN_ROW_DATA
HangFire.Job IX_HangFire_Job_ExpireAt 0 707400 LOB_DATA
HangFire.Job IX_HangFire_Job_StateName 0 1524800 IN_ROW_DATA
HangFire.Job IX_HangFire_Job_StateName 0 710300 LOB_DATA
HangFire.List IX_HangFire_List_ExpireAt 0 0 IN_ROW_DATA
HangFire.Server IX_HangFire_Server_LastHeartbeat 0 48 IN_ROW_DATA
HangFire.Set IX_HangFire_Set_ExpireAt 0 863434 IN_ROW_DATA
HangFire.Set IX_HangFire_Set_Score 0 863519 IN_ROW_DATA
HangFire.AggregatedCounter PK_HangFire_CounterAggregated 0 101 IN_ROW_DATA
HangFire.Hash PK_HangFire_Hash 0 0 IN_ROW_DATA
HangFire.Job PK_HangFire_Job 0 1520900 IN_ROW_DATA
HangFire.Job PK_HangFire_Job 0 710400 LOB_DATA
HangFire.JobParameter PK_HangFire_JobParameter 0 3125200 IN_ROW_DATA
HangFire.JobQueue PK_HangFire_JobQueue 0 1297600 IN_ROW_DATA
HangFire.List PK_HangFire_List 0 0 IN_ROW_DATA
HangFire.Schema PK_HangFire_Schema 0 0 IN_ROW_DATA
HangFire.Server PK_HangFire_Server 0 48 IN_ROW_DATA
HangFire.Set PK_HangFire_Set 0 863544 IN_ROW_DATA
HangFire.State PK_HangFire_State 0 5862400 IN_ROW_DATA
HangFire.State PK_HangFire_State 0 394500 LOB_DATA

Will run the rebuild command again to reset everything, but it's unclear what caused the issue to re-emerge, as things had seemed stable for a long time following the configuration changes above.

odinserj commented 2 years ago

I tried to search the internet for "version ghost record count" and found this article – https://techcommunity.microsoft.com/t5/sql-server-support-blog/query-performance-drops-due-to-version-ghost-record-count-in/ba-p/3425900. Could you tell me whether you are using AlwaysOn setup?

dazbradbury commented 2 years ago

Nope - we're not using an AlwaysOn setup. Having said that, I'm still trying to determine whether the ghost rows / locks were actually causing any performance issues, so unlike last time we haven't yet definitively linked the DB performance issues to this root cause (this was just one of the checks we wanted to rule out).

odinserj commented 2 years ago

A high number of "version ghost records" is the main symptom, I believe. They can also come from snapshot isolation. Did you enable it for your database, are you using the READ_COMMITTED_SNAPSHOT ON setting, or are you running on SQL Azure?
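For reference, these settings can be checked with something like the following (a sketch against the current database):

-- Sketch: check snapshot-related settings for the current database.
SELECT
    name,
    is_read_committed_snapshot_on,
    snapshot_isolation_state_desc
FROM sys.databases
WHERE name = DB_NAME();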

odinserj commented 2 years ago

Also found this one with some links: https://michaeljswart.com/2017/08/problem-with-too-many-version_ghost_records/

dazbradbury commented 2 years ago

A high number of "version ghost records" is the main symptom, I believe. They can also come from snapshot isolation. Did you enable it for your database, are you using the READ_COMMITTED_SNAPSHOT ON setting, or are you running on SQL Azure?

Yes, we do use read committed snapshot on. It seems the busiest query in the whole DB is now:

update top (1) JQ
set FetchedAt = GETUTCDATE()
output INSERTED.Id, INSERTED.JobId, INSERTED.Queue, INSERTED.FetchedAt
from [HangFire].JobQueue JQ with (forceseek, readpast, updlock, rowlock)
where Queue in (@queues1) and
(FetchedAt is null or FetchedAt < DATEADD(second, @timeoutSs, GETUTCDATE()));
IF @@ROWCOUNT > 0 RETURN;
WAITFOR DELAY @delay;
END

odinserj commented 2 years ago

The SCAN operator used for that query (required by the nature of the job queue) causes SQL Server to acquire a row lock for each of those ghost rows every time the query is executed. I don't know why it acquires locks for ghost rows, but I profiled SQL Server some years ago and this is what the profiler showed. CPU consumption is high because each time the query is executed there are 1,297,600 (taking the ghost record count from the table above) or more lock acquires and releases. So we need to understand why SQL Server didn't delete those ghost rows in time.
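One way to investigate that (a sketch; sys.dm_tran_active_snapshot_database_transactions only returns rows while row versioning is in use) is to look for long-running transactions that keep old row versions, and therefore version ghost records, alive:

-- Sketch: find long-running transactions that hold the version store open,
-- ordered by how long they have been running.
SELECT
    t.session_id,
    t.transaction_id,
    t.elapsed_time_seconds,
    s.login_name,
    s.program_name
FROM sys.dm_tran_active_snapshot_database_transactions AS t
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = t.session_id
ORDER BY t.elapsed_time_seconds DESC;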
