HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io
Other
9.33k stars 1.69k forks source link

100% CPU after completing a recurring job #1972

Open onetsmr opened 2 years ago

onetsmr commented 2 years ago

Hi,

We use Hangfire 1.7.27 + Hagfire.Pro 2.3.0.0 + Hagfire.Pro.Redis 2.8.12.0

We are facing 100% CPU usage while hangfire is not processing any jobs after completing a recurring job 01

How does this happen:

  1. If restart the IIS then CPU load is 1-2%
  2. If the only recurring job will be executed at least once then CPU load is 98-99% and does not decrease for a long time

I've collected trace at this moment on the server using dotTrace 02 RunWorkLoop in DedicatedThreadPoolPipeScheduler takes 50% time Wait in CancellationTokenExtentions takes 20% time

This is RunWorkLoop code https://github.com/mgravell/Pipelines.Sockets.Unofficial/blob/19b141f30fe2485dd79993be0ac87a9fc6b34b74/src/Pipelines.Sockets.Unofficial/DedicatedThreadPoolPipeScheduler.cs#L166

odinserj commented 2 years ago

Thanks for the screenshots and profiling. Please note that Hangfire.Pro.Redis doesn't use Pipelines.Sockets.Unofficial, because based on the internalised custom branch of StackExchange.Redis 1.2.6 (so it doesn't matter what version is specified in the project itself). Also, CancellationTokenExtensions.Wait method is based on WaitHandle.WaitOne method that doesn't consume the CPU itself (despite shown in the profiler).

I'm more interested in those Thread Pool threads as they show a lot of activities. Is it still possible to examine their top methods and send some screenshots?

odinserj commented 2 years ago

Also could you please show inner details for the following call?

image

onetsmr commented 2 years ago

We downloaded and used Hagfire.Pro.Redis, not Hangfire.Pro.Redis.SEv2. I will try SEv2.

Thread Pools has 77% native code Screenshot_2 Screenshot_3

RunWorkLoop also has calling ntdll Screenshot_1

This is IIS 10.0 running on 64bit Windows Server 2016. Web service .net framework 4.6

odinserj commented 2 years ago

No, please don't use SEv2 – it's based on StackExchange.Redis 2.X that may be the reason of the issue with high CPU. Hangfire.Pro.Redis is based on the previous version of Redis client that doesn't use Sockets.Pipelines.Unifficial so at least we can isolate things.

odinserj commented 2 years ago

Could you tell me what does your recurring job do – I assume it talks with Redis, right?

onetsmr commented 2 years ago

Recurring job runs a few child jobs (10-100) based on tenants from the sql database. Each child job goes to the sql database, takes 10 records, processes them and re run itself while database has rows for processing. Both types of jobs uses RedLockNet for business locks. And hangfire and RedLockNet goes to the same redis cluster (3 nodes), but uses different keys.

As I understand, I should try to remove things one by one and watch how the situation has changed?

onetsmr commented 2 years ago

Clean recurring job - cpu 1-2% Clean recurring job with redlock - cpu 1-2%

Looks like the reason is not in Hangfire. Sorry.

odinserj commented 2 years ago

What if you add some delay after acquiring and before releasing it? Please note it's better to use async Task.Delay or sync Thread.Sleep, depending on the code that's being replaced.

odinserj commented 2 years ago

Will appreciate if you can share details for those native calls from profiler (that unfolded fat arrow in the bottom tree on the screenshot related to native calls in thread pool threads looks very interesting). Let's investigate this issue, it looks really strange, since all you have is SQL calls – by the way do you mean SQL Server or MySQL?

onetsmr commented 2 years ago

What if you add some delay after acquiring and before releasing it?

I've tried use BackgroundJob.Schedule(..., TimeSpan.FromSecond(1)) - didn't help

MySQL, okay, I will investigate something myself and return back with questions or answers.

Thank you.

onetsmr commented 2 years ago

The problem was in our code - there is a trotting module with semaphores and they are not disposed correctly.