HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

Hangfire is processing only one job at a time #2354

Closed: FixRM closed this issue 10 months ago

FixRM commented 10 months ago

Sorry if this is a known issue or if there is a workaround. I didn't manage to find anything similar.

We have 3 servers in our tenant running a batch of recurring jobs. Each server is configured to use 20 worker threads (all servers have 4 virtual cores). After a while, we see that only one job is running on each server at a time.

If we restart the application, Hangfire immediately starts running 20 jobs at a time. What could cause this behavior?

odinserj commented 10 months ago

Hi, what version of Hangfire.Core are you using, and which storage package and version?

FixRM commented 10 months ago

Thank you for the answer, @odinserj. Hangfire.Core 1.8.2; storage: Hangfire.Redis.StackExchange 1.8.6.

odinserj commented 10 months ago

Thanks. Please try upgrading to the latest versions and post an issue in the Hangfire.Redis.StackExchange repository. It's highly likely that the problem lies there.

FixRM commented 10 months ago

OK, sure. But why do you think it's a storage problem? Storage is abstracted away from job invocation, isn't it?

odinserj commented 10 months ago

It's from past experience. Another way to troubleshoot this is to run the stdump tool to capture stack traces of all threads while the problem is occurring and post them here. They should contain useful information that will point us in the right direction.

FixRM commented 10 months ago

OK, thanks. I already have stdump logs collected at the right time, but I don't understand what I should look for to find the problem. I can see the threads that I expect to be there; the question is why the others don't exist.

odinserj commented 10 months ago

Just post them here, since they contain only method names without argument values, and I will take a look.

FixRM commented 10 months ago

OK, I've put them here, but I'd love to learn from you how to read these logs.

odinserj commented 10 months ago

Thanks, they look great. Just to make sure: were they captured during the "processing one job at a time" problem? After a brief look, I see a lot of workers processing jobs, but many of them are waiting for something. I will check the details tomorrow and get back to you.

FixRM commented 10 months ago

Not exactly. I guess there were about 5-10 jobs per farm. Unfortunately, we forgot to take a snapshot. This time we had to react and reboot quickly because a business-critical task was delayed.

FixRM commented 10 months ago

Thank you very much, @odinserj. It turned out to be an issue in our DI implementation.

We have a WCF service client that is injected into our jobs. The client opens the WCF connection in its constructor and got stuck during metadata discovery due to a long timeout and network-related issues. Technically, the Hangfire worker was busy creating the job instance, but the instance was not running and thus was not visible in the Dashboard.
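The failure mode can be reproduced with a simplified Python model in which a `ThreadPoolExecutor` stands in for a server's Hangfire worker threads. This is an assumption for illustration, not Hangfire's actual processing pipeline. `WcfLikeClient` is a hypothetical dependency that blocks in its constructor until the "network" recovers, and `simulate` reports how many other jobs managed to start their body while those constructors were hung.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

WORKER_COUNT = 4  # stands in for the worker count per server (20 in this issue)


class WcfLikeClient:
    """Hypothetical dependency that connects in its constructor."""

    def __init__(self, network_ok: threading.Event):
        network_ok.wait()  # stands in for metadata discovery hanging on a long timeout


def run_job(needs_client, network_ok, started, lock):
    if needs_client:
        WcfLikeClient(network_ok)  # activation: worker is busy, job body has not started
    with lock:
        started.append(threading.current_thread().name)  # job body is now running


def simulate(stuck_jobs, quick_jobs, wait_s=0.3):
    """Count quick jobs that start while `stuck_jobs` are hung in their constructors."""
    network_ok = threading.Event()
    started, lock = [], threading.Lock()
    with ThreadPoolExecutor(max_workers=WORKER_COUNT) as pool:
        for _ in range(stuck_jobs):
            pool.submit(run_job, True, network_ok, started, lock)
        for _ in range(quick_jobs):
            pool.submit(run_job, False, network_ok, started, lock)
        time.sleep(wait_s)  # give any free worker time to pick up the quick jobs
        with lock:
            observed = len(started)
        network_ok.set()  # "network recovers" so the pool can shut down cleanly
    return observed


if __name__ == "__main__":
    print(simulate(stuck_jobs=4, quick_jobs=10))  # 0: every worker is stuck in a constructor
    print(simulate(stuck_jobs=3, quick_jobs=10))  # 10: all run on the single free worker
```

With every worker hung in construction, none of the quick jobs starts. With one worker left free, all of them run, but one after another on that single thread. Scaled up to 20 workers per server, this is consistent with the one-job-at-a-time symptom reported above.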