HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

who and when will server send/receive "Server rd0003ff2199ca:13116:03c6de91 caught stopping signal" #1883

Open ChangyeWei opened 3 years ago

ChangyeWei commented 3 years ago

Hangfire version 1.7.18, .NET Core 3.1, deployed on k8s. Our service has jobs that need to run for 10+ hours. While such a job is running, we get several "remove server" errors. I'm not sure whether the switch and stop also happen when there are only short jobs.

I can find the log entry "Server * caught stopping signal". So my question is: why is the server going to stop? How can I prevent the server from switching/stopping?

Winbackissue

zhuweid commented 3 years ago

We started receiving the same stopping signal on a couple of servers after 12 hours of running.

I guess this is because of application pool recycling.

ChangyeWei commented 3 years ago

@zhuweid Our service runs on Linux, so there is no application pool recycling involved. I looked into the code and found that Hangfire removes a server when it detects a heartbeat timeout. The default timeout is 5 minutes, and there is a heartbeat check every 30 seconds. I will try to add some logging to confirm that the server is removed because the heartbeat check keeps failing.
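For reference, here is a minimal sketch of where these timings can be adjusted. It assumes an ASP.NET Core app using the `Hangfire.AspNetCore` package and the `AddHangfireServer` overload that takes an options callback. The helper name `AddLongRunningHangfireServer` and the 15-minute value are made up for illustration, and storage configuration is omitted.

```csharp
using System;
using Hangfire;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical helper; the class and method names are illustrative only.
public static class HangfireServerSetup
{
    public static IServiceCollection AddLongRunningHangfireServer(this IServiceCollection services)
    {
        return services.AddHangfireServer(options =>
        {
            // How often this server reports that it is alive.
            options.HeartbeatInterval = TimeSpan.FromSeconds(30);

            // How often servers look for timed-out peers.
            options.ServerCheckInterval = TimeSpan.FromMinutes(5);

            // How long a server may go without a heartbeat before it is removed.
            // 15 minutes is an arbitrary example value, not a recommendation.
            options.ServerTimeout = TimeSpan.FromMinutes(15);
        });
    }
}
```

A larger `ServerTimeout` only widens the window a server can survive without a successful heartbeat. A storage outage longer than that window would presumably still lead to removal.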

zhuweid commented 3 years ago

Interesting. We observed a similar issue in our testing environment, where a long-running job (12+ hours) failed after a few hours.

After some investigation, we found that our SQL DB had been rebooted for a SQL patch at that time, which left SQL down for nearly 8 minutes. Since the default heartbeat timeout is 5 minutes, the Hangfire server got removed.

The logging looks like:

..
7/14/2021, 2:25:29.585 AM Server 056q:7732:d21c943a heartbeat successfully sent
7/14/2021, 2:31:18.495 AM 4 servers were removed due to timeout
7/14/2021, 2:31:23.866 AM Server 056q:7732:d21c943a was considered dead by other servers, restarting...
7/14/2021, 2:31:23.867 AM Server 056q:7732:d21c943a caught restart signal...
7/14/2021, 2:31:23.870 AM Server 056q:7732:d21c943a stopped non-gracefully due to ServerWatchdog,
7/14/2021, 2:31:23.900 AM Server 056q:7732:d21c943a successfully reported itself as stopped in 11.5192
7/14/2021, 2:31:23.900 AM Server 056q:7732:d21c943a has been stopped in total 15.7445 ms
...
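As a quick sanity check on the timestamps above, the sketch below compares the gap between the last successful heartbeat and the removal message against the 5-minute timeout mentioned earlier. It uses plain `DateTime` arithmetic and no Hangfire APIs. The class and variable names are illustrative.

```csharp
using System;

class HeartbeatGapCheck
{
    static void Main()
    {
        // Timestamps copied from the log lines above (7/14/2021).
        var lastHeartbeat = new DateTime(2021, 7, 14, 2, 25, 29, 585);
        var removalLogged = new DateTime(2021, 7, 14, 2, 31, 18, 495);

        // Default timeout as described in this thread.
        var serverTimeout = TimeSpan.FromMinutes(5);

        TimeSpan gap = removalLogged - lastHeartbeat;

        Console.WriteLine($"Gap since last heartbeat: {gap.TotalSeconds} s");
        Console.WriteLine($"Gap exceeds timeout: {gap > serverTimeout}");
    }
}
```

The gap works out to roughly 5 minutes 49 seconds, which is past the 5-minute timeout. That is consistent with the server being removed while the database was unavailable.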