Open ChangyeWei opened 3 years ago
We started receiving the same stopping signal after a 12-hour run on a couple of servers.
I guess this is caused by application pool recycling.
@zhuweid Our service runs on Linux, so it should not have any application pool recycling issue. I looked into the code and found that Hangfire removes a server when it detects a heartbeat timeout. The default timeout is 5 minutes, with a heartbeat check every 30 seconds. I will try adding some logging to confirm that the server is being removed because of persistent heartbeat check failures.
Interesting, we observe a similar issue in our testing environment, where a long-running job (12+ hours) failed after a few hours.
After some investigation, we found that our SQL DB had been rebooted for a SQL patch at that time, which left SQL down for nearly 8 minutes. Since the default heartbeat timeout is 5 minutes, the Hangfire server was removed.
The logging looks like:
..
7/14/2021, 2:25:29.585 AM Server 056q:7732:d21c943a heartbeat successfully sent
7/14/2021, 2:31:18.495 AM 4 servers were removed due to timeout
7/14/2021, 2:31:23.866 AM Server 056q:7732:d21c943a was considered dead by other servers, restarting...
7/14/2021, 2:31:23.867 AM Server 056q:7732:d21c943a caught restart signal...
7/14/2021, 2:31:23.870 AM Server 056q:7732:d21c943a stopped non-gracefully due to ServerWatchdog,
7/14/2021, 2:31:23.900 AM Server 056q:7732:d21c943a successfully reported itself as stopped in 11.5192
7/14/2021, 2:31:23.900 AM Server 056q:7732:d21c943a has been stopped in total 15.7445 ms
...
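If the removals really come from the 5-minute timeout being shorter than the storage outage, one possible mitigation is to raise the server timeout above the longest outage you expect. Below is a minimal sketch, not a confirmed fix for this issue. It assumes the server is registered through ASP.NET Core DI with `AddHangfireServer` and that storage is already configured elsewhere via `AddHangfire`. The `AddTolerantHangfireServer` helper name and the 15-minute value are illustrative assumptions, not something from this thread.

```csharp
using System;
using Hangfire;
using Microsoft.Extensions.DependencyInjection;

public static class HangfireServerSetup
{
    // Hypothetical helper name; call it from ConfigureServices after AddHangfire(...).
    public static IServiceCollection AddTolerantHangfireServer(this IServiceCollection services)
    {
        return services.AddHangfireServer(options =>
        {
            // How often this server sends its heartbeat to storage.
            options.HeartbeatInterval = TimeSpan.FromSeconds(30);

            // How often the watchdog looks for timed-out servers.
            options.ServerCheckInterval = TimeSpan.FromMinutes(5);

            // A server is removed once its last heartbeat is older than this.
            // 15 minutes is an assumed value chosen to outlast an ~8 minute outage.
            options.ServerTimeout = TimeSpan.FromMinutes(15);
        });
    }
}
```

The trade-off is that a server that has genuinely died will also take longer to be detected and removed, so the value is worth tuning against how long your storage can realistically be unavailable.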
Hangfire version 1.7.18, .NET Core 3.1, deployed on k8s. Our service has jobs that need to run for 10+ hours. We get several "remove server" errors while a job is running. I am not sure whether the switch and stop also happen when there are only short jobs.
I can find the log entry "Server * caught stopping signal". So my questions are: why is the server going to stop, and how can I prevent the server from switching/stopping?