Open ManuelHaas opened 1 month ago
Thank you so much for reporting this! And looks like I need help in validating my logic below, because my results don't make any sense.
Several years ago, in the early days of .NET Core, I began to suspect there are problems with waiting on CancellationToken.WaitHandle
and implemented a workaround with a custom WaitHandle
object in the commit https://github.com/HangfireIO/Hangfire/commit/140b92f5609bff609f63d4b4d9f15a0413f70a4c. Some month ago I removed that logic, thinking there should be no problems with that property anymore, since .NET Core is now much more stable, but decided to keep some reinsurance logic that will help to diagnose this case without causing CPU to run at 100% due to constant retries.
The following piece of code performs waiting on a cancellation logic between some retry attempts, and calls the WaitHandle.WaitOne
method with a timeout value on a CancellationToken
to do so with the possibility to prevent delays during shutdown. All the other logic is to ensure that WaitOne
is working fine.
var stopwatch = Stopwatch.StartNew();
var waitResult = cancellationToken.WaitHandle.WaitOne(timeout); // The main line
stopwatch.Stop();
var timeoutThreshold = TimeSpan.FromMilliseconds(1000);
var elapsedThreshold = TimeSpan.FromMilliseconds(500);
var protectionTime = TimeSpan.FromSeconds(1);
if (!cancellationToken.IsCancellationRequested && // Checking `cancellationToken` hasn't been canceled
timeout >= timeoutThreshold && // Checking `timeout` more than 1 second
stopwatch.Elapsed < elapsedThreshold) // Checking less that 500ms elapsed
{
try
{
var logger = LogProvider.GetLogger(typeof(CancellationTokenExtentions));
logger.Error($"Actual wait time for non-canceled token was '{stopwatch.Elapsed}' instead of '{timeout}', wait result: {waitResult}, using protective wait. Please report this to Hangfire developers.");
}
finally
{
Thread.Sleep(protectionTime);
}
}
WaitOne
is false
, and that means that CancellationToken
instance wasn't canceled, e.g. the corresponding WaitHandle
instance hasn't been signaled (and IsCancellationRequested
check confirms this).WaitOne
was called with at least 1-second timeout
value, according to the log. It is a positive value, not TimeSpan.Zero
, so according to MSDN and common sense, the calling thread should be blocked until the timeout occurs.stopwatch.Elapsed
was less than 500ms.So it's either my logic is wrong (probably), detection logic is wrong (probably) my understanding of WaitOne
method's behavior is wrong (probably), or WaitHandle.WaitOne
method is wrong that's totally improbable – this method has so many usages and expected to be battle tested everywhere.
I would appreciate any help in understanding what's wrong here and in more reports of this behavior.
In version 1.8.15 I will improve the precision to know the exact timings in milliseconds, since TimeSpan.ToString
precision is not enough here.
As far as I can tell, the error only occurs when the server is under heavy load or the host system of the virtual machine is causing problems. A little over two weeks ago, the host system of the virtual machine started to cause problems. On the instance, the error message occurred multiple times every day for 12 days. Three days ago, I moved the instance to a new server and the error message has not occurred since.
Hi,
I got the following entries in my log:
I am reporting it here because that is how it is written in the message.
.NET 8.0.8 Hangfire 1.8.14