HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

Actual wait time for non-canceled token was '00:00:00' instead of '00:00:01', wait result: False, using protective wait. Please report this to Hangfire developers. #2447

Open ManuelHaas opened 1 month ago

ManuelHaas commented 1 month ago

Hi,

I got the following entries in my log:

2024-09-26 02:26:04.697 +02:00 [ERR] Actual wait time for non-canceled token was '00:00:00' instead of '00:00:01', wait result: False, using protective wait. Please report this to Hangfire developers.
2024-09-26 02:31:07.708 +02:00 [ERR] Actual wait time for non-canceled token was '00:00:00' instead of '00:00:01', wait result: False, using protective wait. Please report this to Hangfire developers.

I am reporting it here because that is what the message asks me to do.

.NET 8.0.8, Hangfire 1.8.14

odinserj commented 1 month ago

Thank you so much for reporting this! It looks like I need help validating my logic below, because these results don't make any sense to me.

Several years ago, in the early days of .NET Core, I began to suspect there were problems with waiting on CancellationToken.WaitHandle and implemented a workaround with a custom WaitHandle object in the commit https://github.com/HangfireIO/Hangfire/commit/140b92f5609bff609f63d4b4d9f15a0413f70a4c. Some months ago I removed that logic, thinking there should be no problems with that property anymore since .NET Core is now much more stable, but decided to keep some safety-net logic that helps diagnose this case without causing the CPU to run at 100% due to constant retries.
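
For reference, the old workaround was along these lines (a simplified sketch of the idea, not the exact code from the commit above): instead of waiting on CancellationToken.WaitHandle directly, it waited on a manually-controlled event that was signaled from a cancellation callback.

using System;
using System.Threading;

// Sketch of the old workaround: avoid CancellationToken.WaitHandle and wait
// on our own event, signaled by a registration callback when the token is
// canceled. Returns true if cancellation happened within the timeout.
static bool WaitForCancellation(CancellationToken cancellationToken, TimeSpan timeout)
{
    using (var cancellationEvent = new ManualResetEvent(false))
    using (cancellationToken.Register(() => cancellationEvent.Set()))
    {
        return cancellationEvent.WaitOne(timeout);
    }
}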

The following piece of code waits for cancellation between retry attempts. It calls the WaitHandle.WaitOne method with a timeout value on a CancellationToken, so the wait can be interrupted to avoid delays during shutdown. All the other logic is there to ensure that WaitOne is working fine.

var stopwatch = Stopwatch.StartNew();
var waitResult = cancellationToken.WaitHandle.WaitOne(timeout); // The main line
stopwatch.Stop();

var timeoutThreshold = TimeSpan.FromMilliseconds(1000);
var elapsedThreshold = TimeSpan.FromMilliseconds(500);
var protectionTime = TimeSpan.FromSeconds(1);

if (!cancellationToken.IsCancellationRequested && // Checking `cancellationToken` hasn't been canceled
    timeout >= timeoutThreshold && // Checking `timeout` is at least 1 second
    stopwatch.Elapsed < elapsedThreshold) // Checking less than 500 ms elapsed
{
    try
    {
        var logger = LogProvider.GetLogger(typeof(CancellationTokenExtentions));
        logger.Error($"Actual wait time for non-canceled token was '{stopwatch.Elapsed}' instead of '{timeout}', wait result: {waitResult}, using protective wait. Please report this to Hangfire developers.");
    }
    finally
    {
        Thread.Sleep(protectionTime);
    }
}
  1. From the messages above we see that the result value of WaitOne is false, which means the CancellationToken instance wasn't canceled, i.e. the corresponding WaitHandle instance wasn't signaled (and the IsCancellationRequested check confirms this).
  2. WaitOne was called with at least a 1-second timeout value, according to the log. That is a positive value, not TimeSpan.Zero, so according to MSDN and common sense, the calling thread should be blocked until the timeout occurs (see the repro sketch after this list).
  3. But the message says that stopwatch.Elapsed was less than 500 ms.
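
Here is a minimal standalone repro sketch (my own test harness, not Hangfire code) that anyone can run to check whether WaitOne behaves as expected on their machine:

using System;
using System.Diagnostics;
using System.Threading;

class WaitOneRepro
{
    static void Main()
    {
        using (var cts = new CancellationTokenSource())
        {
            var timeout = TimeSpan.FromSeconds(1);

            var stopwatch = Stopwatch.StartNew();
            // Wait on a token that is never canceled; this is expected to
            // block for the full timeout and return false.
            var waitResult = cts.Token.WaitHandle.WaitOne(timeout);
            stopwatch.Stop();

            Console.WriteLine(
                $"result={waitResult}, elapsed={stopwatch.Elapsed.TotalMilliseconds:F1} ms");
        }
    }
}

On a healthy machine this should print result=False with elapsed around 1000 ms; the log entries above correspond to the case where elapsed comes out near zero.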

So either my logic is wrong (probable), the detection logic is wrong (probable), my understanding of the WaitOne method's behavior is wrong (probable), or the WaitHandle.WaitOne method itself is wrong, which is totally improbable – this method has so many usages and is expected to be battle-tested everywhere.

I would appreciate any help in understanding what's wrong here, and more reports of this behavior.

In version 1.8.15 I will improve the logging precision to record the exact timings in milliseconds, since the default TimeSpan.ToString precision is not enough here.
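
The change will be something along these lines (a sketch of the idea, reusing the variables from the snippet above, not the exact patch): log raw millisecond counts instead of relying on the default TimeSpan formatting.

// Log exact millisecond values for both the measured and requested waits.
logger.Error(
    $"Actual wait time for non-canceled token was '{stopwatch.Elapsed.TotalMilliseconds:F3} ms' " +
    $"instead of '{timeout.TotalMilliseconds:F3} ms', wait result: {waitResult}, " +
    "using protective wait. Please report this to Hangfire developers.");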

ManuelHaas commented 1 month ago

As far as I can tell, the error only occurs when the server is under heavy load or when the host system of the virtual machine is having problems. A little over two weeks ago, the host system of our virtual machine started having problems, and the error message occurred on that instance multiple times every day for 12 days. Three days ago I moved the instance to a new server, and the error message has not occurred since.