dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Task.Delay(timeout) completes before the specified timeout has expired #100455

Open zlatko-michailov opened 5 months ago

zlatko-michailov commented 5 months ago

### Description

Task.Delay(timeout) completes before the specified timeout has expired. This behavior reproduces both when a millisecond count is used and when a TimeSpan constructed from ticks is used.

### Reproduction Steps

Run the following code. After a minute or so, entries will start popping up. Passing no command-line argument uses a millisecond count. Passing a `ticks` argument uses a TimeSpan constructed from integer clock ticks. Passing a `stopwatch` argument uses a Stopwatch.


```csharp
using System.Diagnostics;

class Program
{
    const int expectedMs = 1000;

    static async Task Main(string[] args)
    {
        bool ticks = false;
        bool stopwatch = false;

        foreach (string arg in args)
        {
            ticks = ticks || arg == "ticks";
            stopwatch = stopwatch || arg == "stopwatch";
        }

        await RunTest(ticks, stopwatch);
    }

    static async Task RunTest(bool ticks, bool stopwatch)
    {
        Console.WriteLine($"Using ticks={ticks}, stopwatch={stopwatch}");

        for (int i = 0; i < 1000000; i++)
        {
            DateTime beforeTime = DateTime.UtcNow;
            Stopwatch sw = Stopwatch.StartNew();
            if (ticks)
            {
                await Task.Delay(TimeSpan.FromTicks(expectedMs * TimeSpan.TicksPerMillisecond));
            }
            else
            {
                await Task.Delay(expectedMs);
            }
            sw.Stop();
            DateTime afterTime = DateTime.UtcNow;

            TimeSpan duration = afterTime - beforeTime;
            double actualMs = duration.TotalMilliseconds;
            if (stopwatch)
            {
                actualMs = sw.ElapsedMilliseconds;
            }

            if (actualMs < expectedMs)
            {
                Console.WriteLine($"{i}: actual={actualMs}, expected={expectedMs}, {beforeTime:O}, {afterTime:O}");
            }
        }
    }
}
```

### Expected behavior

The TimeSpan from `beforeTime` to `afterTime` should always be at least 1,000 ms.
Taking the time before and after the await takes some time, and leaving the initial thread and scheduling the continuation should also take some time.

### Actual behavior

In some cases the TimeSpan between `beforeTime` and `afterTime` is less than 1,000 ms.

### Regression?

Don't know.

### Known Workarounds

Some approximation was considered, but it was ruled out. The Task class is at the foundation of async programming, so this looks like a bug that should be fixed.

### Configuration

_No response_

### Other information

_No response_
huoyaoyuan commented 5 months ago

The precise parameter doesn't matter. Task.Delay always converts the time to integer milliseconds.
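As a sketch of that conversion (the runtime's `Task.Delay(TimeSpan)` overload casts `TotalMilliseconds` to a whole-millisecond count, so sub-millisecond precision is discarded before any timer is scheduled; the variable names here are illustrative):

```csharp
using System;

class Program
{
    static void Main()
    {
        // A TimeSpan with sub-millisecond precision: 15,000 ticks of 100ns = 1.5 ms.
        TimeSpan timeout = TimeSpan.FromTicks(15_000);

        // The truncating cast mirrors what Task.Delay does internally:
        long milliseconds = (long)timeout.TotalMilliseconds;

        Console.WriteLine(milliseconds); // prints 1: the half millisecond is dropped
    }
}
```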

DateTime.UtcNow has its own overhead and inaccuracy. Stopwatch should be used for more precise time measurement.

Clockwork-Muse commented 5 months ago

As an example - although it's less of a case now, it used to be that the system clock would have a resolution of ~15ms (for all languages - this was mostly a hardware/OS problem), which causes a number of amusing side effects. I'm not sure the value the clock reported was guaranteed to be the "earliest" value, as opposed to just a value, so a duration obtained by taking two clock reads and subtracting them could be up to ~30ms off.

zlatko-michailov commented 5 months ago

I allowed myself to edit the repro by adding an option to use a Stopwatch.

Indeed, the repro with a Stopwatch is less frequent, but it still exists:

```
591: actual=998, expected=1000, 2024-03-31T04:46:00.3957635Z, 2024-03-31T04:46:01.3937879Z
820: actual=999, expected=1000, 2024-03-31T04:49:51.2193977Z, 2024-03-31T04:49:52.2190058Z
```

zlatko-michailov commented 5 months ago

As this is the Runtime project, all three - Task, DateTime, and Stopwatch - fall under it. Make them work together.

I am not asking that Task.Delay() be accurate in absolute/atomic time, or even in machine time. I am asking that it be consistent in .NET terms within the same .NET process.

tannergooding commented 5 months ago

Could you share which version of .NET, which Operating System (including version), and which CPU you're running on?

Notably this doesn't repro on Windows 10 or 11 for me, across any of my AMD, Intel, or Arm64 systems (all of which are 2015 or later). I haven't yet tried against any Linux systems.

huoyaoyuan commented 5 months ago

It reproduces for me after about 5 minutes. I'm not sure what is affecting the precision.

zlatko-michailov commented 5 months ago

@tannergooding

Please don't blame it on the OS or on the machine clock. I doubt there is a computer whose clock would ever go backwards. Chances are Task.Delay() overdoes its effort to avoid starting the continuation too late.

tannergooding commented 5 months ago

Please don't blame it on the OS or on the machine clock.

.NET itself isn't doing any special logic to allow pre-firing; it's simply querying the underlying system tick-count APIs and using the relevant threading and other APIs.

The .NET code is fully open and you can go and double check what it does yourself if you want to be sure:

The last of which clearly uses exact thresholds, not fuzziness, relying only on TickCount64, which defers to Environment.TickCount64 on Unix (backed by clock_gettime_nsec_np, CLOCK_MONOTONIC_COARSE, or CLOCK_MONOTONIC) and QueryUnbiasedInterruptTime on Windows 8+ (which takes the 100ns count and turns it into milliseconds using integer division, which truncates).

You then have TimerQueue.EnsureTimerFiresBy(uint): https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Threading/Timer.cs,48ca530eebd59912

This ensures it's scheduled and similarly uses exact tick counts, only firing if the tick count is greater than or equal to the deadline.

You then have the actual timer thread which is either the portable version (such as on Unix): https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Threading/TimerQueue.Portable.cs,01f8a0f2623d1904,references

or uses the windows threadpool (default for windows) which calls CreateThreadpoolTimer and SetThreadpoolTimer: https://github.com/dotnet/runtime/blob/cc7bf831f02cad241547ebea5c56c82f12a50999/src/libraries/System.Private.CoreLib/src/System/Threading/TimerQueue.WindowsThreadPool.cs

I doubt there is a computer whose clock would ever go backwards

This is a very common thing and why you have different clock sources only some of which are guaranteed to be monotonic in nature.

.NET currently ensures we use monotonic clocks. However, many such clocks are not strictly guaranteed to always be monotonic across threads or separate CPU cores.

Documents like https://learn.microsoft.com/en-us/windows/win32/api/realtimeapiset/nf-realtimeapiset-queryunbiasedinterrupttime, https://learn.microsoft.com/en-us/windows/win32/winmsg/about-timers, https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps, or even the POSIX specific docs around CLOCK_* cover a lot of these nuances and what systems can impact the reported measurements, how they can be subject to cross cores considerations, etc.

Windows specifically calls out, as an example:

When you compare performance counter results that are acquired from different threads, consider values that differ by ± 1 tick to have an ambiguous ordering. If the time stamps are taken from the same thread, this ± 1 tick uncertainty doesn't apply. In this context, the term tick refers to a period of time equal to 1 ÷ (the frequency of the performance counter obtained from QueryPerformanceFrequency).

There are then also other factors to consider, such as whether values displayed using double are exact or if values that are computed using division are themselves exact.

A simple example is that Stopwatch.ElapsedMilliseconds does GetElapsedDateTimeTicks() / TicksPerMillisecond as an integer division, so it truncates. GetElapsedDateTimeTicks() is itself (long)(GetRawElapsedTicks() * s_tickFrequency): it takes an int64, converts it to double, scales it by something that is likely not exactly representable as a double, and casts back to long (which truncates). Thus, it is entirely reasonable that, due to the double truncation and the general inaccuracy of double in producing a display value, something that was 1000ms in terms of, say, 100ns ticks converts down to 999ms when displayed. More data would need to be displayed to determine whether it was actually 1000ms or not.
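To make that truncation concrete, here is a minimal sketch. It assumes a 10 MHz performance counter, so the scale factor is exactly 1.0 and one raw tick equals one 100ns DateTime tick; the point is only to show the two truncating steps described above:

```csharp
using System;

class Program
{
    static void Main()
    {
        // Assume a 10 MHz counter, so s_tickFrequency == 10_000_000.0 / 10_000_000 == 1.0.
        long rawElapsedTicks = 9_999_999;   // 999.9999 ms of real time
        double tickFrequency = 1.0;

        // Mirrors GetElapsedDateTimeTicks(): scale via double, truncating cast back to long.
        long dateTimeTicks = (long)(rawElapsedTicks * tickFrequency);

        // Mirrors ElapsedMilliseconds: integer division truncates again.
        long elapsedMs = dateTimeTicks / TimeSpan.TicksPerMillisecond;

        Console.WriteLine(elapsedMs); // prints 999, though the interval is only 100ns short of a second
    }
}
```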

Given how the actual timer APIs are coded and that they aren't doing anything that would cause underreporting in times, that I can see, I expect you could reproduce the issue in C/C++ on both Linux and Windows and it's likely coming down to threading and some minor differences in the underlying OS APIs and hardware timers.

Clockwork-Muse commented 5 months ago

As a practical matter, if you actually want millisecond accuracy on your task delays, you're probably doing something awkward or better handled by something else. What is it you're doing that you want this accuracy?

zlatko-michailov commented 5 months ago

In Prod, we have a cron-like loop that fires multiple jobs, each of which may have a different frequency. The loop wakes up frequently (every 1 minute) and each job decides whether it should execute or ignore the call, based on its own schedule. In our test, the loop wakes up every 1 second, and there are 2 jobs - one that executes every 1 second, i.e. supposedly on every fire from the loop, and another one that executes every 2 seconds. The first job has misses - a lot more frequently than in the simple repro I provided.

I figured the simple repro would be more stable; that's why I reduced the loop delay to 100 ms. Otherwise, it would take too long to repro. 100 ms is coarse enough for any clock to satisfy, IMO.

I also provided the "precise" option in the repro to avoid inaccuracy from floating-point division (at least on my end), but the problem still reproduces.

tannergooding commented 5 months ago

100 ms is coarse-enough for any clock to satisfy, IMO.

The consideration wouldn't be if it's coarse enough, it's that if you have say a clock with 15ms granularity, then you can represent 90ms or 105ms, but not 100ms.

I figured the simple repro would be more stable, that's why I reduced the loop delay to 100 ms.

I ran this all night at 1000ms and got exactly 1 trigger where stopwatch reported 999ms but where if you observe sw.ElapsedTicks and Stopwatch.Frequency it was actually 1 second worth of time.

Could you update your own local repro to dump the following information

```csharp
Console.WriteLine($"{i}: actual={actualMs} (ticks={sw.ElapsedTicks}, freq={Stopwatch.Frequency}), expected={expectedMs}, {beforeTime:O}, {afterTime:O} (deltaTicks={(afterTime - beforeTime).Ticks})");
```

In particular, we're most interested in the raw elapsed ticks/frequency of the stopwatch, and the raw tick delta for the DateTime, to ensure that no issues are being shown purely due to display reasons.

tannergooding commented 5 months ago

I did get a single trigger with the following program:

```csharp
// See https://aka.ms/new-console-template for more information
using System.Diagnostics;

class Program
{
    const int expectedMs = 100;

    static async Task Main(string[] args)
    {
        bool ticks = false;
        bool stopwatch = false;

        foreach (string arg in args)
        {
            ticks = ticks || arg == "ticks";
            stopwatch = stopwatch || arg == "stopwatch";
        }

        await RunTest(ticks, stopwatch);
    }

    static async Task RunTest(bool ticks, bool stopwatch)
    {
        Console.WriteLine($"Using ticks={ticks}, stopwatch={stopwatch}");

        for (int i = 0; i < 1000000; i++)
        {
            if (i % 1000 == 0)
            {
                Console.WriteLine(i);
            }

            int firstProcId = Thread.GetCurrentProcessorId();

            long beforeTicks = Environment.TickCount64;
            DateTime beforeTime = DateTime.UtcNow;
            Stopwatch sw = Stopwatch.StartNew();
            if (ticks)
            {
                await Task.Delay(TimeSpan.FromTicks(expectedMs * TimeSpan.TicksPerMillisecond));
            }
            else
            {
                await Task.Delay(expectedMs);
            }
            sw.Stop();
            DateTime afterTime = DateTime.UtcNow;
            long afterTicks = Environment.TickCount64;

            int lastProcId = Thread.GetCurrentProcessorId();

            TimeSpan duration = afterTime - beforeTime;
            double actualMs = duration.TotalMilliseconds;
            if (stopwatch)
            {
                actualMs = sw.ElapsedMilliseconds;
            }

            if (actualMs < expectedMs)
            {
                Console.WriteLine($"{i}: procId={firstProcId}->{lastProcId}; Stopwatch - actual={actualMs} (ticks={sw.ElapsedTicks}, freq={Stopwatch.Frequency}), expected={expectedMs}; DateTime: {beforeTime:O}, {afterTime:O} (deltaTicks={(afterTime - beforeTime).Ticks}); TickCount64 - {beforeTicks}, {afterTicks}");
            }
        }
    }
}
```

Where it gave:

```
42: procId=31->5; Stopwatch - actual=99 (ticks=999915, freq=10000000), expected=100; DateTime: 2024-04-01T17:45:18.4304004Z, 2024-04-01T17:45:18.5303925Z (deltaTicks=999921); TickCount64 - 232966468, 232966578
```

You can see that:

Clockwork-Muse commented 5 months ago

The loop wakes up frequently (every 1 minute) and each job decides whether it should execute or ignore the call, based on its own schedule

So then you don't actually care about the precision per se, you're building a scheduling system, and the usual concerns (spurious wakes, essentially, here) apply.

Is there a reason you're building your own, instead of using one of the existing packages/applications available in the ecosystem (eg, Quartz)?

zlatko-michailov commented 5 months ago

@tannergooding I see you have reproduced that all timer-based measurements can return < 1000 ms. (To repro with Environment.TickCount64, you'll have to include it in the condition; I haven't tried it, though.) Do you still need any further input from me?

tannergooding commented 5 months ago

This looks to be an OS issue, not a .NET issue; it's likely something that needs to be filed with Windows/Linux/macOS, etc.

.NET could potentially use Stopwatch explicitly rather than the much coarser tick count as a way to avoid the issue, but that may not strictly be better.

zlatko-michailov commented 5 months ago

I was hoping dotnet could use its own clock abstraction on top of the hardware clock, and guarantee fidelity among the time/duration-related APIs.

tannergooding commented 5 months ago

The issue here is that an underlying OS API appears to be returning inaccurate numbers in some cases. There's nothing that .NET can do to fix that except switch to an entirely different timer source.

On Windows 8 and above, all we're doing is calling QueryUnbiasedInterruptTime and dividing the 100ns ticks by 10_000, thus converting it to milliseconds. Since it's integer division, we're then truncating any partial precision.

Similarly, we by default simply use SetThreadpoolTimer which is meant to operate in 100ns ticks and short of that, fallback to QueryUnbiasedInterruptTime.

There's nothing "better" we can do, especially that wouldn't be much more expensive and wouldn't be subject to the same potential issues if the OS sometimes reports inaccurate numbers.

zlatko-michailov commented 5 months ago

I tweaked my repro to use all three methods - DateTime.UtcNow, Stopwatch, and Environment.TickCount64. If any of the three returns a duration under the expected 1,000 ms, a message is logged to the console.

Results on Windows:

```
Using ticks=False
780: expected=1000, actualDurationMs=999.9728, actualStopwatchMs=999, actualTickCountMs=1000
1860: expected=1000, actualDurationMs=999.9935, actualStopwatchMs=999, actualTickCountMs=1000
1981: expected=1000, actualDurationMs=1000.0029, actualStopwatchMs=999, actualTickCountMs=1000
2162: expected=1000, actualDurationMs=999.6005, actualStopwatchMs=999, actualTickCountMs=1000
2280: expected=1000, actualDurationMs=999.9874, actualStopwatchMs=999, actualTickCountMs=1000
2600: expected=1000, actualDurationMs=999.9641, actualStopwatchMs=999, actualTickCountMs=1000
2624: expected=1000, actualDurationMs=999.9983, actualStopwatchMs=999, actualTickCountMs=1000
2636: expected=1000, actualDurationMs=999.9606, actualStopwatchMs=999, actualTickCountMs=1000
2687: expected=1000, actualDurationMs=999.9844, actualStopwatchMs=999, actualTickCountMs=1000
3510: expected=1000, actualDurationMs=999.9831, actualStopwatchMs=999, actualTickCountMs=1000
3522: expected=1000, actualDurationMs=1000.0026, actualStopwatchMs=999, actualTickCountMs=1000
```

```
Using ticks=True
246: expected=1000, actualDurationMs=1000.0081, actualStopwatchMs=999, actualTickCountMs=1000
861: expected=1000, actualDurationMs=999.9981, actualStopwatchMs=999, actualTickCountMs=1000
1014: expected=1000, actualDurationMs=1000.0053, actualStopwatchMs=999, actualTickCountMs=1000
2484: expected=1000, actualDurationMs=999.9792, actualStopwatchMs=999, actualTickCountMs=1000
2558: expected=1000, actualDurationMs=999.995, actualStopwatchMs=999, actualTickCountMs=1000
2896: expected=1000, actualDurationMs=999.9237, actualStopwatchMs=999, actualTickCountMs=1000
2930: expected=1000, actualDurationMs=999.97, actualStopwatchMs=999, actualTickCountMs=1000
3051: expected=1000, actualDurationMs=1000.0041, actualStopwatchMs=999, actualTickCountMs=1000
```

Apparently, using ticks in Task.Delay() doesn't make any difference. It looks like both DateTime.UtcNow and Stopwatch suffer from some rounding, which is more severe with Stopwatch. That makes Stopwatch the worst choice. The good news is there is a winner that never failed - Environment.TickCount64. I don't know whether the rounding is different or whether it uses a different underlying mechanism to poll the time, but the net result is that it is better than the other two methods.

On WSL on the same Windows machine, the results are marginally, but not fundamentally, different:

```
Using ticks=False
132: expected=1000, actualDurationMs=990.6424, actualStopwatchMs=990, actualTickCountMs=1000
290: expected=1000, actualDurationMs=991.4873, actualStopwatchMs=991, actualTickCountMs=1000
399: expected=1000, actualDurationMs=990.5667, actualStopwatchMs=990, actualTickCountMs=1000
451: expected=1000, actualDurationMs=992.5291, actualStopwatchMs=992, actualTickCountMs=1000
733: expected=1000, actualDurationMs=990.4435, actualStopwatchMs=990, actualTickCountMs=1000
768: expected=1000, actualDurationMs=992.2823, actualStopwatchMs=992, actualTickCountMs=1000
823: expected=1000, actualDurationMs=993.9202, actualStopwatchMs=993, actualTickCountMs=1001
841: expected=1000, actualDurationMs=990.766, actualStopwatchMs=990, actualTickCountMs=1000
882: expected=1000, actualDurationMs=995.5455, actualStopwatchMs=995, actualTickCountMs=1000
912: expected=1000, actualDurationMs=991.0338, actualStopwatchMs=991, actualTickCountMs=1000
945: expected=1000, actualDurationMs=990.7209, actualStopwatchMs=990, actualTickCountMs=1000
```

The error from DateTime.UtcNow and Stopwatch is bigger - this is no longer floating-point rounding. But the pattern is the same: both suffer from the same inaccuracy. Environment.TickCount64 doesn't fail here either.

So, to the statement "There's nothing that .NET can do to fix that except switch to an entirely different timer source" - there is obviously something .NET can do: switch to the same mechanism that Environment.TickCount64 uses.

Here's the current repro:

```csharp
using System.Diagnostics;

class Program
{
    const int expectedMs = 1000;

    static async Task Main(string[] args)
    {
        bool ticks = false;

        foreach (string arg in args)
        {
            ticks = ticks || arg == "ticks";
        }

        await RunTest(ticks);
    }

    static async Task RunTest(bool ticks)
    {
        Console.WriteLine($"Using ticks={ticks}");

        for (int i = 0; i < 1000000; i++)
        {
            DateTime beforeTime = DateTime.UtcNow;
            Stopwatch sw = Stopwatch.StartNew();
            long beforeTickCount = Environment.TickCount64;
            if (ticks)
            {
                await Task.Delay(TimeSpan.FromTicks(expectedMs * TimeSpan.TicksPerMillisecond));
            }
            else
            {
                await Task.Delay(expectedMs);
            }
            long afterTickCount = Environment.TickCount64;
            sw.Stop();
            DateTime afterTime = DateTime.UtcNow;

            TimeSpan duration = afterTime - beforeTime;
            double actualDurationMs = duration.TotalMilliseconds;
            long actualStopwatchMs = sw.ElapsedMilliseconds;
            long actualTickCountMs = afterTickCount - beforeTickCount;

            if (actualDurationMs < expectedMs || actualStopwatchMs < expectedMs || actualTickCountMs < expectedMs)
            {
                Console.WriteLine($"{i}: expected={expectedMs}, actualDurationMs={actualDurationMs}, actualStopwatchMs={actualStopwatchMs}, actualTickCountMs={actualTickCountMs}");
            }
        }
    }
}
```
tannergooding commented 5 months ago

The good news is there is a winner that never failed - Environment.TickCount64. I don't know whether the rounding is different or whether it uses a different underlying mechanism to poll the time. But the net result is it is better than the other two methods.

I think you're somewhat misinterpreting the results here.

It's not that actualTickCountMs (which is to say Environment.TickCount64) is more accurate. It's instead likely the thing introducing the rounding error, reporting 1000 even though only 999.9728ms have elapsed. Task.Delay, which uses Environment.TickCount64, would then in some cases fire less than a millisecond before the scheduled time if such rounding error exists.
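A small sketch of how a millisecond tick count can over-report a delta (the 100ns values below are made up for illustration, not real API output; the division mirrors the truncation from interrupt-time ticks to milliseconds):

```csharp
using System;

class Program
{
    static void Main()
    {
        // Hypothetical interrupt-time readings in 100ns units:
        long before100ns = 10_009_999;  // 1000.9999 ms since boot
        long after100ns  = 20_000_100;  // 2000.0100 ms since boot -> 999.0101 ms truly elapsed

        // Each reading is truncated to whole milliseconds independently:
        long beforeMs = before100ns / 10_000; // 1000
        long afterMs  = after100ns / 10_000;  // 2000

        // The millisecond delta over-reports the real elapsed time:
        Console.WriteLine(afterMs - beforeMs); // prints 1000, though only ~999.01 ms elapsed
    }
}
```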

You should be able to identify this by displaying sw.ElapsedTicks and Stopwatch.Frequency instead of sw.ElapsedMilliseconds, as a simple example, showing there is no rounding error. The same goes for displaying duration.Ticks rather than duration.TotalMilliseconds.

Clockwork-Muse commented 5 months ago

Also, you seem to be assuming that there's something in the C# runtime that's just observing some counter and then firing the tasks that instant. That's not (entirely) what's happening - the thing that's ultimately firing the tasks is an OS/hardware utility, and due to the layers involved there's a bit of wiggle in when the task is resumed "on time" (for various reasons).
You can't get around this. The runtime handles some "very early" spurious wakes by requeuing them with the remaining time, but anything due before the next "clock tick" (of some resolution) is usually just handled immediately so that the task is mostly on time.
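For callers that nevertheless need a hard lower bound on the wait, the same requeue idea can be applied at the call site. A minimal sketch (the `DelayAtLeast` name is illustrative, not a runtime API; note that once less than 1 ms remains, Task.Delay truncates the remainder to zero and returns immediately, so the tail of the wait is effectively a short spin):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class DelayUtil
{
    // Keep re-awaiting until a Stopwatch confirms the full interval has
    // elapsed, absorbing any early completion of Task.Delay.
    public static async Task DelayAtLeast(TimeSpan minimum)
    {
        var sw = Stopwatch.StartNew();
        TimeSpan remaining = minimum;
        while (remaining > TimeSpan.Zero)
        {
            await Task.Delay(remaining);
            remaining = minimum - sw.Elapsed;
        }
    }
}

class Program
{
    static async Task Main()
    {
        var sw = Stopwatch.StartNew();
        await DelayUtil.DelayAtLeast(TimeSpan.FromSeconds(1));
        // True by construction: the inner Stopwatch started after this one.
        Console.WriteLine(sw.Elapsed >= TimeSpan.FromSeconds(1));
    }
}
```

This only guarantees "at least" semantics against the Stopwatch clock; it does not make the wake-up time itself more precise.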

Tasks (and Task.Delay) are not meant for millisecond accuracy - if you actually need something like that, you need media timers, which are implemented differently.

As a practical matter, I'll again ask why you're implementing your own task scheduler, instead of using one of the (quite robust) existing implementations.

glen-84 commented 1 month ago

I think that we have the same issue here:

```
Assert.InRange() Failure: Value not in range
Range:  (1000 - 2000)
Actual: 998
```

Test Source

I was thinking of using TimeProvider.Delay instead, but it seems difficult to mock (being an extension method).

I might just reduce the range to 995 for now.

Clockwork-Muse commented 1 month ago

I was thinking of using TimeProvider.Delay instead, but it seems difficult to mock (being an extension method).

The way the extension method works, it calls TimeProvider.CreateTimer, so you would mock/implement that method instead.
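A minimal sketch of that approach (requires .NET 8's `TimeProvider`/`ITimer`; this fake fires the callback synchronously so a delay completes immediately - real tests would more likely use `FakeTimeProvider` from the `Microsoft.Extensions.TimeProvider.Testing` package, which tracks due times properly):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// A fake TimeProvider whose CreateTimer fires the callback right away
// instead of waiting for the due time.
sealed class ImmediateTimeProvider : TimeProvider
{
    public override ITimer CreateTimer(TimerCallback callback, object? state,
                                       TimeSpan dueTime, TimeSpan period)
    {
        callback(state);          // fire synchronously
        return new NoopTimer();
    }

    private sealed class NoopTimer : ITimer
    {
        public bool Change(TimeSpan dueTime, TimeSpan period) => false;
        public void Dispose() { }
        public ValueTask DisposeAsync() => ValueTask.CompletedTask;
    }
}

class Program
{
    static void Main()
    {
        var provider = new ImmediateTimeProvider();
        bool fired = false;
        using var timer = provider.CreateTimer(_ => fired = true, null,
                                               TimeSpan.FromSeconds(5),
                                               Timeout.InfiniteTimeSpan);
        Console.WriteLine(fired); // prints True: the fake fired despite the 5s due time
    }
}
```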

I might just reduce the range to 995 for now.

Traditionally clock resolutions are ~15ms, so you might want 985 instead.