JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.91k stars 5.49k forks source link

Feature request: Finer-grained sleep function #54971

Open MilesCranmer opened 5 months ago

MilesCranmer commented 5 months ago

Base.sleep has a minimum time of 1 millisecond. This is very coarse compared to other high-performance languages like C and Rust and can cause some issues for some applications (see thread) such as periodically polling workers without blocking the main thread.

In other languages, for example, Python can actually get fairly high resolution with the default time.sleep (although its blocking):

[ins] In [3]: %%timeit
         ...: time.sleep(1e-6)
         ...:
         ...:
4.52 µs ± 570 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

which calls the corresponding system sleep.

Rust has:

async_std::task::sleep(Duration::from_nanos(1)).await;

which calls the system sleep in a thread.

One workaround is to use Libc.systemsleep. However, this blocks the calling thread. Similarly, one can use time() in a while-loop, but this also blocks.

One option which I would like comments on is the following function which is basically a non-blocking call of usleep by the use of @threadcall:

function systemsleep(seconds::Real)
    microseconds = round(UInt32, 1e6 * seconds)
    @threadcall(:usleep, UInt32, (UInt32,), microseconds)
    return nothing
end

This is OS-dependent and on some systems will not be better than sleep. But it's at least better for unix systems. With this function, one can get a higher resolution than Base.sleep:

julia> @btime systemsleep(1e-4)
  126.625 μs (11 allocations: 496 bytes)

julia> @btime sleep(1e-4)
  1.049 ms (4 allocations: 112 bytes)

with the smallest time being about 20us (presumably due to overhead of threadcall)

julia> @btime systemsleep(0)
  17.333 μs (11 allocations: 496 bytes)

There is also nanosleep which is another option to consider for this task: https://linux.die.net/man/2/nanosleep.

ufechner7 commented 5 months ago

Looks similar to https://github.com/JuliaLang/julia/issues/12770

MilesCranmer commented 5 months ago

Wow, 2015! Okay maybe this just needs a PR at this point rather than an issue..

stevengj commented 5 months ago

@threadcall(:nanosleep, Int, (Int,), nanoseconds)

This is not the API of nanosleep.

MilesCranmer commented 5 months ago

You are fast! I realised my mistake and deleted the comment about 30 seconds after posting 😅 Cunningham's law shows no mercy

ufechner7 commented 5 months ago

@MilesCranmer Feel free to create a pull request for https://github.com/ufechner7/Timers.jl if you have a good idea for a new sleep function and what to test it out before creating a pull request to Julia directly.

MilesCranmer commented 5 months ago

Thanks, I will probably just go with this code for now:

function systemsleep(seconds::Real)
    microseconds = round(Int, 1e6 * seconds)
    @threadcall(:usleep, Int, (Int,), microseconds)
    return nothing
end

It's small enough that I will just stick it directly in my library, but thanks for the offer.

nanosleep would likely need a change to the C side of Julia because it takes the timespec struct as input. And due to the overhead of @threadcall, we wouldn't be able to get to nanosecond precision anyways, so there's not much point in using nanosleep for a non-blocking call over usleep.

I think a "proper" solution to all of this should handle it directly in C with nanosleep and use jl_yield. I'm happy to help add it if Julia devs are interested.

stevengj commented 5 months ago

nanosleep would likely need a change to the C side of Julia because it takes the timespec struct as input.

usleep has a similar issue because useconds_t is a system-dependent width. Moreover, nanosleep can (perversely) sleep for much longer intervals that usleep because of the timespec.

It's not particularly difficult for us to add a new jl_nanosleep function wrapping this in a system-independent API. (What would we do on Windows? There doesn't seem to be a closely analogous Windows function?)

MilesCranmer commented 5 months ago

I think for Windows it could just fall back to regular sleep to start with. Later it could use https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers like what Python does: https://docs.python.org/3/library/time.html#time.sleep (only Windows 8.1+)

MilesCranmer commented 5 months ago

Relevant code in CPython: https://github.com/python/cpython/blob/81a654a3425eaa05a51342509089533c1f623f1b/Modules/timemodule.c#L2164-L2344

oscardssmith commented 5 months ago

How do these short system sleeps compare to

function yieldsleep(t)
    t1 = time()
    while time()-t1<t
        yield()
    end
end

It looks like it works pretty well down to ~10 microsecond (and below that, you probably should just busy-wait).

julia> @benchmark yieldsleep(1e-5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.057 μs …  29.295 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.069 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):    9.871 μs ± 621.258 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▃▂                     ▅█▆▁                              
  ▂▃▃▄▇███▄▃▃▂▁▁▁▁▁▁▁▁▁▁▂▂▃▄▅█████▅▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁ ▂
  9.06 μs         Histogram: frequency by time         11.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark yieldsleep(1e-6)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.507 μs …   6.189 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.711 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.711 μs ± 185.194 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁█▇▃       ▂▇▇▁        ▁                         
  ▃▅▅▅▃▂▂▂▂▂▂▃████▅▂▂▂▂▃▅████▅▂▂▂▂▂▃███▇▂▂▂▂▂▂▂▂▃▃▃▂▂▁▁▁▁▂▂▂▂ ▃
  1.51 μs         Histogram: frequency by time        2.01 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
MilesCranmer commented 5 months ago

I guess the downside of this yieldsleep is that it would occupy CPU time (from other processes) if nothing else is running in the thread?

oscardssmith commented 5 months ago

That's fair, the better version of this function probably includes a check for whether t>1e-3 in which case it just does a regular `sleep. For 1ms or less, worrying about the extra power consumption of keeping a single core alive is probably a mistake.

MilesCranmer commented 5 months ago

But this would really depend on the context as usually you wouldn't have a one-off sleep call but loop over them (like, say, polling workers or checking a signal). If I'm sleeping for N% of the time, regardless of the duration of each individual sleep, it would still mean N% redundant CPU usage spinning the core (in the case of a "busy" sleep) – whereas the system sleep would be closer to 0%

oscardssmith commented 5 months ago

I don't think you can get accurate times and low CPU usage with a loop of short sleeps. (since if you yield back to the OS, you can't know that you will be scheduled in time). If you are using this to try to respond to incoming events, you really want a push based system (or something like wait that is integrated into the Julia scheduler)

MilesCranmer commented 5 months ago

Just to be clear I didn't mean using a loop of short sleeps to emulate a single long sleep... I meant doing small tasks with short sleeps between them.

MilesCranmer commented 5 months ago

Another example: I want to check if a file exists, and check regularly (<1ms) so I can react to updates. This is where a short sleep separating reads is useful. You could also find this kind of pattern in control applications, graphics, or networking (with various requirements for resolution). I'm not sure wait can save us here, because the task being waited on would be another Julia task that would need to continuously spin a CPU core (or sleep itself). Short-duration sleep is a fairly common utility in other languages so seems like a good thing to have in general, if anything just for allowing common programming patterns.

oscardssmith commented 5 months ago

For the file existence example, there are OS APIs to receive notifications that a file has been created/destroyed etc and C# for example, has a platform independent API that wraps them https://learn.microsoft.com/en-us/dotnet/api/system.io.filesystemwatcher?view=net-8.0 OnCreated(FileSystemEventArgs). Specifically, for this use-case, you would call OnCreated(FileSystemEventArgs) which will call your registered callback whenever a file is created (that matches the filtering conditions).

I agree with your general point that we should make sleep more reliable for short sleeps (possibly via wait), but yielding back to the OS for single digit microseconds is a bad idea. The only advantage of sleep compared to yield is that sleep will let your computer do other work (or go into a low power mode), but if you're only sleeping for a few microseconds, there isn't enough time for much to happen anyway (since chips take a while to go into low power modes and going to the OS scheduler and having to clean out all your registers and cache takes a while).

vtjnash commented 5 months ago

I have looked into adding a higher precision mode to libuv, but on Windows (contrary to the python documentation, but per the kernel documentation linked to by the python documentation as its citation), the minimum possible sleep request is 1ms, and often it will round-up significantly more (usually up to the nearest 15ms interval). As Oscar mentioned, there are usually better ways to deal with short sleeps. For example, polling a file at 1ms will still often miss most updates to the file (as most updates to it won't change the metadata of the file), in addition to wasting a lot of computer time, that a call to the kernel filewatching stdlib would have gotten correct more often.

oscardssmith commented 5 months ago

Do we need/want libuv support for short sleeps? I think it would be reasonable to make sleeps less than 1ms be a yield loop even if libuv supported short sleeps.

MilesCranmer commented 5 months ago

I think if the operating system makes it possible to sleep for a certain interval under a millisecond, Julia should make that function available to the user without added overhead, blocking, or CPU usage.

MilesCranmer commented 5 months ago

contrary to the python documentation, but per the kernel documentation linked to by the python documentation as its citation), the minimum possible sleep request is 1ms, and often it will round-up significantly more (usually up to the nearest 15ms interval).

I think you might be reading the wrong page in the kernel docs? You can indeed get lower than 1ms resolution on Windows, I just verified it:

Screenshot 2024-06-29 at 12 43 48
oscardssmith commented 5 months ago

The question with the python implementation is what happens if you request a sleep for (say) 5e-7 seconds. It seems very plausible that python turns sleep 0 into a noop,

MilesCranmer commented 5 months ago

I'm not sure 5e-7 is relevant, when even 1e-4 is out of the picture at the moment for Julia?

Anyways, in Python, 5e-7 is about 4x slower than expected (on macOS; currently away from my Windows machine) –

[nav] In [2]: import time

[ins] In [3]: %%timeit
         ...: time.sleep(5e-7)
         ...: 
         ...: 
1.91 µs ± 18.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

I think even 1e-4 or 1e-5 sleeps would be nice in Julia, though, if 1e-6 isn't yet reachable, as the current status is (also on macOS) –

julia> @btime sleep(5e-7)
  1.145 ms (4 allocations: 112 bytes)
MilesCranmer commented 5 months ago

For reference, these are the main functions used by the python sleep code on Windows:

The docs for the second function mention:

The time after which the state of the timer is to be set to signaled, in 100 nanosecond intervals. Use the format described by the FILETIME structure. Positive values indicate absolute time. Be sure to use a UTC-based absolute time, as the system uses UTC-based time internally. Negative values indicate relative time. The actual timer accuracy depends on the capability of your hardware. For more information about UTC-based time, see System Time.

So it sounds like you can "request" sleep times in 100 ns intervals, but the actual accuracy is hardware-dependent.

The code shows that there's also an early-quit branch if sleep of 0 is passed – which calls Sleep(0) on Windows. So I guess on my machine, noting the 1e-6 example I gave above, ~20us is probably the lowest it can go on the hardware. (Also including the overhead of the other Python stuff in that call.)

In any case it would be nice to have access to this :)

vtjnash commented 5 months ago

Interesting. It seems that as of 2022, Python developers could prove in testing that the code didn't work (instead giving 16ms resolution) per https://bugs.python.org/issue45429

vtjnash commented 5 months ago

Also, just to point out the obvious: 1ms is already 1000 frames per second and therefore already significantly more than the OS checks for new keystrokes

MilesCranmer commented 5 months ago

I'm reading that thread now and they say CREATE_WAITABLE_TIMER_HIGH_RESOLUTION was the trick they used to fix it, which is used here: https://github.com/python/cpython/blob/81a654a3425eaa05a51342509089533c1f623f1b/Modules/timemodule.c#L447-L449

vtjnash commented 5 months ago

Ah, I see now where in the thread they mention this was implemented in the NT kernel in late 2021. I had not know that, as the documentation seems to not be fully updated for it

oscardssmith commented 5 months ago

Also, just to point out the obvious: 1ms is already 1000 frames per second and therefore already significantly more than the OS checks for new keystrokes

I don't think this is quite right. As I understand it, 1000hz polling rates were fairly common for gaming mice/keyboards.

vtjnash commented 5 months ago

Not with a stock kernel, though you can compile your own to increase it up to 1khz eg https://superuser.com/questions/369826/increase-usb-polling-rate-across-all-devices-in-linux