nbdd0121 opened this issue 5 years ago
Perhaps we could reuse Spinlock from crossbeam-channel?

I'd accept a PR that switches from Mutex to Spinlock.
I did a few experiments, and I think I have an idea that can reduce async_std::sync::Mutex to the size of a usize and make it allocation-free when there isn't contention.

Here's a brief list of my plan. If you think it looks alright, I'll proceed to implementation and benchmark the performance differences.

Step 1: Replace Slab with a doubly linked list. This reduces the size of blocked to 16 bytes (from 40 bytes for slab) plus Mutex overhead.

Step 2: Replace the doubly linked list with a singly linked list. This reduces blocked (which will be the linked list head pointer) to 8 bytes plus Mutex overhead.

Step 3: Pack everything into an AtomicUsize. I expect the WakerList implemented in this patch series to be useful for other primitives as well, such as RwLock/Condvar.

After all steps, the size of Mutex<()> will be reduced from 64 bytes to 8–16 bytes, and it will be free to construct and destruct.
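To illustrate what step 3 would look like, here is a rough sketch; the names PackedMutexState and WakerNode are placeholders for illustration, not the actual patch, and the slow path of waking blocked tasks is omitted:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::task::Waker;

// Each blocked task gets one heap-allocated node with a stable address;
// the nodes form a singly linked list, so the mutex itself only needs to
// store the head pointer.
struct WakerNode {
    waker: Option<Waker>,
    next: *mut WakerNode,
}

// A WakerNode allocation is pointer-aligned, so the low bits of its address
// are always zero and can be used as flags.
const LOCKED: usize = 0b01;
const HAS_BLOCKED: usize = 0b10;

struct PackedMutexState {
    // Head pointer of the waker list OR'd with the flag bits above.
    state: AtomicUsize,
}

impl PackedMutexState {
    const fn new() -> Self {
        PackedMutexState { state: AtomicUsize::new(0) }
    }

    // Uncontended fast path: no allocation, a single atomic RMW.
    fn try_lock(&self) -> bool {
        self.state.fetch_or(LOCKED, Ordering::Acquire) & LOCKED == 0
    }

    fn unlock(&self) {
        // A real implementation must also pop a WakerNode and wake it when
        // HAS_BLOCKED is set; that slow path is omitted from this sketch.
        self.state.fetch_and(!LOCKED, Ordering::Release);
    }
}
```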
Do we allocate nodes for the linked list on the heap? If so, I am worried that these allocations will be slower than using the slab.
Another possibility is to lazily allocate the slab on the heap only when it is actually needed (i.e. there is contention on the async mutex). We could even put the address of the slab into state while using the lowest 2-3 bits for other stuff.
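Something along these lines, purely as a sketch of the pointer-tagging idea (Slab and the flag names here are placeholders, not async-std code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::task::Waker;

const LOCKED: usize = 0b01;
const HAS_WAITERS: usize = 0b10;
const PTR_MASK: usize = !(LOCKED | HAS_WAITERS);

// Placeholder for the waiter storage; the Vec field keeps the struct
// pointer-aligned, so the two low bits of its heap address are zero.
struct Slab {
    entries: Vec<Option<Waker>>,
}

struct State {
    // 0 = unlocked, no waiters, and no slab allocated yet.
    state: AtomicUsize,
}

impl State {
    // Slow path only: allocate the slab lazily and race to publish it.
    fn slab_ptr(&self) -> *mut Slab {
        let cur = self.state.load(Ordering::Acquire);
        if cur & PTR_MASK != 0 {
            return (cur & PTR_MASK) as *mut Slab;
        }
        let new = Box::into_raw(Box::new(Slab { entries: Vec::new() }));
        let mut cur = cur;
        loop {
            if cur & PTR_MASK != 0 {
                // Lost the race: another thread already installed a slab.
                drop(unsafe { Box::from_raw(new) });
                return (cur & PTR_MASK) as *mut Slab;
            }
            match self.state.compare_exchange_weak(
                cur,
                new as usize | (cur & (LOCKED | HAS_WAITERS)),
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return new,
                Err(actual) => cur = actual,
            }
        }
    }
}
```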
The linked list nodes are allocated on the heap, for the benefit of a stable address. A linked list has the extra benefit of guaranteed O(1) insertion/removal (slab uses Vec, which is amortised O(1)), and O(1) access to a given item (getting an item from the slab has O(n) complexity). Slab also does not shrink itself (and it is difficult for us to decide when to shrink it), so its worst-case space complexity is O(number of tasks × number of Mutexes). Linked lists have a worst-case space complexity of O(number of tasks).
From my initial experiments, performance actually goes up slightly by using linked lists, despite the allocation.
Interesting! Okay, I'm convinced :) On second thought, there must be tricks for optimizing allocations in the typical case, so I'm not as worried anymore...
I'd be happy to review a PR that replaces slab with linked lists.

Also, it would be great to have your experiments as microbenchmarks using #[bench].
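Something like the following nightly-only harness would do; the task counts and contention pattern here are just an assumption for illustration, not the exact benchmark that produced the numbers later in this thread:

```rust
// benches/mutex.rs — requires a nightly toolchain for #[bench].
#![feature(test)]
extern crate test;

use std::sync::Arc;

use async_std::sync::Mutex;
use async_std::task;
use test::Bencher;

#[bench]
fn create(b: &mut Bencher) {
    b.iter(|| Mutex::new(()));
}

#[bench]
fn contention(b: &mut Bencher) {
    b.iter(|| {
        task::block_on(async {
            let mutex = Arc::new(Mutex::new(0u64));
            // Several tasks hammering one mutex so the blocked/waker path
            // (slab or linked list) is actually exercised.
            let handles: Vec<_> = (0..10)
                .map(|_| {
                    let mutex = mutex.clone();
                    task::spawn(async move {
                        for _ in 0..100 {
                            *mutex.lock().await += 1;
                        }
                    })
                })
                .collect();
            for handle in handles {
                handle.await;
            }
        })
    });
}
```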
If you just use futures_intrusive::sync::Mutex you get your parking-lot backed small mutex without any need for heap allocations - and even without the need to write the code yourself :-)
Btw: I would be super careful with pure Spinlocks in userspace. If the OS schedules away the thread which holds the lock you will start wasting a ton of resources and enter the wonderful world of priority inversions. IMHO they are a tool for Kernels and embedded applications. For normal userspace applications one should fallback to a Kernel wait primitive after some spins, or just avoid them in general.
I looked at the code, and I think the basic idea is the same. However, using it would pull in a large number of dependencies, which this project tries to avoid. My PR also does better than the code you refer to, so I don't see the point of switching to it.
Yes, it's literally a HUGE amount of dependencies 😀
That's true regarding spinlocks, and my benchmarks show that naively changing everything to a spinlock causes a 4x performance degradation. However, after changing from Slab to a linked list, we only spend a very small amount of time in the critical region, and the benchmark shows similar (and slightly better) performance compared with the current state.

FYI, crossbeam's Spinlock also has a backoff policy which will yield to other threads if it cannot acquire the lock in a few spins.
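crossbeam-channel's Spinlock is an internal type, but the same backoff idea can be sketched with the public crossbeam_utils::Backoff (this is an illustration, not the actual crossbeam code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use crossbeam_utils::Backoff;

pub struct Spinlock {
    locked: AtomicBool,
}

impl Spinlock {
    pub const fn new() -> Self {
        Spinlock { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        let backoff = Backoff::new();
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            // snooze() spins for the first few iterations and then starts
            // calling thread::yield_now(), so a descheduled lock holder does
            // not leave us burning a full timeslice.
            backoff.snooze();
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```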
If you see #252 you will notice that parking-lot is explicitly removed as a dependency from this project.
Reducing deps is all well and good, but it doesn't make much sense to end up writing and maintaining equivalent code here when the potential dependency is of good quality and maintained.
For reference, these are our current async_std::sync::Mutex benchmarks:
test contention ... bench: 893,650 ns/iter (+/- 44,336)
test create ... bench: 4 ns/iter (+/- 0)
test no_contention ... bench: 386,525 ns/iter (+/- 368,903)
This is futures_intrusive::sync::Mutex with default Cargo options and with is_fair set to false:
test contention ... bench: 1,968,689 ns/iter (+/- 303,900)
test create ... bench: 8 ns/iter (+/- 0)
test no_contention ... bench: 431,264 ns/iter (+/- 423,020)
This is tokio::sync::Mutex:
test contention ... bench: 2,614,997 ns/iter (+/- 167,533)
test create ... bench: 24 ns/iter (+/- 6)
test no_contention ... bench: 516,801 ns/iter (+/- 139,907)
This is futures::lock::Mutex:
test contention ... bench: 1,747,920 ns/iter (+/- 149,184)
test create ... bench: 38 ns/iter (+/- 1)
test no_contention ... bench: 315,463 ns/iter (+/- 280,223)
There is some variance to these numbers between consecutive runs, though.
@stjepang Are the sources of these benchmarks available somewhere? Are they the ones from #370? If so, then I think the "contention" benchmark only does part of what the name says: it only measures contention within a single thread, so cross-thread synchronization never has an impact. In that kind of benchmark, futures_intrusive::sync::LocalMutex would likely be blazing fast due to eliding synchronization.

Edit: Oh, now I see they are here
If I'm reading the code correctly, it looks like locking Tokio's Mutex requires no heap allocations even when it's contended. The idea is that the future returned by Mutex::lock can itself be a member of an intrusive linked list of tasks waiting for the mutex. It seems like only the waker from the most recent poll of a given lock future is retained, which seems odd to me, but I could be misreading.
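A rough sketch of that "latest waker only" behaviour (LockFuture here is a made-up stand-in, not Tokio's type): the future overwrites its stored waker on every poll, so whichever task polled it last is the one that gets woken.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, Waker};

struct LockFuture {
    acquired: bool,       // stand-in for "did we get the mutex?"
    waker: Option<Waker>, // slot a real impl would link into the waiter list
}

impl Future for LockFuture {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.acquired {
            return Poll::Ready(());
        }
        // Overwrite any previously stored waker with the current one; only
        // the most recent poll's waker will ever be used for a wake-up.
        self.waker = Some(cx.waker().clone());
        Poll::Pending
    }
}
```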
@stjepang Can you add in the benchmarks using your async-lock crate as well? I know that you've got two different crates (async-mutex and async-lock, which appear to be forks of one another), and that this crate uses async-mutex, but I wanted to know if you've made some speedups in one vs. the other. Basically, I want to know which of those crates I should be using in the future.
I've noticed that async-std uses std::sync::Mutex in quite a few places. It's well known that parking_lot offers much better performance, and more noticeably space savings, compared to the std mutex, thus enabling very fine-grained locks.

The current async_std::sync::Mutex and a few other places in async-std seem to be using std::sync::Mutex, which has quite a few allocations and is very heavyweight. Is there any existing discussion about the performance and code-bloat impact of switching over to parking_lot?

There is also a possibility that we can use ideas similar to parking_lot's to create a very cheap and lightweight (i.e. only a few bytes, no extra allocation) async Mutex that optimises for the non-contended case.
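A quick way to sanity-check the size argument locally (this assumes parking_lot and async-std as dependencies; exact numbers depend on platform, Rust version, and crate versions):

```rust
use std::mem::size_of;

fn main() {
    // Sizes vary with platform, Rust version, and crate versions.
    println!("std::sync::Mutex<()>:       {} bytes", size_of::<std::sync::Mutex<()>>());
    println!("parking_lot::Mutex<()>:     {} bytes", size_of::<parking_lot::Mutex<()>>());
    println!("async_std::sync::Mutex<()>: {} bytes", size_of::<async_std::sync::Mutex<()>>());
}
```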