async-rs / async-std

Async version of the Rust standard library
https://async.rs
Apache License 2.0

async_std::sync::Mutex is too heavy #362

Open nbdd0121 opened 5 years ago

nbdd0121 commented 5 years ago

I've noticed that async-std uses std::sync::Mutex in quite a few places. It's well known that parking_lot offers much better performance and, more noticeably, space savings compared to the std mutex, thus enabling very fine-grained locks.

The current async_std::sync::Mutex and a few other places in async_std seem to be using std::sync::Mutex, which involves quite a few allocations and is very heavyweight. Is there any existing discussion about the performance and code-bloat impact of switching over to parking_lot?

There is also a possibility that we could use ideas similar to parking_lot's to create a very cheap and lightweight (i.e. only a few bytes, no extra allocation) async Mutex that is optimised for the non-contended case.

ghost commented 5 years ago

Perhaps we could reuse Spinlock from crossbeam-channel?

https://github.com/crossbeam-rs/crossbeam/blob/30f5d8b0a0cd6ea5923b771ce6c75834a1e44bec/crossbeam-channel/src/utils.rs#L64-L112

I'd accept a PR that switches from Mutex to Spinlock.

nbdd0121 commented 5 years ago

I did a few experiments, and I think I have an idea that can reduce async_std::sync::Mutex to the size of a usize and make it allocation-free when there is no contention.

Here's a brief outline of my plan. If you think it looks alright, I'll proceed to implementation and benchmark the performance differences.

Step 1: Replace Slab with a doubly linked list

Step 2: Replace the doubly linked list with a singly linked list

Step 3: Pack everything into an AtomicUsize

After all the steps, the size of Mutex<()> will shrink from 64 bytes to 8 bytes (a single usize), and it will be free to construct and destroy.
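For illustration, here is a minimal sketch of the uncontended fast path such a packed, single-word design might use. Everything here (the `PackedMutexState` name, the bit layout, the method names) is hypothetical and not taken from the actual implementation:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative only: one word of state. Bit 0 = locked; the remaining bits
/// could hold a pointer to the head of a list of waiting tasks (zero when
/// there is no contention).
const LOCKED: usize = 1;

struct PackedMutexState {
    state: AtomicUsize,
}

impl PackedMutexState {
    const fn new() -> Self {
        Self { state: AtomicUsize::new(0) }
    }

    /// Uncontended fast path: a single CAS, no allocation.
    fn try_lock(&self) -> bool {
        self.state
            .compare_exchange(0, LOCKED, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Fast unlock: only succeeds when no waiters are queued; otherwise the
    /// slow path would have to wake the next waiter from the list.
    fn unlock_fast(&self) -> bool {
        self.state
            .compare_exchange(LOCKED, 0, Ordering::Release, Ordering::Relaxed)
            .is_ok()
    }
}
```

The slow path, where the CAS fails, is where the linked list of waiting tasks from steps 1 and 2 would come in.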

ghost commented 5 years ago

Do we allocate nodes for the linked list on the heap? If so, I am worried that these allocations will be slower than using the slab.

Another possibility is to lazily allocate the slab on the heap only when it is actually needed (i.e. when there is contention on the async mutex). We could even put the address of the slab into the state word while using the lowest 2-3 bits for other flags.
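A minimal sketch of that pointer-tagging idea, assuming the lazily allocated wait queue is at least 4-byte aligned so its low bits are always zero; `WaitQueue`, the flag names, and the helpers are all hypothetical:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Because a heap-allocated wait queue is at least word-aligned, its address
// always has the low bits clear, so they can be reused as flags in the same word.
const LOCKED: usize = 0b01;
const HAS_WAITERS: usize = 0b10;
const PTR_MASK: usize = !0b11;

struct WaitQueue {
    // e.g. a slab or list of wakers, allocated only on first contention
    wakers: Vec<std::task::Waker>,
}

fn pack(queue: *const WaitQueue, flags: usize) -> usize {
    debug_assert_eq!(queue as usize & !PTR_MASK, 0, "pointer must be aligned");
    (queue as usize) | (flags & !PTR_MASK)
}

fn unpack(state: usize) -> (*const WaitQueue, usize) {
    ((state & PTR_MASK) as *const WaitQueue, state & !PTR_MASK)
}

fn main() {
    let queue = Box::into_raw(Box::new(WaitQueue { wakers: Vec::new() }));
    let state = AtomicUsize::new(pack(queue, LOCKED | HAS_WAITERS));

    let (ptr, flags) = unpack(state.load(Ordering::Relaxed));
    assert_eq!(ptr, queue as *const WaitQueue);
    assert_eq!(flags, LOCKED | HAS_WAITERS);

    // Free the queue again (normally done when the mutex is dropped).
    unsafe { drop(Box::from_raw(queue)) };
}
```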

nbdd0121 commented 5 years ago

The linked list nodes are allocated on the heap, for the benefit of a stable address. A linked list has the extra benefit of guaranteed O(1) insertion/removal (a slab uses a Vec, which is only amortised O(1)) and O(1) access to an item in the list (getting an item from a slab has O(n) complexity). A slab also does not shrink itself (and it is difficult for us to decide when to shrink it), so its worst-case space complexity is O(number of tasks × number of Mutexes). Linked lists have a worst-case space complexity of O(number of tasks).

From my initial experiments, performance actually goes up slightly by using linked lists, despite the allocation.

ghost commented 5 years ago

Interesting! Okay, I'm convinced :) On second thought, there must be tricks for optimizing allocations in the typical case, so I'm not as worried anymore...

I'd be happy to review a PR that replaces slab with linked lists.

Also, it would be great to have your experiments as microbenchmarks using #[bench].
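For reference, a rough sketch of what such #[bench] microbenchmarks could look like (nightly-only; the actual workload, task counts, and iteration counts used in the experiments may well differ):

```rust
// benches/mutex.rs -- requires a nightly toolchain for #[bench].
#![feature(test)]
extern crate test;

use async_std::sync::Mutex;
use async_std::task;
use std::sync::Arc;
use test::Bencher;

#[bench]
fn create(b: &mut Bencher) {
    // Measures construction cost only.
    b.iter(|| Mutex::new(()));
}

#[bench]
fn contention(b: &mut Bencher) {
    // Many tasks hammering the same mutex; the 10 tasks and 1,000 lock
    // operations per task are arbitrary numbers for illustration.
    b.iter(|| {
        task::block_on(async {
            let m = Arc::new(Mutex::new(0u64));
            let mut handles = Vec::new();
            for _ in 0..10 {
                let m = m.clone();
                handles.push(task::spawn(async move {
                    for _ in 0..1_000 {
                        *m.lock().await += 1;
                    }
                }));
            }
            for h in handles {
                h.await;
            }
        })
    });
}
```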

Matthias247 commented 5 years ago

> I did a few experiments, and I think I have an idea that can reduce async_std::sync::Mutex to the size of a usize and make it allocation-free when there is no contention.
>
> Do we allocate nodes for the linked list on the heap? If so, I am worried that these allocations will be slower than using the slab.
>
> The linked list nodes are allocated on the heap, for the benefit of a stable address. A linked list has the extra benefit of guaranteed O(1) insertion/removal (a slab uses a Vec, which is only amortised O(1)) and O(1) access to an item in the list (getting an item from a slab has O(n) complexity). A slab also does not shrink itself (and it is difficult for us to decide when to shrink it), so its worst-case space complexity is O(number of tasks × number of Mutexes). Linked lists have a worst-case space complexity of O(number of tasks).
>
> From my initial experiments, performance actually goes up slightly by using linked lists, despite the allocation.

If you just use futures_intrusive::sync::Mutex you get your parking-lot backed small mutex without any need for heap allocations - and even without the need to write the code yourself :-)
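For context, this is roughly how that mutex is used, based on my reading of the futures_intrusive docs (the constructor takes the initial value plus an is_fair flag, the same option referenced in the benchmark numbers further down):

```rust
// Hedged usage sketch; requires the futures_intrusive crate as a dependency.
use futures_intrusive::sync::Mutex;

async fn example() {
    // Second argument selects fair (FIFO) queueing of waiters.
    let mutex = Mutex::new(0u32, false);
    {
        let mut guard = mutex.lock().await;
        *guard += 1;
    } // guard dropped here, releasing the lock

    assert_eq!(*mutex.lock().await, 1);
}
```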

Matthias247 commented 5 years ago

> Perhaps we could reuse Spinlock from crossbeam-channel?
>
> https://github.com/crossbeam-rs/crossbeam/blob/30f5d8b0a0cd6ea5923b771ce6c75834a1e44bec/crossbeam-channel/src/utils.rs#L64-L112
>
> I'd accept a PR that switches from Mutex to Spinlock.

Btw: I would be super careful with pure spinlocks in userspace. If the OS schedules away the thread that holds the lock, you will start wasting a ton of resources and enter the wonderful world of priority inversion. IMHO they are a tool for kernels and embedded applications. For normal userspace applications one should fall back to a kernel wait primitive after some spins, or just avoid them in general.

nbdd0121 commented 5 years ago

> If you just use futures_intrusive::sync::Mutex you get your parking-lot backed small mutex without any need for heap allocations - and even without the need to write the code yourself :-)

I looked at the code, and I think the basic idea is the same. However, using it would pull in a large number of dependencies, which this project tries to avoid. My PR also does better than the code you refer to, so I don't see the point of switching to it.

Matthias247 commented 5 years ago

Yes, it's literally a HUGE amount of dependencies 😀

nbdd0121 commented 5 years ago

> Btw: I would be super careful with pure spinlocks in userspace. If the OS schedules away the thread that holds the lock, you will start wasting a ton of resources and enter the wonderful world of priority inversion. IMHO they are a tool for kernels and embedded applications. For normal userspace applications one should fall back to a kernel wait primitive after some spins, or just avoid them in general.

That's true, and my benchmarks show that naively changing everything to a spinlock causes a 4x performance degradation. However, after changing from Slab to a linked list, we only spend a very small amount of time in the critical region, and the benchmark shows similar (in fact slightly better) performance compared with the current state.

FYI, crossbeam's Spinlock also has a backoff policy that yields to other threads if it cannot acquire the lock within a few spins.
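For anyone unfamiliar with the pattern, this is the general shape of a bounded-spin-then-yield policy, sketched here with std only rather than crossbeam's actual Backoff type; the spin threshold of 10 is arbitrary:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::{hint, thread};

struct Spinlock {
    locked: AtomicBool,
}

impl Spinlock {
    const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    fn lock(&self) {
        let mut spins = 0u32;
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            if spins < 10 {
                // Busy-wait briefly: cheap if the holder releases soon.
                hint::spin_loop();
                spins += 1;
            } else {
                // Give the OS a chance to run the lock holder instead of
                // burning CPU (and risking priority inversion).
                thread::yield_now();
            }
        }
    }

    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

A full-blown implementation would eventually park the thread on an OS wait primitive after enough failed attempts (which is what parking_lot does) rather than yielding forever.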

nbdd0121 commented 5 years ago

> Yes, it's literally a HUGE amount of dependencies 😀

If you look at #252, you will notice that parking_lot was explicitly removed as a dependency from this project.

jbg commented 5 years ago

Reducing deps is all well and good, but it doesn't make much sense to end up writing and maintaining equivalent code here when the potential dependency is of good quality and maintained.

ghost commented 5 years ago

For reference, these are our current async_std::sync::Mutex benchmarks:

test contention    ... bench:     893,650 ns/iter (+/- 44,336)
test create        ... bench:           4 ns/iter (+/- 0)
test no_contention ... bench:     386,525 ns/iter (+/- 368,903)

This is futures_intrusive::sync::Mutex with default Cargo options and with is_fair set to false:

test contention    ... bench:   1,968,689 ns/iter (+/- 303,900)
test create        ... bench:           8 ns/iter (+/- 0)
test no_contention ... bench:     431,264 ns/iter (+/- 423,020)

This is tokio::sync::Mutex:

test contention    ... bench:   2,614,997 ns/iter (+/- 167,533)
test create        ... bench:          24 ns/iter (+/- 6)
test no_contention ... bench:     516,801 ns/iter (+/- 139,907)

This is futures::lock::Mutex:

test contention    ... bench:   1,747,920 ns/iter (+/- 149,184)
test create        ... bench:          38 ns/iter (+/- 1)
test no_contention ... bench:     315,463 ns/iter (+/- 280,223)

There is some variance to these numbers between consecutive runs, though.

Matthias247 commented 5 years ago

@stjepang Are the sources of these benchmarks available somewhere?

Are these the ones from #370? If so, then I think the "contention" benchmark only does part of what the name says: it only measures contention within a single thread, so cross-thread synchronization never has an impact. In that kind of benchmark, futures_intrusive::sync::LocalMutex would likely be blazing fast because it elides synchronization entirely.

Edit: Oh, now I see they are here

jimblandy commented 4 years ago

If I'm reading the code correctly, it looks like locking Tokio's Mutex requires no heap allocations even when it's contended.

The idea is that the future returned by Mutex::lock can itself be a member of an intrusive linked list of tasks waiting for the mutex. It seems like only the waker from the most recent poll of a given lock future is retained, which seems odd to me, but I could be misreading.
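As a rough illustration of that intrusive idea (layout only; the names below are made up and this is not Tokio's actual code), the waiter node lives inside the lock future itself, so no separate heap allocation is needed per waiter:

```rust
use std::ptr;
use std::task::Waker;

// The future returned by `lock()` owns its own waiter node; queuing a waiter
// just splices this node into an intrusive list kept by the mutex, which is
// why the future must stay pinned while it sits in that list.
struct Waiter {
    // Overwritten on each poll; only the most recent waker is kept.
    waker: Option<Waker>,
    prev: *mut Waiter,
    next: *mut Waiter,
}

struct LockFuture<'a, T> {
    mutex: &'a AsyncMutex<T>,
    waiter: Waiter,
}

struct AsyncMutex<T> {
    // The head of the intrusive waiter list plus the protected value would
    // live here; omitted in this sketch.
    value: T,
}

impl Waiter {
    fn new() -> Self {
        Waiter { waker: None, prev: ptr::null_mut(), next: ptr::null_mut() }
    }
}
```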

ckaran commented 3 years ago

@stjepang Can you add benchmarks using your async-lock crate as well? I know that you've got two different crates (async-mutex and async-lock, which appear to be forks of one another), and that this crate uses async-mutex, but I wanted to know whether you've made speed-ups in one vs. the other. Basically, I want to know which of those crates I should be using in the future.