Mutex performance drop in case of cache contention

parking_lot's Mutex suffers a large performance drop under cache line contention. This can be reproduced with lock-bench:

$ cargo run --release 32 2 10000 100
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 113.030539ms min 105.760154ms max 131.87258ms
parking_lot::Mutex   avg 403.756547ms min 326.026509ms max 533.260014ms
spin::Mutex          avg 161.125953ms min 151.708034ms max 177.132377ms
AmdSpinlock          avg 158.042233ms min 148.723058ms max 171.265994ms

This was observed on a 120-core Intel CPU.
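For readers unfamiliar with the benchmark, here is a rough sketch of the kind of workload the options above describe: n_threads threads each perform n_ops lock/increment/unlock operations spread over n_locks mutexes. This is not lock-bench's actual code. The function name run_round and the deterministic lock selection are assumptions made for illustration, and the sketch uses std::sync::Mutex only.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical single benchmark round (not taken from lock-bench).
/// Returns the wall-clock time and the sum of all counters.
pub fn run_round(n_threads: usize, n_locks: usize, n_ops: usize) -> (Duration, u64) {
    assert!(n_locks > 0, "need at least one lock");

    // A small number of locks shared by many threads forces contention.
    let locks: Arc<Vec<Mutex<u64>>> =
        Arc::new((0..n_locks).map(|_| Mutex::new(0)).collect());

    let start = Instant::now();
    let handles: Vec<_> = (0..n_threads)
        .map(|t| {
            let locks = Arc::clone(&locks);
            thread::spawn(move || {
                for i in 0..n_ops {
                    // Assumption: deterministic lock choice; a real
                    // benchmark may pick the lock at random instead.
                    let idx = (t + i) % locks.len();
                    *locks[idx].lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    let elapsed = start.elapsed();

    // Every operation incremented exactly one counter.
    let total: u64 = locks.iter().map(|m| *m.lock().unwrap()).sum();
    (elapsed, total)
}
```

With 32 threads and only 2 locks, almost every lock() call in such a loop is contended, which is the situation the numbers above measure.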

Debugging shows that the current spin() strategy enters the parking state as soon as the spin-wait phase fails. Parking adds an overhead of roughly 100ms (estimated from the lock-bench data). Under cache contention in multi-core, multi-threaded scenarios, spin() is much more likely to fail, which leads to longer lock durations (the parked-thread list is protected by another lock, which may suffer from cache contention too). Given the ~100ms cost of parking and spin durations of a millisecond or less, an intermediate step is needed between the two to avoid the performance loss caused by entering the parking state too often.

Currently, spin-wait uses yield. When multiple threads run on the same core, yielding effectively relieves contention. However, when the scheduler's ready queue contains only the current thread, yielding has little or no effect. One possible approach is to add a short sleep after spin() and before parking. In my tests this gives better results than using _mm_pause or yield: it improves the lock-bench numbers considerably and shows better CPU utilization. I'll submit a PR later.
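To make the idea concrete, here is a minimal sketch of a staged acquire path: fast path, bounded spin/yield, then a short sleep. It is a toy lock, not parking_lot's RawMutex and not the code of the planned PR. StagedLock, Stage, SPIN_ROUNDS and SLEEP are hypothetical names, and the sketch simply keeps sleeping where a real implementation would eventually fall back to parking.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

/// Hypothetical tuning values, chosen only for illustration.
const SPIN_ROUNDS: u32 = 10;
const SLEEP: Duration = Duration::from_millis(1);

/// Which stage of the acquire path obtained the lock.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Fast,
    Spin,
    Sleep,
}

/// Toy lock used to illustrate the staging; it has no wait queue.
pub struct StagedLock {
    locked: AtomicBool,
}

impl StagedLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    fn try_lock(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    pub fn lock(&self) -> Stage {
        // Stage 0: uncontended fast path.
        if self.try_lock() {
            return Stage::Fast;
        }
        // Stage 1: bounded spinning, first busy-waiting, then yielding.
        for round in 0..SPIN_ROUNDS {
            if round < 3 {
                for _ in 0..(1u32 << round) {
                    hint::spin_loop();
                }
            } else {
                thread::yield_now();
            }
            if self.try_lock() {
                return Stage::Spin;
            }
        }
        // Stage 2: short sleeps as a buffer before the expensive path.
        // A real mutex would park the thread after a bounded number of
        // sleeps; this sketch keeps sleeping to stay self-contained.
        loop {
            thread::sleep(SLEEP);
            if self.try_lock() {
                return Stage::Sleep;
            }
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

The point of the sleep stage is that a sleeping thread stops hammering the contended cache line, unlike a thread that spins or yields with nothing else to run, while still avoiding the cost of the parking path for short waits.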

Sleep 1ms before parking:

$ cargo run --release 32 2 10000 100
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 113.276158ms min 103.870893ms max 131.823024ms
parking_lot::Mutex   avg 81.669426ms  min 72.584055ms  max 88.3535ms
spin::Mutex          avg 161.586476ms min 152.302867ms max 184.132674ms
AmdSpinlock          avg 157.446488ms min 147.091038ms max 180.832205ms

Amanieu/parking_lot, issue #418