parking_lot's Mutex has a performance problem under cache line contention: throughput drops sharply. This can be reproduced with lock-bench:
$ cargo run --release 32 2 10000 100
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 113.030539ms min 105.760154ms max 131.87258ms
parking_lot::Mutex avg 403.756547ms min 326.026509ms max 533.260014ms
spin::Mutex avg 161.125953ms min 151.708034ms max 177.132377ms
AmdSpinlock avg 158.042233ms min 148.723058ms max 171.265994ms
This was observed on a 120-core Intel CPU.
Debugging shows that the current spin() strategy enters the Parking state as soon as "spinwait" fails. Judging from the lock-bench data, the parking mechanism adds roughly 100ms of overhead. Under cache contention in multi-core, multi-threaded scenarios, spin() is significantly more likely to fail, which leads to longer lock hold times (the parked-thread list is protected by another lock, which may suffer from cache contention too). Given the ~100ms overhead of parking and a spin phase that lasts about a millisecond or less, there needs to be an intermediate step between the two to avoid the performance loss caused by entering the parking state too often.
Currently, yield is used in spinwait. When multiple threads are running on the same core, yield can effectively alleviate contention. However, when the scheduler's ready queue contains only the current thread, yield has little or no effect. One possible fix is to add a short sleep after spin() and before parking. In my tests this gives good results, better than using _mm_pause or yield: it improves the lock-bench numbers considerably and shows better CPU utilization. I'll submit a PR later.
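To illustrate the idea, here is a minimal sketch of a spin, then sleep, then park sequence on a bare AtomicBool. It is not parking_lot's code. The names acquire_before_park and Outcome, the spin count, and the single sleep round are assumptions made for illustration. Only the 1ms sleep duration comes from the experiment described here.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

// Illustrative constants (assumptions, not parking_lot's values).
const SPIN_ROUNDS: u32 = 100;
const SLEEP_ROUNDS: u32 = 1;
const SLEEP: Duration = Duration::from_millis(1);

/// Result of the phases that run before parking (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    AcquiredSpinning,
    AcquiredAfterSleep,
    /// Both phases failed; the caller would now park the thread.
    ShouldPark,
}

/// Try to flip the flag from unlocked (false) to locked (true).
fn try_lock(locked: &AtomicBool) -> bool {
    locked
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

/// Spin briefly, then sleep briefly, and only then report that
/// parking is needed.
pub fn acquire_before_park(locked: &AtomicBool) -> Outcome {
    // Phase 1: busy-wait, cheap when the lock is released quickly.
    for _ in 0..SPIN_ROUNDS {
        if try_lock(locked) {
            return Outcome::AcquiredSpinning;
        }
        hint::spin_loop();
    }
    // Phase 2: the proposed buffer, a short sleep before parking.
    for _ in 0..SLEEP_ROUNDS {
        thread::sleep(SLEEP);
        if try_lock(locked) {
            return Outcome::AcquiredAfterSleep;
        }
    }
    // Phase 3 is left to the caller: enqueue and park.
    Outcome::ShouldPark
}
```

A caller would fall through to the existing park path only when this returns Outcome::ShouldPark. Unlocking in this sketch is a plain store(false, Ordering::Release) on the flag.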
Sleep 1ms before parking:
$ cargo run --release 32 2 10000 100
Options {
n_threads: 32,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 113.276158ms min 103.870893ms max 131.823024ms
parking_lot::Mutex avg 81.669426ms min 72.584055ms max 88.3535ms
spin::Mutex avg 161.586476ms min 152.302867ms max 184.132674ms
AmdSpinlock avg 157.446488ms min 147.091038ms max 180.832205ms