low benchmark and cpu bottleneck issue

dista commented 7 months ago

We have a video/audio streaming application built on tokio (which enables parking_lot by default). With parking_lot enabled, when we use wrk to benchmark the application's HTTP output, the application is CPU-bottlenecked: no matter how many threads we assign to tokio (with 32 threads, for example, it should be able to reach 3200%), the application's CPU usage cannot exceed 600%.

After disabling parking_lot, we reach the numbers we anticipated.

The parking_lot benchmark results are very poor on our server (AMD EPYC 7502, Rocky Linux 9.3, kernel 5.14.0-362.8.1.el9_3.x86_64): cargo run --release 32 2 10000 100

std::sync::Mutex     avg 24.545373ms  min 19.094168ms  max 27.440505ms 
parking_lot::Mutex   avg 437.688227ms min 403.22461ms  max 493.899443ms
spin::Mutex          avg 32.515297ms  min 22.190278ms  max 56.560499ms 
AmdSpinlock          avg 32.379324ms  min 23.217082ms  max 48.166241ms
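
For readers who want to reproduce this kind of measurement, here is a minimal sketch of a lock-contention benchmark using only std::sync::Mutex. It assumes the four command-line arguments mean thread count, lock count, operations per thread, and rounds; the thread does not state this, so treat it as a guess. The names run_round and summarize are hypothetical and are not taken from the actual benchmark.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Runs one round: `n_threads` threads each perform `n_ops` lock/unlock
/// cycles spread over `n_locks` mutex-protected counters.
/// Returns the elapsed wall-clock time and the sum of all counters.
fn run_round(n_threads: usize, n_locks: usize, n_ops: usize) -> (Duration, u64) {
    assert!(n_locks > 0, "need at least one lock");
    let locks: Arc<Vec<Mutex<u64>>> =
        Arc::new((0..n_locks).map(|_| Mutex::new(0)).collect());
    let start = Instant::now();
    let handles: Vec<_> = (0..n_threads)
        .map(|t| {
            let locks = Arc::clone(&locks);
            thread::spawn(move || {
                for i in 0..n_ops {
                    // Deterministic lock choice (an assumption of this sketch);
                    // a real benchmark would more likely pick locks at random.
                    let idx = (t + i) % locks.len();
                    *locks[idx].lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let elapsed = start.elapsed();
    let total: u64 = locks.iter().map(|m| *m.lock().unwrap()).sum();
    (elapsed, total)
}

/// Computes (avg, min, max) over the per-round timings.
fn summarize(samples: &[Duration]) -> (Duration, Duration, Duration) {
    let min = *samples.iter().min().unwrap();
    let max = *samples.iter().max().unwrap();
    let avg = samples.iter().sum::<Duration>() / samples.len() as u32;
    (avg, min, max)
}

fn main() {
    // Smaller than the `32 2 10000 100` used above so the sketch finishes quickly.
    let (n_threads, n_locks, n_ops, n_rounds) = (8, 2, 10_000, 10);
    let samples: Vec<Duration> = (0..n_rounds)
        .map(|_| run_round(n_threads, n_locks, n_ops).0)
        .collect();
    let (avg, min, max) = summarize(&samples);
    println!("std::sync::Mutex     avg {:?}  min {:?}  max {:?}", avg, min, max);
}
```

Swapping in another mutex type with the same lock/unlock shape would let the rows be compared on the same machine; absolute numbers depend heavily on core count and topology.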

system information:

CPU

processor       : 63
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7502 32-Core Processor
stepping        : 0
microcode       : 0x830107a
cpu MHz         : 2500.000
cache size      : 512 KB
physical id     : 1
siblings        : 32
core id         : 31
cpu cores       : 32
apicid          : 95
initial apicid  : 95
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb
bogomips        : 4988.92
TLB size        : 3072 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

memory: 250G

numactl --show

policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1 
preferred:

pingzhaozz commented 7 months ago

I'm interested in this; it seems similar to an issue I ran into before. Does this PR, https://github.com/Amanieu/parking_lot/pull/419, fix your problem?

dista commented 7 months ago

> I'm interested in this which seems similar issue I met before. Does this PR #419 fix your problem?

Currently I simply disable parking_lot in tokio. I ran your PR's benchmark on my machine: cargo run --release 32 2 10000 100

std::sync::Mutex     avg 18.299529ms  min 17.132839ms  max 21.931748ms 
parking_lot::Mutex   avg 16.937637ms  min 14.974131ms  max 19.934233ms 
spin::Mutex          avg 30.254703ms  min 12.640727ms  max 54.775686ms 
AmdSpinlock          avg 31.261368ms  min 14.366787ms  max 61.797515ms

But honestly, I do not like the idea of thread::sleep(1ms); I think it may hurt performance in other ways.

pingzhaozz commented 7 months ago

thread::sleep(1ms) only happens after spinning, cpu_relax, and thread_yield have all failed, which means there is really heavy cache-line contention. In that case the thread should sleep to avoid busy-waiting.
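
The escalation described here (spin, then yield, then sleep) can be sketched roughly as follows. This is an illustration only, not the code from the PR: the Backoff type, the stage thresholds (6 spin steps, then 4 yields), and lock_with_backoff are all hypothetical. The sleep duration is a parameter so that different values can be tried.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

/// Which waiting strategy a given backoff step used.
#[derive(Debug, PartialEq)]
enum Action {
    Spin,
    Yield,
    Sleep,
}

/// Escalating backoff. The thresholds are assumptions chosen for this sketch.
struct Backoff {
    step: u32,
    spin_limit: u32,  // steps below this busy-wait with a CPU relax hint
    yield_limit: u32, // steps below this yield to the OS scheduler
    sleep: Duration,  // used once spinning and yielding are exhausted
}

impl Backoff {
    fn new(sleep: Duration) -> Self {
        Backoff { step: 0, spin_limit: 6, yield_limit: 10, sleep }
    }

    /// Waits once and reports which stage was used.
    fn snooze(&mut self) -> Action {
        let action = if self.step < self.spin_limit {
            // Exponentially longer busy-wait: 1, 2, 4, ... relax hints.
            for _ in 0..(1u32 << self.step) {
                hint::spin_loop();
            }
            Action::Spin
        } else if self.step < self.yield_limit {
            thread::yield_now();
            Action::Yield
        } else {
            // Heavy contention: stop burning CPU and sleep instead.
            thread::sleep(self.sleep);
            Action::Sleep
        };
        self.step = self.step.saturating_add(1);
        action
    }
}

/// Acquires a simple test-and-set flag, backing off between attempts.
fn lock_with_backoff(flag: &AtomicBool, sleep: Duration) {
    let mut backoff = Backoff::new(sleep);
    while flag
        .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        backoff.snooze();
    }
}

fn main() {
    let mut backoff = Backoff::new(Duration::from_millis(1));
    for step in 0..12 {
        println!("step {:2}: {:?}", step, backoff.snooze());
    }
    let flag = AtomicBool::new(false);
    lock_with_backoff(&flag, Duration::from_millis(1));
    flag.store(false, Ordering::Release); // unlock
}
```

Because the sleep only occurs after every cheaper stage has been tried, its cost is paid only under sustained contention; whether 1ms or a shorter value is the better choice would need to be measured on the target workload.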

dista commented 7 months ago

But the choice of 1ms seems to have no specific justification. Why 1ms and not 0.5ms?

Amanieu / parking_lot

low benchmark and cpu bottleneck issue #437