estraier / tkrzw

a set of implementations of DBM
Apache License 2.0

Can we make it use a third party mutex? #21

Closed orisano closed 3 years ago

orisano commented 3 years ago

Can we make it use a third party mutex?

I took a profile to find the bottleneck of tkrzw-go. It seems that shared_mutex is the bottleneck. I experimented with a third-party mutex and found that it was about 6 times faster.
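For context, swapping the lock type can be done without touching call sites if the mutex is hidden behind an alias. This is only an illustrative sketch (the class and names here are hypothetical, not tkrzw's actual code); the third-party header is assumed to come from the object_threadsafe library:

```cpp
#include <shared_mutex>
#include <string>

// Illustrative sketch only: an alias lets a build switch swap
// std::shared_timed_mutex for a third-party shared mutex such as
// sf::contention_free_shared_mutex without changing any call sites.
#if defined(USE_CONTENTION_FREE_MUTEX)
#include "safe_ptr.h"  // object_threadsafe library (assumed available)
using SharedMutex = sf::contention_free_shared_mutex<>;
#else
using SharedMutex = std::shared_timed_mutex;
#endif

class TinyStore {
 public:
  void Set(const std::string& value) {
    std::lock_guard<SharedMutex> lock(mutex_);   // exclusive: writer
    value_ = value;
  }
  std::string Get() const {
    std::shared_lock<SharedMutex> lock(mutex_);  // shared: many readers
    return value_;
  }
 private:
  mutable SharedMutex mutex_;
  std::string value_;
};
```

Because both lock types expose the same lock/lock_shared interface, the rest of the code is unaffected by the switch.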

std::shared_timed_mutex

$ ./perf --path casket.tkh --params "num_buckets=100000" --iter 2000000 --threads 5
path: casket.tkh
params: num_buckets=100000,truncate=true
num_iterations: 2000000
num_threads: 5
is_random: false

Setting:
.................................................. (00200000)
.................................................. (00400000)
.................................................. (00600000)
.................................................. (00800000)
.................................................. (01000000)
.................................................. (01200000)
.................................................. (01400000)
.................................................. (01600000)
.................................................. (01800000)
.................................................. (02000000)
Setting done: num_records=10000000 file_size=240401408 time=379.159 qps=26374 mem=255458279424

Getting:
.................................................. (00200000)
.................................................. (00400000)
.................................................. (00600000)
.................................................. (00800000)
.................................................. (01000000)
.................................................. (01200000)
.................................................. (01400000)
.................................................. (01600000)
.................................................. (01800000)
.................................................. (02000000)
Getting done: num_records=10000000 file_size=240401408 time=382.246 qps=26161 mem=255621857280

Removing:
.................................................. (00200000)
.................................................. (00400000)
.................................................. (00600000)
.................................................. (00800000)
.................................................. (01000000)
.................................................. (01200000)
.................................................. (01400000)
.................................................. (01600000)
.................................................. (01800000)
.................................................. (02000000)
Removing done: num_records=0 file_size=240401408 time=386.244 qps=25890 mem=255751880704

Using sf::contention_free_shared_mutex (from the object_threadsafe library)

$ ./perf --path casket.tkh --params "num_buckets=100000" --iter 2000000 --threads 5
path: casket.tkh
params: num_buckets=100000,truncate=true
num_iterations: 2000000
num_threads: 5
is_random: false

Setting:
.................................................. (00200000)
.................................................. (00400000)
.................................................. (00600000)
.................................................. (00800000)
.................................................. (01000000)
.................................................. (01200000)
.................................................. (01400000)
.................................................. (01600000)
.................................................. (01800000)
.................................................. (02000000)
Setting done: num_records=10000000 file_size=240401408 time=58.734 qps=170260 mem=255558942720

Getting:
.................................................. (00200000)
.................................................. (00400000)
.................................................. (00600000)
.................................................. (00800000)
.................................................. (01000000)
.................................................. (01200000)
.................................................. (01400000)
.................................................. (01600000)
.................................................. (01800000)
.................................................. (02000000)
Getting done: num_records=10000000 file_size=240401408 time=53.676 qps=186304 mem=255693160448

Removing:
.................................................. (00200000)
.................................................. (00400000)
.................................................. (00600000)
.................................................. (00800000)
.................................................. (01000000)
.................................................. (01200000)
.................................................. (01400000)
.................................................. (01600000)
.................................................. (01800000)
.................................................. (02000000)
Removing done: num_records=0 file_size=240401408 time=54.733 qps=182706 mem=255831572480

Environment

  Model Name:   MacBook Pro
  Model Identifier: MacBookPro16,2
  Processor Name:   Quad-Core Intel Core i7
  Processor Speed:  2.3 GHz
  Number of Processors: 1
  Total Number of Cores:    4
  L2 Cache (per Core):  512 KB
  L3 Cache: 8 MB
  Hyper-Threading Technology:   Enabled
  Memory:   32 GB
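For reference, the QPS figures above come from multithreaded runs; a minimal harness in the same spirit (hypothetical names, not the actual perf tool) could look like this:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical harness, not tkrzw's perf tool: each thread performs
// num_iter shared (reader-side) acquisitions, and aggregate QPS is
// total operations divided by wall-clock time.
int64_t MeasureQps(int num_threads, int num_iter) {
  std::shared_timed_mutex mutex;
  std::atomic<int64_t> sink{0};  // stand-in for the actual record lookup
  const auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; i++) {
    threads.emplace_back([&]() {
      for (int j = 0; j < num_iter; j++) {
        std::shared_lock<std::shared_timed_mutex> lock(mutex);
        sink.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& thread : threads) thread.join();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return static_cast<int64_t>(num_threads * num_iter / elapsed.count());
}
```

Replacing the std::shared_timed_mutex here with the third-party type is enough to compare the two lock implementations under identical load.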
estraier commented 3 years ago

Thanks for the interesting suggestion. I hadn't thought of switching the standard shared mutex to a third-party one. I ran your code myself on Linux (Intel Core i7-8550U 1.8 GHz) and MacOS (Apple M1 3.2 GHz). I stored 100,000,000 records with 4 threads (25,000,000 each).

On Linux, the original 1,567,278 QPS became 1,810,241 QPS. So, 1.15x faster. On MacOS, the original 919,984 QPS became 2,872,685 QPS. So, 3.12x faster.

The difference on Linux is smaller than on MacOS. Still, a 3.12x boost is amazing and a 1.15x boost is non-negligible.

Thus, I should seriously consider this direction. Probably, I'll add a configure switch to enable injection of third party mutexes.

=== logs ===

On Linux, vanilla tkrzw:

$ ./tkrzw_dbm_perf sequence --path casket.tkh --buckets 200000000 --iter 25000000 --threads 4 --set_only
(snip)
Setting done: elapsed_time=63.804888 num_records=100000000 qps=1567278 mem=3128264000 file_size=3200002048 eff_data_size=1600000000 efficiency=50.00% num_buckets=200000033 load_factor=0.50

On Linux, tkrzw-feat-use-contention-free-shared-mutex:

$ ./tkrzw_dbm_perf sequence --path casket.tkh --buckets 200000000 --iter 25000000 --threads 4 --set_only
(snip)
Setting done: elapsed_time=55.241269 num_records=100000000 qps=1810241 mem=3128132000 file_size=3200002048 eff_data_size=1600000000 efficiency=50.00% num_buckets=200000033 load_factor=0.50

On MacOS, vanilla tkrzw:

$ ./tkrzw_dbm_perf sequence --path casket.tkh --iter 25000000 --buckets 200000000 --threads 4 --set_only
(snip)
Synchronizing: ... done (elapsed=0.000573)
Setting done: elapsed_time=108.697516 num_records=100000000 qps=919984 mem=3277328482304 file_size=3200002048 eff_data_size=1600000000 efficiency=50.00% num_buckets=200000033 load_factor=0.50

On MacOS, tkrzw-feat-use-contention-free-shared-mutex:

$ ./tkrzw_dbm_perf sequence --path casket.tkh --iter 25000000 --buckets 200000000 --threads 4 --set_only
(snip)
Synchronizing: ... done (elapsed=0.000352)
Setting done: elapsed_time=34.810642 num_records=100000000 qps=2872685 mem=3277362036736 file_size=3200002048 eff_data_size=1600000000 efficiency=50.00% num_buckets=200000033 load_factor=0.50

estraier commented 3 years ago

Tkrzw 0.9.50 has been released. It uses my own spin lock implementation, which boasts comparable performance to object_threadsafe's. As a result, performance is 3x faster on MacOS and 1.3x faster on Linux. This also benefits other language bindings, including Go.

orisano commented 3 years ago

Thanks!