orisano closed this issue 3 years ago
Thanks for the interesting suggestion. I hadn't thought of switching the standard shared mutex to a third-party one. I ran your code myself on Linux (Intel Core i7-8550U 1.8 GHz) and MacOS (Apple M1 3.2 GHz). I stored 100,000,000 records with 4 threads (25,000,000 each).
On Linux, the original 1,567,278 QPS became 1,810,241 QPS. So, 1.15x faster. On MacOS, the original 919,984 QPS became 2,872,685 QPS. So, 3.12x faster.
The difference on Linux is smaller than on MacOS. Still, a 3.12x boost is amazing and a 1.15x boost is non-negligible.
Thus, I should seriously consider this direction. Probably, I'll add a configure switch to enable injection of third-party mutexes.
=== logs ===

On Linux vanilla-tkrzw:
$ ./tkrzw_dbm_perf sequence --path casket.tkh --buckets 200000000 --iter 25000000 --threads 4 --set_only
(snip)
Setting done: elapsed_time=63.804888 num_records=100000000 qps=1567278 mem=3128264000 file_size=3200002048 eff_data_size=1600000000 efficiency=50.00% num_buckets=200000033 load_factor=0.50

On Linux tkrzw-feat-use-contention-free-shared-mutex:
$ ./tkrzw_dbm_perf sequence --path casket.tkh --buckets 200000000 --iter 25000000 --threads 4 --set_only
(snip)
Setting done: elapsed_time=55.241269 num_records=100000000 qps=1810241 mem=3128132000 file_size=3200002048 eff_data_size=1600000000 efficiency=50.00% num_buckets=200000033 load_factor=0.50

On MacOS vanilla-tkrzw:
$ ./tkrzw_dbm_perf sequence --path casket.tkh --iter 25000000 --buckets 200000000 --threads 4 --set_only
(snip)
Synchronizing: ... done (elapsed=0.000573)
Setting done: elapsed_time=108.697516 num_records=100000000 qps=919984 mem=3277328482304 file_size=3200002048 eff_data_size=1600000000 efficiency=50.00% num_buckets=200000033 load_factor=0.50

On MacOS tkrzw-feat-use-contention-free-shared-mutex:
setagaya[~/Downloads/]$ ./tkrzw_dbm_perf sequence --path casket.tkh --iter 25000000 --buckets 200000000 --threads 4 --set_only
(snip)
Synchronizing: ... done (elapsed=0.000352)
Setting done: elapsed_time=34.810642 num_records=100000000 qps=2872685 mem=3277362036736 file_size=3200002048 eff_data_size=1600000000 efficiency=50.00% num_buckets=200000033 load_factor=0.50
Tkrzw 0.9.50 has been released. It uses my own spin lock implementation, which boasts performance comparable to object_threadsafe. As a result, performance is 3x faster on MacOS and 1.3x faster on Linux. It also benefits bindings for other languages, including Go.
Thanks!
Can we make it use a third party mutex?
I took a profile to find the bottleneck of tkrzw-go. It seems that shared_mutex is the bottleneck. I experimented with a third-party mutex and found it was 6 times faster.
std::shared_timed_mutex
use sf::contention_free_shared_mutex (here)
Environment