Closed dumblob closed 3 years ago
It's great that you want to help me improve something in libmdbx. Nonetheless, if you want to play these games with useful, then you should (not be naive, but) look deeper and more carefully into libmdbx.
The third case uses custom ad-hoc algo based on radixsort. This is a very cold code (i.e. it is executed extremely rarely), and the specifics handling is much more important than the speed of the sorting itself. I don't think that this can be improved, especially anything related to sorting, sine a structure of the algorithm is extremely close to the theoretical limit.
The first two cases uses:
CMOV
and VLIW;There are known nothing could "beat" radix sort for large chunks and optimal sort network (depth-optimal) for small ones.
So sorting in libmdbx can lose out only for specific chunks sizes if quadsort will be faster than my implementation based on quicksort with all the tricks used.
I will play with quadsort
(for instance with this simple benchmark of sorting).
However, (preliminary) for now I don't see any reasons to libmdbx's sorting be noticeably slower than any other sorting (in the libmdbx's internal use cases).
A recursive merge sort isn't used, but a single-pass merge (with only a single branch on modern CPUs) is used for lazy sort of unsorted tail of dirty page list. I.e. if we have a unsorted tail which is ≈four times less than sorted part of a list, and a whole list size is not reasonable for radix sort, then we sort the tail and merge it with sorted part at be begin of dirty pages list. This is known to be the best solution, provided that the thresholds (for switching to radix sort and using sorting of tail with merging instead of full sorting) are optimally selected. The optimal choice of such thresholds is quite a difficult task, since it fundamentally depends on the characteristics of a target platform, quality of compiler's optimization and the nature of a data (the ratio of the cost of comparison to movement ones). So we should hardcode some not-a-bad/suboptimal values, either run a benchmark to choice ones during configure the library build.
In addition to the above, it is worth noting that it is very likely that in libmdbx it is reasonable reducing the sorting-related thresholds from 16 to 12. For instance, seems the network sort width should be limited to 12 which reduces network deep from 9 to 8. This will reduce the size of the code, but (presumably) will not have a noticeable effect on the sorting speed.
Wow, thanks a lot for taking the time and describing in depth how libmdbx approaches sorting as of now.
Yes I've looked at the code just extremely briefly and this was the first idea which popped in my head. Of course it's all about real benchmarks and not synthetic ones. I just wanted to share those links as I find them kind of matching your goals to build a very fast DB.
Feel free to close this if you think there are no actionable items (though I'd very much like to see some real measurement results if you'll try e.g. quadsort or faster than usual bitonic sorters bubble sort for small inputs in libmdbx).
... faster than usual bitonic sorters bubble sort for small inputs ...
Just for clarity:
MDBX currently uses well-known optimal sorting networks (with the 3-16 entries), which are targeting CPU with large register file and many of execution units. This is good for E2K (256 registers, 8 ops/clk) and not-bad fpr modern x86_64 CPUs (32-64 registers, 4-6 ops/clk).
This is good for E2K (256 registers, 8 ops/clk) and not-bad fpr modern x86_64 CPUs (32-64 registers, 4-6 ops/clk).
Interesting - didn't know libmdbx is being tested on Elbrus CPUs. Does it run somewhere in production? If you have any links, please post them, thanks!
No significant differences. It just works.
user@elbrus ~ $ lscpu
Архитектура: e2k
Порядок байт: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Ядер на сокет: 8
Сокетов: 1
NUMA node(s): 1
ID прроизводителя: Elbrus-MCST
Семейство ЦПУ: 5
Модель: 2
Имя модели: E8C2
CPU MHz: 1500
BogoMIPS: 3000.00
L1d cache: 64K
L1i cache: 128K
L2 cache: 512K
L3 cache: 16384K
L3 cache: 16384K
NUMA node0 CPU(s): 0-7
user@elbrus ~ $ cc --version
lcc:1.25.17:May-16-2021:e2k-v5-linux
gcc (GCC) 7.3.0 compatible
user@elbrus ~ $ cd libmdbx/
user@elbrus libmdbx $ make test
// TIP: Use `make V=1` for verbose.
// TIP: Clone and build the https://github.com/pmwkaa/ioarena.git within a neighbouring directory for availability of benchmarking.
MDBX_BUILD_OPTIONS = -DNDEBUG=1
MDBX_BUILD_TIMESTAMP = 2021-09-06T14:29:12+0300
// TIP: Use `make options` to listing available build options.
CC =/usr/bin/cc | lcc:1.25.17:May-16-2021:e2k-v5-linux
CFLAGS =-std=gnu11 -O2 -g -Wall -Werror -Wextra -Wpedantic -ffunction-sections -fPIC -fvisibility=hidden -pthread -Wno-error=attributes
CXXFLAGS =-std=gnu++2a -O2 -g -Wall -Werror -Wextra -Wpedantic -ffunction-sections -fPIC -fvisibility=hidden -pthread -Wno-error=attributes
LDFLAGS =-Wl,--gc-sections,-z,relro,-O1 -lm -lrt -pthread
// TIP: Use `make help` to listing available targets.
CLEAN for build with specified flags... Ok
MAKE src/version.c
MAKE src/config.h
CC mdbx-static.o
CC mdbx++-static.o
AR libmdbx.a
CC mdbx-dylib.o
CC mdbx++-dylib.o
LD libmdbx.so
CC+LD mdbx_stat
CC+LD mdbx_copy
CC+LD mdbx_dump
CC+LD mdbx_load
CC+LD mdbx_chk
CC+LD mdbx_drop
CC+LD mdbx_example
CC test/osal-unix.o
CC test/hill.o
CC test/keygen.o
CC test/copy.o
CC test/nested.o
CC test/jitter.o
CC test/main.o
CC test/test.o
CC test/cases.o
CC test/try.o
CC test/ttl.o
CC test/config.o
CC test/append.o
CC test/utils.o
CC test/log.o
CC test/chrono.o
CC test/dead.o
LD mdbx_test
RUNNING `test/long_stochastic.sh --loops 2`...
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
...
user@elbrus libmdbx $
It seems the sorting algorithm currently used is some variant of merge sort. If someone uses sorted outputs a lot then the complexity and practical limitations of the sorting algorithm start to matter a lot (I didn't do any benchmarking so I can't tell how big the difference will be in case of libdbmx).
There is Quadsort which is even a tiny bit faster than quicksort with identical algorithmic properties except for stability where Quadsort is stable.
For low-memory embedded systems there is a fixed-mem variant of Quadsort called Blitsort.
Btw. there are other gems like https://github.com/gerben-s/quicksort-blog-post/blob/master/blog.md but they're mostly not stable :cry:.