Sorting algorithms for even better performance

dumblob commented 3 years ago

It seems the sorting algorithm currently used is some variant of merge sort. If someone uses sorted outputs a lot then the complexity and practical limitations of the sorting algorithm start to matter a lot (I didn't do any benchmarking so I can't tell how big the difference will be in case of libdbmx).

There is Quadsort which is even a tiny bit faster than quicksort with identical algorithmic properties except for stability where Quadsort is stable.

For low-memory embedded systems there is a fixed-mem variant of Quadsort called Blitsort.

Btw. there are other gems like https://github.com/gerben-s/quicksort-blog-post/blob/master/blog.md but they're mostly not stable :cry:.

erthink commented 3 years ago

It's great that you want to help me improve something in libmdbx. Nonetheless, if you want to play these games with useful, then you should (not be naive, but) look deeper and more carefully into libmdbx.

There are three variants of sorting inside libmdbx:

Sorting of page number lists, i.e. sorting of unique 32-bit integers.
Sorting of dirty page list, i.e. sorting of short structs with unique 32-bit integer keys.
Partial relaxed sorting by "weight" for "first N" of subset dirty pages to spilling/outing, i.e. partial-sort with extraction from a set by functional/calculated keys.

The third case uses custom ad-hoc algo based on radixsort. This is a very cold code (i.e. it is executed extremely rarely), and the specifics handling is much more important than the speed of the sorting itself. I don't think that this can be improved, especially anything related to sorting, sine a structure of the algorithm is extremely close to the theoretical limit.

The first two cases uses:

for large (>= 333) chunks: radixsort with tricks to minimize iterations and avoid extra copying;
for small (<= 16) chunks: optimal sort network (aka bitonic sorter) with some tricks for CMOV and VLIW;
for other chunk sizes: state-of-the-art tricked quicksort.

There are known nothing could "beat" radix sort for large chunks and optimal sort network (depth-optimal) for small ones. So sorting in libmdbx can lose out only for specific chunks sizes if quadsort will be faster than my implementation based on quicksort with all the tricks used. I will play with quadsort (for instance with this simple benchmark of sorting). However, (preliminary) for now I don't see any reasons to libmdbx's sorting be noticeably slower than any other sorting (in the libmdbx's internal use cases).

A recursive merge sort isn't used, but a single-pass merge (with only a single branch on modern CPUs) is used for lazy sort of unsorted tail of dirty page list. I.e. if we have a unsorted tail which is ≈four times less than sorted part of a list, and a whole list size is not reasonable for radix sort, then we sort the tail and merge it with sorted part at be begin of dirty pages list. This is known to be the best solution, provided that the thresholds (for switching to radix sort and using sorting of tail with merging instead of full sorting) are optimally selected. The optimal choice of such thresholds is quite a difficult task, since it fundamentally depends on the characteristics of a target platform, quality of compiler's optimization and the nature of a data (the ratio of the cost of comparison to movement ones). So we should hardcode some not-a-bad/suboptimal values, either run a benchmark to choice ones during configure the library build.

In addition to the above, it is worth noting that it is very likely that in libmdbx it is reasonable reducing the sorting-related thresholds from 16 to 12. For instance, seems the network sort width should be limited to 12 which reduces network deep from 9 to 8. This will reduce the size of the code, but (presumably) will not have a noticeable effect on the sorting speed.

dumblob commented 3 years ago

Wow, thanks a lot for taking the time and describing in depth how libmdbx approaches sorting as of now.

Yes I've looked at the code just extremely briefly and this was the first idea which popped in my head. Of course it's all about real benchmarks and not synthetic ones. I just wanted to share those links as I find them kind of matching your goals to build a very fast DB.

Feel free to close this if you think there are no actionable items (though I'd very much like to see some real measurement results if you'll try e.g. quadsort or faster than usual bitonic sorters bubble sort for small inputs in libmdbx).

erthink commented 3 years ago

... faster than usual bitonic sorters bubble sort for small inputs ...

Just for clarity:

Main good property of bubble sort in comparison to binotic sort (the kind of sorting network) is it can exit/stops early. So some kinds of bubble sort may be a fastest sorting, especially for a sorted data ;)
Contrary, sorting network stands for a fixed number of comparisons and could be branch-less using CMOV instruction. Moreover, an several kind of (sub)optimal sorting networks can be constructed which performs a minimal number of operations, maximal number of operations in parallels, or with an ability of early exit like an unrolled buble sort.
So sorting network could beat any kind of buble sort, but it should be properly constructed for target platform (for the particular number of registers and number of parallel executed operations).

MDBX currently uses well-known optimal sorting networks (with the 3-16 entries), which are targeting CPU with large register file and many of execution units. This is good for E2K (256 registers, 8 ops/clk) and not-bad fpr modern x86_64 CPUs (32-64 registers, 4-6 ops/clk).

dumblob commented 3 years ago

This is good for E2K (256 registers, 8 ops/clk) and not-bad fpr modern x86_64 CPUs (32-64 registers, 4-6 ops/clk).

Interesting - didn't know libmdbx is being tested on Elbrus CPUs. Does it run somewhere in production? If you have any links, please post them, thanks!

erthink commented 3 years ago

No significant differences. It just works.

user@elbrus ~ $ lscpu
Архитектура:         e2k
Порядок байт:        Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Ядер на сокет:       8
Сокетов:             1
NUMA node(s):        1
ID прроизводителя:   Elbrus-MCST
Семейство ЦПУ:       5
Модель:              2
Имя модели:          E8C2
CPU MHz:             1500
BogoMIPS:            3000.00
L1d cache:           64K
L1i cache:           128K
L2 cache:            512K
L3 cache:            16384K
L3 cache:            16384K
NUMA node0 CPU(s):   0-7

user@elbrus ~ $ cc --version
lcc:1.25.17:May-16-2021:e2k-v5-linux
gcc (GCC) 7.3.0 compatible

user@elbrus ~ $ cd libmdbx/
user@elbrus libmdbx $ make test
// TIP: Use `make V=1` for verbose.
// TIP: Clone and build the https://github.com/pmwkaa/ioarena.git within a neighbouring directory for availability of benchmarking.
  MDBX_BUILD_OPTIONS   = -DNDEBUG=1
  MDBX_BUILD_TIMESTAMP = 2021-09-06T14:29:12+0300
// TIP: Use `make options` to listing available build options.
  CC       =/usr/bin/cc | lcc:1.25.17:May-16-2021:e2k-v5-linux
  CFLAGS   =-std=gnu11 -O2 -g -Wall -Werror -Wextra -Wpedantic -ffunction-sections -fPIC -fvisibility=hidden -pthread -Wno-error=attributes 
  CXXFLAGS =-std=gnu++2a -O2 -g -Wall -Werror -Wextra -Wpedantic -ffunction-sections -fPIC -fvisibility=hidden -pthread -Wno-error=attributes
  LDFLAGS  =-Wl,--gc-sections,-z,relro,-O1 -lm -lrt -pthread
// TIP: Use `make help` to listing available targets.
  CLEAN for build with specified flags... Ok
  MAKE src/version.c
  MAKE src/config.h
  CC mdbx-static.o
  CC mdbx++-static.o
  AR libmdbx.a
  CC mdbx-dylib.o
  CC mdbx++-dylib.o
  LD libmdbx.so
  CC+LD mdbx_stat
  CC+LD mdbx_copy
  CC+LD mdbx_dump
  CC+LD mdbx_load
  CC+LD mdbx_chk
  CC+LD mdbx_drop
  CC+LD mdbx_example
  CC test/osal-unix.o
  CC test/hill.o
  CC test/keygen.o
  CC test/copy.o
  CC test/nested.o
  CC test/jitter.o
  CC test/main.o
  CC test/test.o
  CC test/cases.o
  CC test/try.o
  CC test/ttl.o
  CC test/config.o
  CC test/append.o
  CC test/utils.o
  CC test/log.o
  CC test/chrono.o
  CC test/dead.o
  LD mdbx_test
  RUNNING `test/long_stochastic.sh --loops 2`...
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
...
user@elbrus libmdbx $

erthink / libmdbx

Sorting algorithms for even better performance #220

There are three variants of sorting inside libmdbx: