Morwenn / cpp-sort

Sorting algorithms & related tools for C++14
MIT License
617 stars 57 forks source link

Audit the library for branchless comparisons and projections #177

Closed Morwenn closed 3 years ago

Morwenn commented 3 years ago

The type traits is_probably_branchless_comparison and is_probably_branchless_projection are currently used to assert that a comparison or a projection are most likely branchless. So far there are relevant specializations for the standard library comparators as well as for the function objects from <utility/functional.h> and <utility/comparators/*.h>, but the library uses a lot of custom comparison and projection objects in its implementation details, and those do not have specializations for the traits.

Currently a single algorithm benefits from those traits: pdqsort. However the branchless partitioning algorithm used by pdqsort is pretty powerful, and pdqsort is used as the default random-access sort used by a lot of algorithms that need a fallback. This means that ensuring that the library plays smoothly with the branchless information might have positive repercussions on the following components:

For algorithms that only ever handle built-in types such as spread_sorter, it might make a huge difference in some cases. The other case where I expect noticeable improvements is with adapters that need to tweak comparison or projection functions without making them branchful. A prime thing to test is the speed difference when using pointer to data members: those are always considered branchless, so sorting algorithms that take projections such as &Foobar::size could see big improvements for virtually no cost - this specific case highlights that the return type of utility::as_function is surely a major component to target.

Morwenn commented 3 years ago

This is the difference between calling pdq_sort on a std::vector with 10^7 element with a pointer to a long double data member wrapped in utility::as_function as a projection:

patterns benchmarks over pdq_sort 1.8.1 vs. 1.9.0

We can see that the results are widely different between the versions, which reflects the difference in the partitioning scheme picked up when using std::mem_fn(/* pointer to fundamental data member */). The slower patterns are a bit troubling, but the "shuffled" case is optimized, which is what matters the most here.