Possible improvements to grouping and sorting performance

group_by_key

The performance of group_by_key sometimes suffers on multi-socket machines, despite scaling fine on a single socket. Its implementation could possibly be improved. group_by_key_sorted does not seem to have this issue, so perhaps it could be used as a fallback to make group_by_key faster.

sample_sort

The example implementation of sample sort in the example code here is sometimes (but not always) faster than the actual sample sort implementation in the library, despite being substantially simpler. We could improve the library implementation by borrowing some of the example code.

cmuparlay / parlaylib

Possible improvements to grouping and sorting performance #56

group_by_key

sample_sort